resilience-failure

This skill should be used when the user asks about "fault tolerance", "resilience", a "circuit breaker", "graceful degradation", "retry storm" or "thundering herd on recovery", "exponential backoff with jitter", "timeout", "bulkhead", a "single point of failure" (SPOF), "failover", or "rate limiting" (token bucket / leaky bucket / sliding window). Use it whenever a design must keep working through node crashes, slow dependencies, traffic spikes, or partial outages — i.e. any time the answer to "what happens when this breaks?" is missing, even if the user doesn't say "resilience".
proyecto26/system-design-skills · ★ 6 · AI & Automation · score 76

Install: claude install-skill proyecto26/system-design-skills

# Resilience & Failure

Design the system so that when a part breaks — and it will — the failure is contained and the user still gets a useful (if degraded) answer instead of an error page or a cascading outage. Getting this wrong is the difference between a slow dependency and a total meltdown: the most common amplifier of an outage is the system's own reaction to it (retry storms, health-check stampedes).

## When to reach for this

Any design with a remote dependency, a shared resource, or an SLA. Reach here to find single points of failure, decide what each call does when its dependency is slow or down, protect a service from being overwhelmed (rate limiting), and plan how a recovered service comes back without being crushed by the backlog.

## When NOT to

Don't wrap a single in-process function or a best-effort batch job in circuit breakers and bulkheads — that's machinery for cross-process/cross-network calls (YAGNI). Don't add retries to a non-idempotent write without an idempotency key first (→ `api-design`) — you'll duplicate side effects. The cheapest design that meets the availability target wins; chasing an extra nine you don't need costs real complexity (→ `back-of-the-envelope` for what a nine actually buys).

## Clarify first

- **Availability target** — how many nines, and is it per-request or per-feature? (→ `back-of-the-envelope`.)
- **Blast radius** — if this dependency dies, must the whole request fail, or can the feature degrade or hide?
- **Idempotency** —