Install: claude install-skill proyecto26/system-design-skills
# Resilience & Failure
Design the system so that when a part breaks — and it will — the failure is
contained and the user still gets a useful (if degraded) answer instead of an
error page or a cascading outage. Getting this wrong is the difference between a
slow dependency and a total meltdown: the most common amplifier of an outage is
the system's own reaction to it (retry storms, health-check stampedes).
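
A retry storm usually comes from clients retrying immediately and in lockstep. A minimal sketch of one common countermeasure, capped exponential backoff with full jitter and a bounded attempt count, is shown here. The names `backoff_delays` and `call_with_retries`, the default values, and the choice to retry only on `ConnectionError` are assumptions made for illustration, not something this skill prescribes.

```python
import random
import time


def backoff_delays(max_attempts=4, base=0.1, cap=2.0, rng=random.random):
    """Yield the sleep before each retry.

    Full jitter: each delay is drawn uniformly from [0, window), where the
    window doubles per attempt and is capped, so clients spread out instead
    of retrying in lockstep.
    """
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(fn, max_attempts=4, base=0.1, cap=2.0, sleep=time.sleep):
    """Call fn, retrying on ConnectionError up to max_attempts total calls."""
    delays = backoff_delays(max_attempts, base, cap)
    while True:
        try:
            return fn()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the failure
            sleep(delay)
```

The bounded attempt count matters as much as the jitter: unbounded retries turn a brief blip into sustained extra load on a dependency that is already struggling.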
## When to reach for this
Any design with a remote dependency, a shared resource, or an SLA. Use this
skill to find single points of failure, decide what each call does when its
dependency is slow or down, protect a service from being overwhelmed (rate
limiting), and plan how a recovered service comes back without being crushed by
the backlog.
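
As a rough illustration of deciding what a call does when its dependency is down, here is a minimal single-threaded circuit breaker with a fallback. The class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_after`, and the injectable `clock` are hypothetical choices for this sketch. A production breaker would also need locking and a limit on concurrent probes.

```python
import time


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures, then probe cautiously."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cooldown in seconds before a probe
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the dependency entirely and degrade.
                return fallback()
            # Half-open: cooldown elapsed, let this call through as a probe.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the breaker, or re-open after a failed probe,
                # and restart the cooldown.
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller might write `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a placeholder for the remote call: the feature degrades to an empty list instead of failing the whole request, and the recovering dependency sees occasional probes rather than the full backlog at once.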
## When NOT to
Don't wrap a single in-process function or a best-effort batch job in circuit
breakers and bulkheads — that's machinery for cross-process/cross-network calls
(YAGNI). Don't add retries to a non-idempotent write without an idempotency key
first (→ `api-design`) — you'll duplicate side effects. The cheapest design that
meets the availability target wins; chasing an extra nine you don't need costs
real complexity (→ `back-of-the-envelope` for what a nine actually buys).
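
To make the cost of an extra nine concrete, the arithmetic can be sketched as below. The helper name `downtime_minutes_per_year` is hypothetical, and the calculation assumes a 365-day year and treats availability as a simple fraction of wall-clock time.

```python
def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines.

    For example, nines=3 means 99.9% availability, i.e. 0.1% unavailability.
    """
    unavailability = 10 ** -nines
    minutes_per_year = 365 * 24 * 60  # 525,600
    return unavailability * minutes_per_year


for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Three nines allows roughly 8.8 hours of downtime a year and four nines roughly 53 minutes, so each extra nine cuts the budget by a factor of ten.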
## Clarify first
- **Availability target** — how many nines, and is it per-request or per-feature? (→ `back-of-the-envelope`.)
- **Blast radius** — if this dependency dies, must the whole request fail, or can the feature degrade or hide?
- **Idempotency** —