Install: claude install-skill lugassawan/swe-workbench
# Resiliency
Distributed systems fail partially, not totally. Resilience is the discipline of staying useful when components, networks, or dependencies degrade.
## Failure Domains
A failure domain is the set of components that fail together. Name failure domains before designing for them — unnamed domains produce unnamed blast radii.
- **Crash failure** — process exits; usually detected quickly by the load balancer or orchestrator through failed health checks or refused connections.
- **Slow failure** — process responds but takes too long; the most dangerous mode. The caller's threads and connections fill up while they wait, so the caller eventually fails too.
- **Gray/Byzantine failure** — process returns wrong data or errors intermittently; hardest to detect.
- **Partial failure** — some instances or shards fail while others serve normally.
Cascading failure: a degraded dependency holds resources long enough that the caller exhausts its own pools, which propagates the failure upstream. The root cause is almost always unbounded resource sharing across failure domains.
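
The mechanism can be illustrated with a minimal sketch, assuming a caller that routes every downstream call through one shared pool. The `SharedPool` class and the pool size of 2 are hypothetical, chosen only to make the exhaustion visible; the sketch models slots with a semaphore rather than real connections.

```python
import threading


class SharedPool:
    """One bounded pool shared by every downstream call (the anti-pattern)."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)

    def try_acquire(self):
        # Non-blocking: returns False when no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


pool = SharedPool(size=2)  # hypothetical size, for illustration only

# Two requests to a slow dependency take slots and have not returned yet.
held_by_slow_calls = [pool.try_acquire() for _ in range(2)]

# A request to a healthy dependency now finds the shared pool empty.
healthy_call_admitted = pool.try_acquire()

print(held_by_slow_calls)     # [True, True]
print(healthy_call_admitted)  # False
```

The healthy dependency is rejected even though nothing is wrong with it, because it shares a resource pool with the slow one.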
## Bulkheads
Named after ship compartments: isolate resource pools so a breach in one dependency does not exhaust resources for all others.
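
A minimal sketch of the compartment idea, assuming one bounded semaphore per dependency and fail-fast rejection when a compartment is full. The names `Bulkhead`, `BulkheadFull`, `call`, the dependency keys, and the limits of 2 are all hypothetical and not taken from any particular library.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a dependency's compartment has no free slot."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing behind a degraded dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()


# One isolated compartment per dependency; the limits are illustrative.
bulkheads = {
    "payments_api": Bulkhead("payments_api", max_concurrent=2),
    "database": Bulkhead("database", max_concurrent=2),
}


def call(dependency, fn):
    """Run fn while holding a slot in that dependency's compartment."""
    with bulkheads[dependency].slot():
        return fn()
```

With this shape, saturating `payments_api` raises `BulkheadFull` for further `payments_api` calls only, while `call("database", ...)` still finds free slots in its own compartment. Rejecting immediately is one design choice; a bounded wait on the semaphore is a reasonable alternative.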
- Allocate a separate bounded connection pool, semaphore, or thread pool per downstream dependency.
- Never share a single pool across unrelated dependencies — a slow third-party API must not starve database connections.
- In multi-tenant systems, partition queues or workers per tenant to contain noisy-neighbor starvation.
- Size each bulkhead to the dependency's