principle-resiliency

Resiliency principles — fault tolerance, resilience, partial failure, blast radius, failure domains, bulkheads, resource isolation, graceful degradation, fail-fast vs fail-soft, health checks, liveness vs readiness probes, cascading failure, gray failure, fault isolation, idempotency, idempotency keys, safe retry, deduplication, exactly-once, rate limiting, throttling, quotas, backpressure, token bucket, leaky bucket, jitter, thundering herd, synchronized retries, 429. Auto-load when designing for partial failure, isolating dependencies via bulkheads, planning graceful degradation, choosing fail-fast vs fail-soft, configuring health/readiness/liveness probes, evaluating cascading failure risk, designing fallback paths, reviewing system-level fault tolerance, making operations safely retryable with idempotency keys, implementing dedup stores, designing rate limiters or token buckets, handling 429 responses and backpressure, adding jitter to avoid thundering herd retry storms and synchronized retry spikes, or p
lugassawan/swe-workbench · ★ 3 · Code & Development · score 69

Install: claude install-skill lugassawan/swe-workbench

# Resiliency

Distributed systems fail partially, not totally. Resilience is the discipline of staying useful when components, networks, or dependencies degrade.

## Failure Domains

A failure domain is the set of components that fail together. Name failure domains before designing for them — unnamed domains produce unnamed blast radii.

- **Crash failure** — process exits; detectable immediately by the load balancer or orchestrator.
- **Slow failure** — process responds but takes too long; the most dangerous mode. Threads and connections fill; the caller eventually crashes too.
- **Gray/Byzantine failure** — process returns wrong data or errors intermittently; hardest to detect.
- **Partial failure** — some instances or shards fail while others serve normally.

Cascading failure: a degraded dependency holds resources long enough that the caller exhausts its own pools, propagating failure upstream. Root cause is almost always unbounded resource sharing across failure domains.

## Bulkheads

Named after ship compartments: isolate resource pools so a breach in one dependency does not exhaust resources for all others.

- Allocate a separate bounded connection pool, semaphore, or thread pool per downstream dependency.
- Never share a single pool across unrelated dependencies — a slow third-party API must not starve database connections.
- In multi-tenant systems, partition queues or workers per tenant to contain noisy-neighbor starvation.
- Size each bulkhead to the dependency's