managing-incidentslisted
Install: claude install-skill ancoleman/ai-design-components
# Incident Management
Provide end-to-end incident management guidance covering detection, response, communication, and learning. Emphasizes SRE culture, blameless post-mortems, and structured processes for high-reliability operations.
## When to Use This Skill
Apply this skill when:
- Setting up incident response processes for a team
- Designing on-call rotations and escalation policies
- Creating runbooks for common failure scenarios
- Conducting blameless post-mortems after incidents
- Implementing incident communication protocols (internal and external)
- Choosing incident management tooling and platforms
- Improving MTTR and incident frequency metrics
## Core Principles
### Incident Management Philosophy
**Declare Early and Often:** Do not wait for certainty. Declaring an incident enables coordination, can be downgraded if needed, and prevents delayed response.
**Mitigation First, Root Cause Later:** Stop customer impact immediately (rollback, disable feature, failover). Debug and fix root cause after stability restored.
**Blameless Culture:** Assume good intentions. Focus on how systems failed, not who failed. Create psychological safety for honest learning.
**Clear Command Structure:** Assign Incident Commander (IC) to own coordination. IC delegates tasks but does not do hands-on debugging.
**Communication is Critical:** Internal coordination via dedicated channels, external transparency via status pages. Update stakeholders every 15-30 minutes during critical