← ClaudeAtlas

nexus-reliabilitylisted

Use for production incidents, outages, and service degradations, plus release-readiness gate checks. Trigger on 500 spikes, SLA breaches, latency/error surges, on-call alerts, rollback decisions, and RCA/post-mortem requests. Prioritize stabilization, timeline reconstruction, and evidence-led actions. When in doubt, use this skill.
aayushostwal/nexus · ★ 10 · AI & Automation · score 76
Install: claude install-skill aayushostwal/nexus
# Nexus Production Incident Investigator Time-boxed, evidence-driven workflow for investigating production incidents and evaluating release readiness. --- ## Compatibility - Sub-skills: `release-readiness.md` | Materials: `checklists/incident-checklist.md`, `anti-patterns/common-mistakes.md`, `validation/output-validation.md` - Required: Read, Bash, Grep | Optional: WebSearch | Hands off to: `nexus:debugging` (code RCA), `nexus:planning` (post-mortem actions) --- ## Core Principle **Stabilize first. Investigate second. Never skip the timeline.** --- ## Incident Classification | Severity | Definition | Response target | |----------|-----------|-----------------| | **P0** | Complete outage or data loss affecting all/most users | Immediate; page all on-call | | **P1** | Major feature broken or >25% requests failing | < 5 min acknowledgement; page on-call lead | | **P2** | Degraded performance or partial failure | < 15 min acknowledgement; one engineer | | **P3** | Minor anomaly, no user impact | Next business hour | If you cannot classify immediately, default to P1 and downgrade after Phase 1. --- ## Investigation Workflow ### Phase 1 — Triage (0–5 min) 1. **Confirm the signal** — cross-check alert against at least one independent source (logs, dashboard, manual repro). Never declare from a single data point. 2. **Classify severity** using the table above. 3. **Identify blast radius** — which services, user segments, and whether data integrity is at risk. 4. **Open