nexus-observability

Use for correlated API failures, cascading errors, and distributed-system incident tracing. Trigger on multi-service error spikes, dependency-chain analysis, circuit-breaker events, or requests to identify origin service and blast radius from logs/metrics across components. When in doubt, use this skill.
aayushostwal/nexus · ★ 10 · AI & Automation · score 76
Install: claude install-skill aayushostwal/nexus
# Nexus Observability: API Failure Correlation Engine

Systematically correlate failures across distributed services to identify causal chains, blast radius, and the origin service, before proposing any fix.

---

## Compatibility

- Supporting files: `checklists/investigation-checklist.md`, `anti-patterns/common-mistakes.md`, `validation/output-validation.md`
- Required tools: Read, Bash, Grep
- Optional tools: WebSearch (vendor-specific error lookups)
- Hands off to: `nexus:debugging` once origin service is identified; `nexus:planning` after fix is agreed upon

---

## Core Principle

**Never assign blame before building the full cross-service timeline.** A cascade always has one origin; fixing a victim while the cause is live means the failure recurs.

---

## Workflow

### Step 1: Context Acquisition

Collect before reading any logs. Require items 1–5 minimum; ask for all in one message:

| # | Collect | Why |
|---|---------|-----|
| 1 | Verbatim error logs from **all** affected services | Paraphrased logs lose exact timestamps and error codes |
| 2 | Distributed traces (Jaeger/Zipkin/X-Ray/Tempo) for ≥3 failing requests | Shows exact call path and which hop introduced error/latency |
| 3 | Metrics: error rate, p99 latency, RPS, CPU/mem/conn-pool per service | Distinguishes saturation from errors from cascades |
| 4 | Exact first-elevated-error timestamp per service | Required for timeline in Step 2 |
| 5 | Dependency map (direct + indirect call relationships) | Requir