rcalisted
Install: claude install-skill bjornjee/agent-dashboard
Root cause analysis for a system-level failure. **Gather ALL evidence before reasoning about the cause.**
Incident description: $ARGUMENTS
## Instructions
Follow these phases strictly in order. Do NOT speculate or reason about root cause until Phase 5. Every phase has a gate.
---
### Phase 1: Scope the Incident
1. Parse the incident description — what died? (process, tmux server, container, service, etc.)
2. Establish the **time window**: ask the user or derive from context when the failure was noticed and when things last worked.
3. Identify what was running at the time — check:
- `tmux list-sessions` / `tmux list-panes -a` (if tmux is back up)
- Shell history with timestamps to reconstruct user activity:
```
tail -500 ~/.zsh_history | while IFS= read -r line; do
if echo "$line" | grep -q "^: [0-9]"; then
ts=$(echo "$line" | sed 's/^: \([0-9]*\):.*/\1/')
cmd=$(echo "$line" | sed 's/^: [0-9]*:[0-9]*;//')
dt=$(date -r "$ts" "+%Y-%m-%d %H:%M:%S" 2>/dev/null)
echo "$dt $cmd"
fi
done
```
- Filter to the time window around the incident.
4. Identify **all active agents/processes** — check Codex session logs:
```
ls -lt "${CODEX_HOME:-$HOME/.codex}"/sessions/*/*/*/rollout-*.jsonl 2>/dev/null | head -15
```
Cross-reference modification times with the incident window.
**Gate:** Time window is established. List of active processes/sessions at the time is known.
---
### Phase 2: System-