kubectl-investigatorlisted
Install: claude install-skill anyshift-io/sre-skills
# kubectl-investigator
Methodology skill for investigating a live or recent incident on **Kubernetes**. Produces a timeline, a ranked set of hypotheses, a blast-radius estimate, and a recommended mitigation. Hands off cleanly to `postmortem-author` once the incident is mitigated.
Scope: workloads running on Kubernetes (Deployments, StatefulSets, DaemonSets, Jobs/CronJobs) and the cluster primitives around them (Services, Ingress, CoreDNS, ConfigMaps/Secrets, RBAC, HPA, nodes). External dependencies (third-party APIs, partner TLS endpoints, managed databases) are in scope only as seen *from* a Kubernetes workload — the methodology investigates the cluster-side symptom and the in-cluster change surface.
## When to invoke
- A `PrometheusRule` / Alertmanager alert just fired on a workload and the agent needs to triage before paging a human.
- A user asks "what is breaking in the cluster right now" or "why did Deployment X just page".
- A `kubectl rollout` / Helm release / Argo CD sync went out in the last hour and a metric moved; need to know whether they are linked.
- Pods are crash-looping, `OOMKilled`, or `Pending`, or customer impact is reported with no alert yet; need to find the failing surface.
## The methodology, in order
The order matters. Skipping a step produces confident wrong answers.
### 1. Anchor the window
Lock two timestamps before doing anything else:
- **T0**: the trigger timestamp. Apply this order:
1. **If an alert is provided as the trigger, T0 =