← ClaudeAtlas

kubectl-investigatorlisted

Investigate a live or recent incident in a Kubernetes cluster. Anchor the window, bisect the change surface (rollouts, ConfigMaps/Secrets, RBAC, HPA/cluster changes, CronJobs), classify against four reference failure paths (OOM, DNS, cascading-failure, deploy-correlator), confirm the hypothesis with three independent signals, quantify blast radius, and propose mitigation before root cause. Use whenever an agent is asked "what is breaking in the cluster right now", "why did this pod/Deployment just page", "did the rollout cause Z", or to triage an active Kubernetes incident. Vendor-neutral by default (works with kubectl, kube-state-metrics, and whatever telemetry you have); an opt-in Anyshift integration is documented separately.
anyshift-io/sre-skills · ★ 13 · AI & Automation · score 80
Install: claude install-skill anyshift-io/sre-skills
# kubectl-investigator Methodology skill for investigating a live or recent incident on **Kubernetes**. Produces a timeline, a ranked set of hypotheses, a blast-radius estimate, and a recommended mitigation. Hands off cleanly to `postmortem-author` once the incident is mitigated. Scope: workloads running on Kubernetes (Deployments, StatefulSets, DaemonSets, Jobs/CronJobs) and the cluster primitives around them (Services, Ingress, CoreDNS, ConfigMaps/Secrets, RBAC, HPA, nodes). External dependencies (third-party APIs, partner TLS endpoints, managed databases) are in scope only as seen *from* a Kubernetes workload — the methodology investigates the cluster-side symptom and the in-cluster change surface. ## When to invoke - A `PrometheusRule` / Alertmanager alert just fired on a workload and the agent needs to triage before paging a human. - A user asks "what is breaking in the cluster right now" or "why did Deployment X just page". - A `kubectl rollout` / Helm release / Argo CD sync went out in the last hour and a metric moved; need to know whether they are linked. - Pods are crash-looping, `OOMKilled`, or `Pending`, or customer impact is reported with no alert yet; need to find the failing surface. ## The methodology, in order The order matters. Skipping a step produces confident wrong answers. ### 1. Anchor the window Lock two timestamps before doing anything else: - **T0**: the trigger timestamp. Apply this order: 1. **If an alert is provided as the trigger, T0 =