sre-engineer
Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, or working on incident management, chaos engineering, toil reduction, or capacity planning.
Quality Score: 94/100
Details
- Author: Jeffallan
- Repository: Jeffallan/claude-skills
- Created: 7 months ago
- Last Updated: 3 weeks ago
- Language: Python
- License: MIT
Similar Skills
Semantically similar based on skill content, not just the same category
sre-engineer
SRE / Observability Engineer (/sre) — reliability engineering: SLOs/SLIs & error budgets, monitoring & alerting (Prometheus, Grafana, OpenTelemetry), incident response & runbooks, on-call, capacity & load, chaos/resilience, and post-incident reviews. Use when defining reliability targets, instrumenting observability, setting up alerting, writing runbooks, doing incident response, or reviewing a change for production readiness. Invoke alongside /arch for reliability NFRs and devops-engineer for the underlying infra/CI-CD. NOT for provisioning infra or pipelines (that's devops-engineer) — /sre owns reliability, not the cluster.
sre-patterns
Provides Site Reliability Engineering best practices for SLOs, SLIs, SLAs, error budgets, toil reduction, reliability reviews, and capacity planning. Use when defining service objectives, measuring reliability, reducing toil, planning capacity, or when user mentions 'SRE', 'SLO', 'SLI', 'SLA', 'error budget', 'toil', 'reliability', 'on-call', 'capacity planning'.
operating-production-services
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).