Install: claude install-skill Tibsfox/gsd-skill-creator
# SRE Patterns
Best practices for building and operating reliable systems using Site Reliability Engineering principles.
## SLO / SLI / SLA Definitions
These three concepts form the foundation of SRE. They are distinct and frequently confused.
| Concept | Definition | Owner | Example |
|---------|-----------|-------|---------|
| **SLI** (Service Level Indicator) | A quantitative measurement of a service attribute | Engineering | 99.2% of requests completed in < 300ms |
| **SLO** (Service Level Objective) | A target value or range for an SLI | Engineering + Product | 99.5% of requests must complete in < 300ms |
| **SLA** (Service Level Agreement) | A contract with consequences for missing an SLO | Business + Legal | 99.9% uptime or customer receives service credits |
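
The SLI and SLO rows above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the names `LatencySLO`, `latency_sli`, and `meets_slo` are hypothetical, and the sample latencies are made up to reproduce the table's 99.2% and 99.5% figures.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical container for a latency objective."""
    threshold_ms: float  # a request counts as "good" if it finishes faster than this
    target: float        # required fraction of good requests, e.g. 0.995


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the measured SLI: the fraction of requests under the threshold."""
    if not latencies_ms:
        raise ValueError("cannot compute an SLI without any requests")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, slo: LatencySLO) -> bool:
    """The SLO is met when the measured SLI reaches the target."""
    return sli >= slo.target


# Illustrative data (assumed): 992 fast requests and 8 slow ones.
sample = [120.0] * 992 + [450.0] * 8
slo = LatencySLO(threshold_ms=300.0, target=0.995)
sli = latency_sli(sample, slo.threshold_ms)
print(f"SLI = {sli:.1%}, SLO met: {meets_slo(sli, slo)}")
```

With this sample the measured SLI is 99.2%, which falls short of the 99.5% objective. The SLI is the observation; the SLO is the bar it is compared against.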
### Relationship
```
SLI (what you measure)
--> SLO (what you target, always stricter than SLA)
--> SLA (what you promise externally, with penalties)
```
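**Key rule:** The SLO must be stricter than the SLA. If your SLA promises 99.9% uptime, your internal SLO should target something tighter, such as 99.95%. The gap is your safety margin: you miss the internal target, and can react, before you breach the external contract.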
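
To put numbers on that gap, it can help to convert each availability target into allowed downtime. The sketch below assumes a 30-day measurement window (the window length is an assumption; the text above does not specify one), and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over the window."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return (1.0 - availability_target) * window_days * MINUTES_PER_DAY


def safety_margin_minutes(slo_target: float, sla_target: float, window_days: int = 30) -> float:
    """Downtime you can absorb after missing the SLO but before breaching the SLA."""
    if slo_target <= sla_target:
        raise ValueError("the SLO must be stricter than the SLA")
    return (allowed_downtime_minutes(sla_target, window_days)
            - allowed_downtime_minutes(slo_target, window_days))


sla_budget = allowed_downtime_minutes(0.999)    # SLA: 99.9%
slo_budget = allowed_downtime_minutes(0.9995)   # SLO: 99.95%
margin = safety_margin_minutes(0.9995, 0.999)
print(f"SLA budget: {sla_budget:.1f} min, SLO budget: {slo_budget:.1f} min, margin: {margin:.1f} min")
```

Over 30 days, a 99.9% SLA allows about 43.2 minutes of downtime and a 99.95% SLO allows about 21.6 minutes, so the margin between missing the internal target and breaching the contract is roughly 21.6 minutes.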
## SLI Specification
SLIs must be precise, measurable, and tied to user experience. Vague indicators lead to meaningless objectives.
### SLI Types by Service Category
| Service Type | SLI Category | Good Event | Valid Event |
|-------------|-------------|------------|-------------|
| Request-driven | Availability | Response status < 500 | All HTTP requests |
| Request-driven | Latency | Response time