observability-sre
Install: claude install-skill majiayu000/claude-arsenal
# Observability & Site Reliability Engineering
## Core Principles
- **Three Pillars** — Metrics, Logs, and Traces provide holistic visibility
- **Observability-First** — Build systems that explain their own behavior
- **SLO-Driven** — Define reliability targets that matter to users
- **Proactive Detection** — Find issues before customers do
- **Blameless Culture** — Learn from failures without blame
- **Automate Toil** — Reduce repetitive operational work
- **Continuous Improvement** — Each incident makes systems more resilient
- **Full-Stack Visibility** — Monitor from infrastructure to business metrics
---
## Hard Rules (Must Follow)
> These rules are mandatory. Violating them means the skill is not working correctly.
### Symptom-Based Alerts Only
**Alert on user-facing symptoms, not internal infrastructure metrics.**
```yaml
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70  # percent; PromQL has no "%" literal
  # Users don't care about CPU, they care about latency
- alert: MemoryHigh
  expr: memory_usage > 80  # percent
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200  # seconds
  annotations:
    summary: "Users experiencing slow response times"
- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001  # ratio, i.e. 0.1% of requests
  annotations:
    summary: "Users encountering errors"
```
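The required alerts above reference the recorded series `slo:api_latency:p95` and `slo:api_errors:rate5m`, which are not defined in this section. One way they could be produced is with Prometheus recording rules. This is a minimal sketch, not the skill's actual definition. It assumes the service exports a latency histogram named `http_request_duration_seconds` and a request counter named `http_requests_total` with a `status` label. Both metric names, the label, and the group name are assumptions, so substitute whatever your instrumentation emits.

```yaml
groups:
  - name: slo_recording_rules  # hypothetical group name
    rules:
      # p95 request latency in seconds over the last 5 minutes.
      # Assumes a histogram metric named http_request_duration_seconds.
      - record: slo:api_latency:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      # Fraction of requests answered with a 5xx status over the last 5 minutes.
      # Assumes a counter named http_requests_total with a "status" label.
      - record: slo:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```

Recording the symptom once and alerting on the recorded series keeps the alert expressions short and ensures dashboards and alerts evaluate the same definition of "slow" and "failing".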
### Low Cardinality Labels
**Loki/Prometheus labels must have low cardinality (<10 unique labels).**
```yaml
# ❌ FORBIDDEN: Hig