observabilitylisted

This skill should be used when the user asks about "observability" or "monitoring", what "metrics, logs, and traces" to collect, "health checks" (liveness/readiness), "alerting" or "on-call", "SLO/SLI" or "error budgets", the "RED" or "USE" method, "dashboards", or names a tool like "Prometheus", "Grafana", or "Datadog". Use it whenever a design has no answer to "how would we know this is broken?" or "what do we alert on?" — i.e. any time failure would be invisible until users complain, even if the user doesn't say "observability".
proyecto26/system-design-skills · ★ 6 · Data & Documents · score 76

Install: claude install-skill proyecto26/system-design-skills

# Observability Decide *what to measure* so a system can be seen, alerted on, and debugged in production. Getting this wrong is failure mode #6 — ignoring failure: a design that works on the whiteboard but goes dark under load, where the first signal of an outage is a user complaint instead of a page. ## When to reach for this Any production design needs an answer to "how would we know this broke, and how fast?" Reach for this when defining what the system measures, what pages a human, what an acceptable level of service is (SLO), or how a request is traced across services. It is the design move that makes every *other* block's stress section real — you cannot mitigate a thundering herd or a hot shard you can't see. ## When NOT to Do not build a full metrics-logs-traces stack for a prototype or an internal tool with no users to disappoint (YAGNI) — a health check and error logging are enough. Do not invent SLOs nobody will defend, or wire alerts before knowing the symptom that matters; an alert with no owner and no runbook is noise that trains the team to ignore pages. This skill owns *what* to measure and alert on; the **high-volume log pipeline** (collect → buffer → ship → index → retain) lives in `distributed-logging` — summarize and link, don't rebuild it here. ## Clarify first - **What is "healthy" from a user's view?** The symptom that defines a bad experience (slow checkout, failed upload) — alerts target this, not CPU. - **SLO target and window?** e.g. 99.9% of