← ClaudeAtlas

eval-before-shiplisted

Use before merging, deploying, or demo'ing any LLM feature. Requires documented eval results — pass rate, failure analysis, baseline comparison. Blocks "it looked good when I tested it" completions.
RBraga01/builder-ai · ★ 2 · AI & Automation · score 68
Install: claude install-skill RBraga01/builder-ai
# Eval Before Ship ## The Law ``` AN LLM FEATURE IS NOT READY UNTIL NUMBERS EXIST. "It looked good when I tested it" is not an eval. "I ran a few examples and it worked" is not an eval. A named suite, a defined metric, a pass rate, a failure analysis, and a baseline comparison IS an eval. All five. Not four. ``` ## When to Use Trigger before any of these: - Merging a PR that adds or modifies a prompt - Deploying an LLM feature to any environment users can reach - Switching models or providers on an existing feature - Changing retrieval logic, rerankers, or chunk strategy in a RAG pipeline - Updating few-shot examples or system prompt structure ## When NOT to Use - Exploratory prototypes that will not reach users (note it: "eval required before production") - Config-only changes that cannot affect model output (timeouts, logging, env vars) ## What Counts as an Eval An eval must have all five components: | Component | What It Means | What Does NOT Count | |---|---|---| | Named test suite | File with labelled examples in `evals/` | "I tested it manually" | | Defined metric | Accuracy %, faithfulness score, task pass rate | "It seemed accurate" | | Pass threshold | Explicit minimum (e.g., ≥ 85%) | No threshold = no standard | | Failure analysis | ≥ 5 failures examined and categorised | "There were a few errors" | | Baseline comparison | This version vs. previous version or control | First release exempt; all subsequent require it | ## The Process ### Step 1 — Define th