eval-driven-dev

Build the evaluation discipline that separates production agentic products from demos — error analysis on real traces, the three-level eval pyramid (code assertions / LLM-as-judge / human review), binary judge outputs calibrated against human labels, and CI gates that block regression. Based on the Husain/Shankar methodology. Use whenever the user mentions evals, evaluation, LLM-as-judge, hallucination testing, regression testing for AI, quality measurement, error analysis, "how do I know if my agent works," failure modes, or grading agent outputs.
AlexDuchDev/agentic-product-standard · ★ 5 · AI & Automation · score 77
Install: claude install-skill AlexDuchDev/agentic-product-standard
# Eval-Driven Development

Evaluation is the most critical and most under-invested practice in building agentic products. Hamel Husain and Shreya Shankar have codified the discipline; following it separates teams that ship from teams that demo.

The core insight from Husain's "Field Guide to Rapidly Improving AI Products" (after helping 30+ AI products): **the teams who succeed barely talk about tools. They obsess over measurement and iteration.**

## First principle: error analysis before infrastructure

Most teams reach for eval infrastructure (Braintrust, LangSmith, etc.) before they know what to measure. This is backwards.

**Start by reading production traces.** Read 20–50 real outputs manually after each meaningful change. Write down what went wrong in plain language. Cluster the failure modes into 5–10 named buckets. These named buckets are your eval categories; generic "helpfulness" never catches them.

Common failure mode buckets (yours will be different and product-specific):

- "Missed human handoff" (agent should have escalated, didn't)
- "Wrong tool selection" (chose web_search when should have used internal docs)
- "Stale information" (used cached/old data when fresh was required)
- "Lost context across compaction" (forgot user's earlier constraint)
- "Hallucinated citation" (made up a source URL)

Each named failure mode becomes an eval. Generic evals do not.

## The three-level eval pyramid

```
        ▲
       ╱ ╲        Level 3: Human Review
      ╱   ╲       Major