agent-eval-design

Use when designing evaluations for AI agents, skills, routers, prompts, tool-use policies, or multi-step workflows: task sets, rubrics, graders, hard negatives, regression cases, traces, and acceptance thresholds. Do NOT use for application test planning (use `testing-strategy`), skill-library health tooling (use `skill-infrastructure`), or live debugging of a failed run (use `debugging`).
jacob-balslev/skill-graph · ★ 0 · AI & Automation · score 66

Install: claude install-skill jacob-balslev/skill-graph

# Agent Eval Design

## Coverage

Design evaluations for agent behavior, skill routing, prompt systems, tool-use policies, and multi-step workflows. Covers task selection, expected behavior, rubrics, graders, hard negatives, trace capture, regression cases, thresholds, coverage, and eval maintenance.

## Philosophy

Agent evals are behavioral contracts. They should measure whether the agent does the right thing under realistic ambiguity, not whether it can parrot the happy path. The highest-value cases are hard negatives and prior failures. A routing eval with only obvious positives gives false confidence.

## Method

1. Define the behavior being evaluated in one sentence.
2. Collect realistic positive cases, near misses, and failure traces.
3. Write expected outcomes that are observable.
4. Add hard negatives that should route elsewhere or refuse an unsafe path.
5. Choose grader type: exact, rubric, trace inspection, artifact check, or hybrid.
6. Set pass thresholds and severity for failures.
7. Add regression cases whenever a real agent failure is fixed.

## Verification

- [ ] Eval cases include positives, hard negatives, and prior failures
- [ ] Expected outcomes are observable and not preference-only
- [ ] The grader can distinguish partially correct from wrong
- [ ] Thresholds match risk, not vanity metrics
- [ ] Cases cover routing, grounding, tool use, and final artifact where relevant
- [ ] New failures become regression cases
- [ ] Eval metadata honestly reflects r
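
The Method steps can be illustrated with a minimal routing eval. This is a sketch under stated assumptions, not part of the skill itself: `EvalCase`, `THRESHOLDS`, `run_eval`, and `keyword_router` are hypothetical names, the threshold values are made up for illustration, and the router is a toy stand-in for the agent under test. It uses an exact-match grader on the chosen route and applies a separate pass threshold per severity level.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected_route: str  # observable outcome: the skill that should be chosen
    kind: str            # "positive", "hard_negative", or "regression"
    severity: str        # "critical" or "minor"


# Hypothetical thresholds: required pass rate per severity level.
THRESHOLDS: Dict[str, float] = {"critical": 1.0, "minor": 0.8}


def run_eval(cases: List[EvalCase], router: Callable[[str], str]) -> Dict[str, dict]:
    """Grade each case by exact match on the route, then apply thresholds."""
    results = {}
    for severity, required in THRESHOLDS.items():
        graded = [
            router(case.prompt) == case.expected_route
            for case in cases
            if case.severity == severity
        ]
        # With no cases at a severity level, report a vacuous pass.
        rate = sum(graded) / len(graded) if graded else 1.0
        results[severity] = {"pass_rate": rate, "ok": rate >= required}
    return results


def keyword_router(prompt: str) -> str:
    """Toy stand-in for the agent under test (an assumption, not a real router)."""
    text = prompt.lower()
    if "eval" in text or "rubric" in text:
        return "agent-eval-design"
    if "debug" in text:
        return "debugging"
    return "testing-strategy"


CASES = [
    EvalCase("pos-1", "Design a rubric for grading router outputs",
             "agent-eval-design", "positive", "critical"),
    # Hard negative: mentions "eval" but should route to debugging.
    EvalCase("neg-1", "Debug why last night's eval run crashed",
             "debugging", "hard_negative", "critical"),
    EvalCase("neg-2", "Plan unit tests for the checkout service",
             "testing-strategy", "hard_negative", "minor"),
]

report = run_eval(CASES, keyword_router)
for severity, outcome in report.items():
    print(severity, outcome)
```

In this sketch the obvious positive passes, while the hard negative `neg-1` exposes that the toy router keys on the word "eval" and misroutes a debugging request. That drops the critical pass rate below its threshold. Once such a failure is fixed in a real system, the same case would stay in the set as a regression case.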