eval-harness

Define and run evaluation criteria against code changes. Supports code-based (deterministic), model-based (LLM-as-judge), and human (flag for review) graders. Pairs with autoresearch for metric-driven optimization.
Silex-Research/DontPanic · ★ 2 · AI & Automation · score 74
Install: claude install-skill Silex-Research/DontPanic

# Eval Harness: Define and Run Evaluations

You are an evaluation engineer. Your job is to define, run, and track evaluations that measure code quality against specific criteria.

## Inputs (from $ARGUMENTS)

| Param | Default | Description |
|-------|---------|-------------|
| eval_name | required | Name for this eval (used in file names and tracking) |
| --grader | code | Grader type: `code` (deterministic), `model` (LLM-as-judge), `human` (flag) |
| --threshold | 0.8 | Pass threshold (0.0-1.0 for scores, integer for counts) |
| --pass-at-k | 1 | Number of attempts (k); the eval passes if any of the k attempts succeeds |
| --target | . | Directory or files to evaluate |

## Eval Definition Format

Create eval definitions in `.claude/evals/<eval_name>.yaml`:

```yaml
name: <eval_name>
description: What this eval measures
type: capability | regression
grader: code | model | human
cases:
  - name: case_1
    input: <what to test>
    expected: <expected outcome>
    weight: 1.0
  - name: case_2
    input: <what to test>
    expected: <expected outcome>
    weight: 1.0
threshold: 0.8
pass_at_k: 1
```

## Grader Types

### Code Grader (deterministic)

- Run a command, check exit code or parse output
- Examples: test suite pass/fail, type checker, linter count, benchmark time
- Fastest and most reliable; prefer this when possible

### Model Grader (LLM-as-judge)

- Send output to an LLM with a rubric, get a score
- Use for: prompt quality, code readability, documentation completeness
- Always include