eval-harness

Quality Score: 82/100

Stars (20%): 20
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 80
License (10%): 100
Description (5%): 100

Skill Content

# Eval Harness: Define and Run Evaluations

You are an evaluation engineer. Your job is to define, run, and track evaluations that measure code quality against specific criteria.

## Inputs (from $ARGUMENTS)

| Param | Default | Description |
|-------|---------|-------------|
| eval_name | required | Name for this eval (used in file names and tracking) |
| --grader | code | Grader type: `code` (deterministic), `model` (LLM-as-judge), `human` (flag) |
| --threshold | 0.8 | Pass threshold (0.0-1.0 for scores, integer for counts) |
| --pass-at-k | 1 | Number of attempts (k); the eval passes if any of the k attempts succeeds |
| --target | . | Directory or files to evaluate |

## Eval Definition Format

Create eval definitions in `.claude/evals/<eval_name>.yaml`:

```yaml
name: <eval_name>
description: What this eval measures
type: capability | regression
grader: code | model | human
cases:
  - name: case_1
    input: <what to test>
    expected: <expected outcome>
    weight: 1.0
  - name: case_2
    input: <what to test>
    expected: <expected outcome>
    weight: 1.0
threshold: 0.8
pass_at_k: 1
```

## Grader Types

### Code Grader (deterministic)

- Run a command, then check the exit code or parse the output
- Examples: test suite pass/fail, type checker, linter count, benchmark time
- Fastest and most reliable; prefer this when possible

### Model Grader (LLM-as-judge)

- Send output to an LLM with a rubric, get a score
- Use for: prompt quality, code readability, documentation completeness
- Always include ...
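
To illustrate how `weight`, `threshold`, and `pass_at_k` might fit together with a code grader, here is a minimal Python sketch. It is not the skill's actual implementation: the names `Case`, `code_grader`, `weighted_score`, and `run_eval` are hypothetical. It also assumes that a case's score is the weight-normalized fraction of passing cases, and that the eval passes as soon as any one of the k attempts reaches the threshold.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Case:
    """One eval case; mirrors the `name` and `weight` fields of the YAML format."""
    name: str
    weight: float = 1.0


def code_grader(cmd: Sequence[str]) -> bool:
    """Deterministic grader (assumed behavior): pass if the command exits with 0."""
    result = subprocess.run(list(cmd), capture_output=True)
    return result.returncode == 0


def weighted_score(cases: Sequence[Case], passed: Dict[str, bool]) -> float:
    """Fraction of total weight earned by passing cases (assumed scoring rule)."""
    total = sum(c.weight for c in cases)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in cases if passed.get(c.name, False))
    return earned / total


def run_eval(
    cases: Sequence[Case],
    attempt: Callable[[int], Dict[str, bool]],
    threshold: float = 0.8,
    pass_at_k: int = 1,
) -> Tuple[bool, List[float]]:
    """Run up to `pass_at_k` attempts; stop at the first score >= threshold."""
    scores: List[float] = []
    for i in range(pass_at_k):
        score = weighted_score(cases, attempt(i))
        scores.append(score)
        if score >= threshold:
            return True, scores
    return False, scores


if __name__ == "__main__":
    # Hypothetical cases: each one is graded by running a trivial command.
    cases = [Case("exits_cleanly", 1.0), Case("exits_with_error", 1.0)]
    commands = {
        "exits_cleanly": [sys.executable, "-c", "raise SystemExit(0)"],
        "exits_with_error": [sys.executable, "-c", "raise SystemExit(1)"],
    }
    ok, scores = run_eval(
        cases,
        attempt=lambda i: {name: code_grader(cmd) for name, cmd in commands.items()},
        threshold=0.8,
        pass_at_k=2,
    )
    print("passed:", ok, "scores:", scores)
```

Stopping at the first passing attempt keeps the run cheap when `pass_at_k` is large; a harness that needs per-attempt statistics would instead run all k attempts and record every score.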

Details

Author: Silex-Research
Repository: Silex-Research/DontPanic
Created: 4 months ago
Last Updated: yesterday
Language: Python
License: Apache-2.0
