eval-runner

Install

View on GitHub

Quality Score: 86/100

Stars 20%

54

Recency 20%

100

Frontmatter 20%

70

Documentation 15%

100

Issue Health 10%

80

License 10%

100

Description 5%

100

Skill Content

# Eval Runner Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system. ## Commands ### `run <category/name>` 1. Read YAML from `.claude/evals/scenarios/<category>/<name>.yml` 2. Parse fields (name, category, task_prompt, success_criteria, budget) 3. Execute setup steps if defined 4. Record start time 5. Execute task via reflexion workflow (read corrections first) 6. Record end time and iteration count 7. Validate ALL success criteria 8. Write result JSON to `.claude/evals/results/<timestamp>-<name>.json` 9. Report summary ### `run-all [category]` 1. Glob `.claude/evals/scenarios/**/*.yml` 2. Skip scenarios with `status: retired` 3. For each: run in isolation (git stash), record result, restore 4. Update `.claude/evals/pass-history.json` with each result 5. Aggregate and report ### `run-split <optimization|holdout>` 1. Glob `.claude/evals/scenarios/**/*.yml` 2. Read each YAML, filter by `split` field matching the requested set 3. Skip scenarios with `status: retired` 4. For each matching scenario: run in isolation, record result, restore 5. Update `.claude/evals/pass-history.json` with each result 6. Aggregate and report (label output clearly as "Optimization Set" or "Holdout Set") ### `report` 1. Read all results from `.claude/evals/results/` 2. Generate summary table: ``` | Category | Pass Rate | Avg Iterations | Avg Time | Notes | |-------------|-----------|----------------|----------|-------| | discovery | ... | ... ...

Details

Author: haabe
Repository: haabe/mycelium
Created: 3 months ago
Last Updated: today
Language: Python
License: MIT

Install

Quality Score: 86/100

Skill Content

Details

Similar Skills

prompt-optimizer

cli-eval

agent-eval