eval-runnerlisted
Install: claude install-skill haabe/mycelium
# Eval Runner
Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.
## Commands
### `run <category/name>`
1. Read YAML from `.claude/evals/scenarios/<category>/<name>.yml`
2. Parse fields (name, category, task_prompt, success_criteria, budget)
3. Execute setup steps if defined
4. Record start time
5. Execute task via reflexion workflow (read corrections first)
6. Record end time and iteration count
7. Validate ALL success criteria
8. Write result JSON to `.claude/evals/results/<timestamp>-<name>.json`
9. Report summary
### `run-all [category]`
1. Glob `.claude/evals/scenarios/**/*.yml`
2. Skip scenarios with `status: retired`
3. For each: run in isolation (git stash), record result, restore
4. Update `.claude/evals/pass-history.json` with each result
5. Aggregate and report
### `run-split <optimization|holdout>`
1. Glob `.claude/evals/scenarios/**/*.yml`
2. Read each YAML, filter by `split` field matching the requested set
3. Skip scenarios with `status: retired`
4. For each matching scenario: run in isolation, record result, restore
5. Update `.claude/evals/pass-history.json` with each result
6. Aggregate and report (label output clearly as "Optimization Set" or "Holdout Set")
### `report`
1. Read all results from `.claude/evals/results/`
2. Generate summary table:
```
| Category | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery | ... | ...