Install: `claude install-skill eforge-build/eforge`
# /eval-analysis
Structured methodology for analyzing eforge eval results, identifying signal patterns, and proposing changes with anti-bias safeguards.
## Prerequisites
- The eval harness MCP server must be connected (provides `eval_runs`, `eval_observations`, `eval_scenario_detail`, `eval_run`, `eval_results` tools)
- You should be in or have access to the eforge project codebase
## Workflow
### Step 1: Gather Recent Eval Data
Start by checking what eval data is available:
1. Use `eval_runs` to list recent eval runs. Note run IDs, timestamps, and which scenarios were included.
2. If the user mentions a specific run or comparison, use `eval_results` with the relevant run ID(s).
3. If comparing baseline vs candidate, use `eval_results` with the `compare` parameter to get a structured regression comparison between two runs.
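The selection of a baseline and candidate run can be sketched in Python. This is only an illustration: the record fields (`id`, `timestamp`, `scenarios`) and the helper `pick_comparison` are hypothetical and are not taken from the eval harness, whose `eval_runs` output may be shaped differently. The sketch treats the newest run as the candidate and picks the most recent earlier run that shares at least one scenario with it, since a comparison across disjoint scenario sets carries no regression signal.

```python
from datetime import datetime

# Hypothetical run records; the real `eval_runs` output may use other field names.
runs = [
    {"id": "run-a", "timestamp": "2024-05-01T10:00:00", "scenarios": ["s1", "s2"]},
    {"id": "run-b", "timestamp": "2024-05-03T09:30:00", "scenarios": ["s1", "s2", "s3"]},
    {"id": "run-c", "timestamp": "2024-05-02T14:15:00", "scenarios": ["s4"]},
]


def pick_comparison(runs):
    """Return (baseline_id, candidate_id, shared_scenarios), or None.

    The newest run is the candidate; the baseline is the most recent
    earlier run that has at least one scenario in common with it.
    """
    ordered = sorted(
        runs,
        key=lambda r: datetime.fromisoformat(r["timestamp"]),
        reverse=True,
    )
    if not ordered:
        return None
    candidate = ordered[0]
    for older in ordered[1:]:
        shared = sorted(set(candidate["scenarios"]) & set(older["scenarios"]))
        if shared:
            return older["id"], candidate["id"], shared
    return None  # no earlier run overlaps with the candidate


print(pick_comparison(runs))
```

The two IDs returned here are what you would then pass to `eval_results` for the comparison; how the `compare` parameter expects them is defined by the harness, not by this sketch.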
### Step 2: Pull Observations
For runs of interest, use `eval_observations` to get detailed per-scenario observation data. This gives you the raw signal: scores, pass/fail status, and any metadata attached to each observation.
Focus on:
- Scenarios with low scores or failures
- Scenarios where scores changed significantly between runs
- Patterns across scenario categories
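The three focus points above can be expressed as a small triage pass. The following is a minimal sketch under stated assumptions: the observation fields (`score`, `passed`, `category`), the thresholds `low` and `delta`, and the function `triage` are all hypothetical and chosen for illustration. The actual `eval_observations` output may look different.

```python
from collections import defaultdict

# Hypothetical observations keyed by scenario name; field names are assumptions.
baseline = {
    "auth-login": {"score": 0.9, "passed": True, "category": "auth"},
    "auth-reset": {"score": 0.8, "passed": True, "category": "auth"},
    "build-cache": {"score": 0.7, "passed": True, "category": "build"},
}
candidate = {
    "auth-login": {"score": 0.4, "passed": False, "category": "auth"},
    "auth-reset": {"score": 0.5, "passed": True, "category": "auth"},
    "build-cache": {"score": 0.75, "passed": True, "category": "build"},
}


def triage(baseline, candidate, low=0.5, delta=0.2):
    """Flag scenarios that failed, scored below `low`, or moved by >= `delta`.

    Returns (flagged, by_category): a list of (name, category, reasons)
    tuples and a dict grouping flagged scenario names by category.
    """
    flagged = []
    for name, obs in candidate.items():
        reasons = []
        if not obs["passed"]:
            reasons.append("failed")
        if obs["score"] < low:
            reasons.append("low score")
        prev = baseline.get(name)
        if prev is not None and abs(obs["score"] - prev["score"]) >= delta:
            reasons.append("score changed")
        if reasons:
            flagged.append((name, obs["category"], reasons))

    by_category = defaultdict(list)
    for name, category, _ in flagged:
        by_category[category].append(name)
    return flagged, dict(by_category)


flagged, by_category = triage(baseline, candidate)
for name, category, reasons in flagged:
    print(f"{name} [{category}]: {', '.join(reasons)}")
```

Grouping by category makes clusters visible: several flagged scenarios in one category suggest a shared cause, while a single flagged scenario may just be noise. The thresholds are arbitrary placeholders and should be tuned to the score scale the harness actually reports.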
### Step 3: Drill Into Affected Scenarios
For each scenario showing issues or regressions, use `eval_scenario_detail` to get the full scenario specification: inputs, expected outputs, scoring criteria, and any notes.
This is critical context: you need to understand what the scenari