eval-analysis

Analyze eval results to understand signal quality and guide prompt/config changes. Use when the user says "eval-analysis", "analyze eval results", "what do the evals show", "eval regression", or asks about eval signal enrichment.
eforge-build/eforge · ★ 65 · AI & Automation · score 84

Install: claude install-skill eforge-build/eforge

# /eval-analysis

Structured methodology for analyzing eforge eval results, identifying signal patterns, and proposing changes with anti-bias safeguards.

## Prerequisites

- The eval harness MCP server must be connected (provides `eval_runs`, `eval_observations`, `eval_scenario_detail`, `eval_run`, `eval_results` tools)
- You should be in or have access to the eforge project codebase

## Workflow

### Step 1: Gather Recent Eval Data

Start by checking what eval data is available:

1. Use `eval_runs` to list recent eval runs. Note run IDs, timestamps, and which scenarios were included.
2. If the user mentions a specific run or comparison, use `eval_results` with the relevant run ID(s).
3. If comparing baseline vs candidate, use `eval_results` with the `compare` parameter to get a structured regression comparison between two runs.

### Step 2: Pull Observations

For runs of interest, use `eval_observations` to get detailed per-scenario observation data. This gives you the raw signal: scores, pass/fail, and any metadata attached to each observation.

Focus on:

- Scenarios with low scores or failures
- Scenarios where scores changed significantly between runs
- Patterns across scenario categories

### Step 3: Drill Into Affected Scenarios

For each scenario showing issues or regressions, use `eval_scenario_detail` to get the full scenario specification: inputs, expected outputs, scoring criteria, and any notes. This is critical context: you need to understand what the scenari