Install: claude install-skill bakw00ds/yakos
# Prompt Eval
## Purpose
Score a prompt change against a fixed golden dataset and surface
regressions per rubric: the failure modes that hide inside an
aggregate "92% accuracy" number. The skill pins three things so that
each run is reproducible:
- **Model version.** The exact provider model id (e.g.,
`claude-opus-4-7`, `gpt-5.1-2026-04-15`), not an alias.
- **Rubric version.** A content hash of the rubric file at eval time.
- **Dataset version.** A content hash of the dataset at eval time.
Without those three pins, week-over-week comparisons drift silently
when the model alias rolls forward or someone edits a rubric line.
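
As a rough illustration of the three pins, the sketch below hashes the rubric and dataset files and records them next to the exact model id. The function names (`content_hash`, `build_pins`) and the dictionary keys are hypothetical, and SHA-256 is an assumption; the skill's actual hash algorithm and on-disk run format are not specified here.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_pins(model_id: str, rubric_path: Path, dataset_path: Path) -> dict[str, str]:
    """Collect the three pins for a run (hypothetical key names)."""
    return {
        "model": model_id,  # exact provider model id, not an alias
        "rubric_sha256": content_hash(rubric_path),
        "dataset_sha256": content_hash(dataset_path),
    }


def comparable(pins_a: dict[str, str], pins_b: dict[str, str]) -> bool:
    """Two runs are directly comparable only if all three pins match."""
    return pins_a == pins_b
```

Under these assumptions, editing a single rubric line changes `rubric_sha256`, so `comparable` returns `False` for runs before and after the edit, and the drift shows up explicitly.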
## Scope
- Reads a prompt from `<project>/prompts/<prompt-id>/{system.md,user.tmpl}`.
- Reads a dataset of `(input, expected, rubric_subset)` triples from
`<project>/eval/datasets/<dataset>.jsonl`.
- Reads rubrics from `<project>/eval/rubrics/<name>.yaml`. Each rubric
is a named pass/fail (or 0–1 scalar) check applied to the model's
output for a given input.
- Runs the prompt over the dataset against the configured model,
scores each example per rubric, writes results to
`<project>/eval/runs/<run-id>/`.
- Diffs the run against the baseline run (default: the most recent
run on `main`) and reports per-rubric deltas. The aggregate score is
included but secondary.
- Designed for `eval-engineer` and `prompt-engineer` to share. The
prompt-engineer iterates the prompt; the eval-engineer owns the
rubric and dataset.
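
The per-rubric diff described above can be sketched as follows. This is a minimal illustration, not the skill's implementation: the in-memory result shape (a list of rows, each with a `scores` mapping from rubric name to a 0–1 score) and the function names are assumptions. Because each example carries its own rubric subset, means are taken only over the examples a rubric was applied to.

```python
from collections import defaultdict


def per_rubric_means(results: list[dict]) -> dict[str, float]:
    """Mean score per rubric, counting only examples where it was applied."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for row in results:
        for rubric, score in row["scores"].items():
            totals[rubric] += score
            counts[rubric] += 1
    return {rubric: totals[rubric] / counts[rubric] for rubric in totals}


def rubric_deltas(baseline: list[dict], candidate: list[dict]) -> dict[str, float]:
    """Candidate mean minus baseline mean, for rubrics present in both runs."""
    base = per_rubric_means(baseline)
    cand = per_rubric_means(candidate)
    return {r: cand[r] - base[r] for r in sorted(base.keys() & cand.keys())}


def aggregate(results: list[dict]) -> float:
    """Single overall score: mean over every (example, rubric) pair."""
    scores = [s for row in results for s in row["scores"].values()]
    return sum(scores) / len(scores)


# Hypothetical runs: the aggregate is identical, yet one rubric regresses.
baseline = [
    {"scores": {"format": 1, "cites_source": 1}},
    {"scores": {"format": 1, "cites_source": 1}},
    {"scores": {"format": 0, "cites_source": 1}},
    {"scores": {"format": 0, "cites_source": 1}},
]
candidate = [
    {"scores": {"format": 1, "cites_source": 1}},
    {"scores": {"format": 1, "cites_source": 1}},
    {"scores": {"format": 1, "cites_source": 0}},
    {"scores": {"format": 1, "cites_source": 0}},
]

print(aggregate(baseline), aggregate(candidate))  # 0.75 0.75
print(rubric_deltas(baseline, candidate))  # {'cites_source': -0.5, 'format': 0.5}
```

In this made-up data the aggregate stays at 0.75 while `cites_source` drops by 0.5, which is the kind of regression an aggregate number hides and the reason the aggregate is treated as secondary.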
## When to use
- Before merging a prompt change, to