prompt-eval

Run a prompt against a golden dataset and report per-rubric regression vs the prior pinned version, not just an aggregate score
bakw00ds/yakos · ★ 2 · AI & Automation · score 81

Install: claude install-skill bakw00ds/yakos

# Prompt Eval

## Purpose

Score a prompt change against a fixed golden dataset and surface regressions per rubric: the failure modes that hide inside an aggregate "92% accuracy" number.

Pins three things so the run is reproducible:

- **Model version.** The exact provider model id (e.g., `claude-opus-4-7`, `gpt-5.1-2026-04-15`), not an alias.
- **Rubric version.** A content hash of the rubric file at eval time.
- **Dataset version.** A content hash of the dataset at eval time.

Without those three pins, week-over-week comparisons drift silently when the model alias rolls forward or someone edits a rubric line.

## Scope

- Reads a prompt from `<project>/prompts/<prompt-id>/{system.md,user.tmpl}`.
- Reads a dataset of `(input, expected, rubric_subset)` triples from `<project>/eval/datasets/<dataset>.jsonl`.
- Reads rubrics from `<project>/eval/rubrics/<name>.yaml`. Each rubric is a named pass/fail (or 0–1 scalar) check applied to the model's output for a given input.
- Runs the prompt over the dataset against the configured model, scores each example per rubric, writes results to `<project>/eval/runs/<run-id>/`.
- Diffs against the baseline run (default: most recent run on `main`) and reports per-rubric deltas. Aggregate score is included but secondary.
- Designed for `eval-engineer` and `prompt-engineer` to share. The prompt-engineer iterates the prompt; the eval-engineer owns the rubric and dataset.

## When to use

- Before merging a prompt change, to