
regen-eval-baseline

Captures a promptfoo skill-eval baseline JSON for one sumo-qa skill and snapshots it to docs/qa/runs/eval-baselines/, with an automatic delta against the prior snapshot. Use this whenever the user mentions baselining a skill, capturing a before/after eval, running a single-skill eval, or measuring the effect of a SKILL.md edit — common during token-optimisation rounds. The actual work runs through a bundled script that handles path conventions, API-key checks, and diffing in one go.
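As a rough illustration of the "automatic delta against the prior snapshot", the sketch below subtracts one snapshot's numeric metrics from another's. It is not the bundled script's logic: the flat metric names (`passed`, `failed`, `total_tokens`) and the numbers are hypothetical, and the real baseline JSON layout is not described here.

```python
from typing import Dict


def metric_delta(prior: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Return current minus prior for every metric present in both snapshots."""
    shared = prior.keys() & current.keys()
    return {name: current[name] - prior[name] for name in sorted(shared)}


# Hypothetical metrics, for illustration only (not real eval results).
before = {"passed": 8, "failed": 2, "total_tokens": 12400}
after = {"passed": 9, "failed": 1, "total_tokens": 10950}

delta = metric_delta(before, after)
print(delta)
```

A negative `total_tokens` delta alongside an unchanged or higher `passed` count is the kind of result a token-optimisation round would be looking for.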
sumithr/sumo-qa · ★ 4 · Testing & QA · score 73
Install: claude install-skill sumithr/sumo-qa
# regen-eval-baseline

Captures a promptfoo run for one sumo-qa skill and stores its JSON output in `docs/qa/runs/eval-baselines/` (gitignored). The deterministic work lives in `scripts/run_baseline.py`; this document is the guide for picking inputs and reading the output.

## When to use

Trigger this skill when the user wants a per-skill eval snapshot. Common phrasings: "baseline this skill", "snapshot the eval", "run the eval for skill X", "capture before/after for the rewrite I just made". The user invokes it explicitly with `/regen-eval-baseline`; it doesn't auto-trigger.

This is single-skill on purpose. Full-sweep regeneration belongs on `npm run eval:all`, which also reads the same `tests/evals/promptfoo/skill-*.yaml` files but runs them sequentially without snapshotting.

## Inputs

Pass **exactly one** config selector, either `--skill` (the base config) or `--config` (a suffixed scenario / `.ab.yaml` control / explicit path), plus an optional `--label`.

1. **`--skill <name>`**: the base config `tests/evals/promptfoo/skill-<name>.yaml`. Resolves that file **exactly**; it never cross-matches a longer suffixed sibling (e.g. `--skill reviewing-before-merge` drives `skill-reviewing-before-merge.yaml`, NOT `skill-reviewing-before-merge-adversarial.yaml`). If the named config is absent the script lists the available configs.
2. **`--config <selector>`**: for a suffixed scenario config or an `.ab.yaml` A/B control. Accepts a bare stem (`skill-reviewing-before-merge-adv