golden-evallisted

Use this skill to run a fixed corpus of reference tasks against the current Claude configuration and detect regression vs a baseline. Triggers on "run golden eval", "check for drift", "did Claude get worse", "eval drift", or via a scheduled invocation (cron / Anthropic Routine via the /schedule skill). Captures cost, latency, and per-task pass/fail; flags any regression beyond a configurable threshold. Output: a JSON report at `~/.claude/skills/golden-eval/reports/<timestamp>.json` and (if regression detected) a PushNotification with the summary.
neuralforge-labs/tlmforge · ★ 4 · AI & Automation · score 55

Install: claude install-skill neuralforge-labs/tlmforge

# Golden eval — drift detection for Claude Code When Anthropic ships a model change, your reviewer pipeline can silently degrade. Golden eval catches this by running a fixed set of reference tasks weekly and comparing to a recorded baseline. Diff in cost, latency, or pass/fail → alarm. ## When to use **Triggers:** - User says "run golden eval", "check for drift", "did Claude get worse" - A scheduled run fires (Anthropic Routine via `/schedule`, or system cron) - Before/after a model upgrade announcement (check baseline holds) **When NOT to use:** - Ad-hoc one-off tests — golden eval is for *fixed* corpus over time - Replacing real tests — eval is a regression sentinel, not a substitute for unit/integration tests ## Architecture ``` tasks/ # fixed corpus, one .yaml per task T01-add-constant.yaml T02-fix-typo.yaml T03-refactor-fn.yaml ↓ runner.py # loads each task, executes, captures cost/latency/output ↓ baselines/<task>.json # baseline metrics from a known-good run ↓ report.json # current run vs baseline diffs ↓ notify.py # PushNotification if any task regressed > threshold ``` ## Task format Each task is a small YAML file in `~/.claude/skills/golden-eval/tasks/`: ```yaml # tasks/T01-add-constant.yaml id: T01 title: Add a constant to a config file input_prompt: > In the file ./test_fixture/config.py, add a constant MAX_RETRIES = 3 ju