Install: `claude install-skill neuralforge-labs/tlmforge`
# Golden eval — drift detection for Claude Code
When Anthropic ships a model change, your reviewer pipeline can silently degrade. Golden
eval catches this by running a fixed set of reference tasks weekly and comparing each run
to a recorded baseline. A regression in cost, latency, or pass/fail status raises an alarm.
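The comparison step can be sketched roughly as follows. This is a minimal illustration rather than the skill's actual code: the field names (`cost_usd`, `latency_s`, `passed`), the `find_regressions` helper, and the 20% tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Metrics for one task run. Field names are hypothetical."""
    cost_usd: float
    latency_s: float
    passed: bool


def find_regressions(baseline: Metrics, current: Metrics, tolerance: float = 0.20) -> list[str]:
    """Return human-readable reasons the current run regressed against the baseline.

    A numeric metric counts as regressed when it exceeds the baseline by more
    than `tolerance` (a fraction; 0.20 means 20%). The default is an assumption.
    """
    reasons = []
    # A task that used to pass and now fails is always a regression.
    if baseline.passed and not current.passed:
        reasons.append("pass -> fail")
    # Cost and latency only regress when they grow past the tolerance band.
    for name in ("cost_usd", "latency_s"):
        old, new = getattr(baseline, name), getattr(current, name)
        if new > old * (1 + tolerance):
            reasons.append(f"{name}: {old:g} -> {new:g}")
    return reasons


baseline = Metrics(cost_usd=0.010, latency_s=4.0, passed=True)
current = Metrics(cost_usd=0.011, latency_s=6.5, passed=True)
print(find_regressions(baseline, current))  # ['latency_s: 4 -> 6.5']
```

The check is one-sided on purpose: a run that gets cheaper or faster is not flagged, only one that gets worse by more than the tolerance.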
## When to use
**Triggers:**
- User says "run golden eval", "check for drift", "did Claude get worse"
- A scheduled run fires (Anthropic Routine via `/schedule`, or system cron)
- Before and after an announced model upgrade, to check that the baseline still holds
**When NOT to use:**
- Ad-hoc one-off tests: golden eval tracks a *fixed* corpus over time
- Replacing real tests: eval is a regression sentinel, not a substitute for unit/integration tests
## Architecture
```
tasks/                   # fixed corpus, one .yaml per task
  T01-add-constant.yaml
  T02-fix-typo.yaml
  T03-refactor-fn.yaml
    ↓
runner.py                # loads each task, executes, captures cost/latency/output
    ↓
baselines/<task>.json    # baseline metrics from a known-good run
    ↓
report.json              # current run vs baseline diffs
    ↓
notify.py                # PushNotification if any task regressed > threshold
```
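To make the data flow concrete, here is a hedged sketch of the runner stage. It is not the real `runner.py`: the `run_corpus` function, the injected `execute` callable, the metric keys, and the choice to name baseline files after the task file stem are all assumptions for illustration.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable


def run_corpus(tasks_dir: Path, baselines_dir: Path, execute: Callable[[Path], dict]) -> dict:
    """Run every task file and pair its metrics with the stored baseline.

    `execute` is a hypothetical callable that runs one task file and returns
    a dict such as {"passed": True, "cost_usd": 0.01}. Latency is measured here.
    """
    report = {}
    for task_file in sorted(tasks_dir.glob("*.yaml")):
        start = time.perf_counter()
        result = dict(execute(task_file))
        result["latency_s"] = round(time.perf_counter() - start, 3)

        # Assumption: the baseline file shares the task file's stem.
        baseline_file = baselines_dir / f"{task_file.stem}.json"
        baseline = json.loads(baseline_file.read_text()) if baseline_file.exists() else None
        report[task_file.stem] = {"current": result, "baseline": baseline}
    return report


# Demo with a stub executor and a throwaway directory layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tasks").mkdir()
    (root / "baselines").mkdir()
    (root / "tasks" / "T01-add-constant.yaml").write_text("id: T01\n")
    (root / "baselines" / "T01-add-constant.json").write_text(
        json.dumps({"passed": True, "cost_usd": 0.01, "latency_s": 4.0})
    )
    report = run_corpus(
        root / "tasks",
        root / "baselines",
        lambda _path: {"passed": True, "cost_usd": 0.01},
    )
    (root / "report.json").write_text(json.dumps(report, indent=2))
```

Passing the executor in as a callable keeps the loop testable without calling a model. A task with no baseline file gets `None`, which a first run could use as the cue to record one.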
## Task format
Each task is a small YAML file in `~/.claude/skills/golden-eval/tasks/`:
```yaml
# tasks/T01-add-constant.yaml
id: T01
title: Add a constant to a config file
input_prompt: >
In the file ./test_fixture/config.py, add a constant MAX_RETRIES = 3 ju