write-eval

Write a live eval for new or changed runner/agent behavior using red/green TDD plus a falsification check that proves the eval fails when the behavior is broken. Use whenever you add or modify behavior that should be covered by an eval, when asked to "write an eval", "add an eval", "cover this with an eval", or after landing a feature that needs end-to-end proof it works.
dzhng/duet-agent · ★ 34 · AI & Automation · score 85
Install: claude install-skill dzhng/duet-agent
# Write an Eval

An eval is only trustworthy if you have seen it both **fail for the right reason** and **pass for the right reason**. Writing the assertions, watching them go green once, and moving on is how you ship an eval that passes whether or not the feature works.

The standard operating procedure is: pick the outermost entry point, design the assertion so it can only hold when the behavior is present, watch it go green, then **falsify**: break the production code, confirm the eval goes red with a diagnostic that points at the real path, and restore.

This is the flow used to land `evals/state-machine-slash-skill-expansion.eval.ts`; read it as the reference implementation for a **deterministic wiring** eval (the feature either injects the right context or it doesn't).

When the behavior under test is a **model tendency** rather than deterministic wiring ("the planner doesn't over-reach into implementation", "the sub-agent doesn't drift into chat mode", anything a prompt layer nudges but cannot guarantee), the single-run flow is not enough, because one run is a coin flip. Read `evals/state-machine-agent-stays-in-state-scope.eval.ts` as the reference for that shape, and follow §6 below in addition to §1–5.

## 1. Drive the outermost entry point

Per AGENTS.md and the review skill (§13): test behavior through the surface a user actually hits, not internal helpers.

- A unit test on the pure function (e.g. `test/skill-context-resolve.test.ts` for `resolveSlashSkillPrompt`