write-eval

Solid

Write a live eval for new or changed runner/agent behavior using red/green TDD plus a falsification check that proves the eval fails when the behavior is broken. Use whenever you add or modify behavior that should be covered by an eval, when asked to "write an eval", "add an eval", "cover this with an eval", or after landing a feature that needs end-to-end proof it works.

AI & Automation · 41 stars · 4 forks · Updated 2 days ago · Apache-2.0

Quality Score: 87/100

Score weights: Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%.

Skill Content

# Write an Eval

An eval is only trustworthy if you have seen it both **fail for the right reason** and **pass for the right reason**. Writing the assertions, watching them go green once, and moving on is how you ship an eval that passes whether or not the feature works.

The standard operating procedure is: pick the outermost entry point, design the assertion so it can only hold when the behavior is present, watch it go green, then **falsify**: break the production code, confirm the eval goes red with a diagnostic that points at the real path, and restore.

This is the flow used to land `evals/state-machine-slash-skill-expansion.eval.ts`; read it as the reference implementation for a **deterministic wiring** eval (the feature either injects the right context or it doesn't).

When the behavior under test is a **model tendency** rather than deterministic wiring ("the planner doesn't over-reach into implementation", "the sub-agent doesn't drift into chat mode", anything a prompt layer nudges but cannot guarantee), the single-run flow is not enough, because one run is a coin flip. Read `evals/state-machine-agent-stays-in-state-scope.eval.ts` as the reference for that shape, and follow §6 below in addition to §1–5.

## 1. Drive the outermost entry point

Per AGENTS.md and the review skill (§13): test behavior through the surface a user actually hits, not internal helpers.

- A unit test on the pure function (e.g. `test/skill-context-resolve.test.ts` for `resolveSlashSkillPrompt`...
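
The green-then-falsify loop described in this skill can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `Runner`, `workingRunner`, `brokenRunner`, `evalSlashSkillExpansion`, and `passRate` are stand-ins invented for illustration and are not taken from the duet-agent repository. In the real flow you would temporarily break the production code and then restore it; here a second, deliberately broken implementation simulates that step. The `passRate` helper shows one generic way to treat a model tendency as a rate over several runs instead of a single pass/fail; it is an assumption, not the procedure from the skill's §6, which is not shown here.

```typescript
// Hypothetical stand-ins: none of these names come from the duet-agent repo.
type Runner = (prompt: string) => string;

// "Production" behavior: a /review prompt gets skill context injected.
const workingRunner: Runner = (prompt) =>
  prompt.startsWith("/review") ? `[skill:review context]\n${prompt}` : prompt;

// Deliberately broken variant, standing in for "break the production code".
const brokenRunner: Runner = (prompt) => prompt;

interface EvalResult {
  passed: boolean;
  diagnostic: string;
}

// The assertion can only hold when the injected context is actually present.
function evalSlashSkillExpansion(run: Runner): EvalResult {
  const output = run("/review check the diff");
  const passed = output.includes("[skill:review context]");
  return {
    passed,
    diagnostic: passed
      ? "skill context was injected"
      : `expected skill context in runner output, got: ${JSON.stringify(output)}`,
  };
}

// Green: the eval passes against the working behavior.
const green = evalSlashSkillExpansion(workingRunner);

// Falsify: the same eval must go red against the broken behavior,
// and the diagnostic should name what is missing.
const red = evalSlashSkillExpansion(brokenRunner);

// For a model tendency, one run is not evidence; measure a rate over many runs.
function passRate(trial: (runIndex: number) => boolean, runs: number): number {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (trial(i)) passes++;
  }
  return passes / runs;
}

// Deterministic fake of a flaky behavior: fails on every fifth run.
const flakyTrial = (runIndex: number): boolean => runIndex % 5 !== 0;
const rate = passRate(flakyTrial, 10);

console.log({ green: green.passed, red: red.passed, diagnostic: red.diagnostic, rate });
```

If `red.passed` ever came back `true`, the assertion would be passing regardless of the behavior, which is exactly the failure mode the falsification step exists to catch.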

Details

Author: dzhng
Repository: dzhng/duet-agent
Created: 3 months ago
Last Updated: 2 days ago
Language: TypeScript
License: Apache-2.0

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

eval-writer

Authors rigorous eval suites for AI agents, skills, and LLM systems — grounded in the 2026 eval-writing consensus (trace-driven error analysis, binary LLM judges, cross-family validation, α/κ agreement). Produces characterization, failure taxonomies, judge prompts, rubrics, and calibration protocols that harnesses (pmo-skill-refiner, CI) then execute. Two modes — Author (write from scratch) and Review (audit against the framework). First-class playbooks for per-skill evals and for pipeline stage-gate judgment content; generic fallback for arbitrary AI systems. Use whenever the user asks to write evals, audit evals, add eval coverage, calibrate a judge, build a rubric, write a judge prompt, or diagnose why a judge keeps passing broken outputs.

0 Updated today

cody-hutson

AI & Automation Solid

eval-skills

Eval and improve a skill against golden cases — run the target skill blind in a fresh, context-free subagent on each example input, grade the artifact against the expected outcome, and let the gaps drive the edits. Use when the user wants to test/eval/improve/harden a skill, says "this skill keeps producing X / keeps missing Y", or hands a skill plus example input→expected-output pairs. Pairs with [write-skills](../write-skills/SKILL.md) (the authoring principles every fix obeys).

41 Updated 2 days ago

dzhng

AI & Automation Listed

build-agent-evals

Build automated evaluations for an AI agent from scratch: collecting tasks from real failures, choosing code/model/human graders, picking pass@k vs pass^k, building an isolated harness, and keeping the suite honest over time. Use this whenever someone wants to measure, benchmark, or regression-test an agent, write an eval harness for an LLM agent, decide how to grade non-deterministic output, set up an LLM-as-judge, or asks any version of "how do I know if my agent is actually getting better." Trigger even when they say "tests for my agent," "eval set," or "agent benchmark" rather than the word "evals," or when they ask about benchmark contamination or a model recognizing the eval. Not for container or resource limits making scores flaky across runs; that's calibrate-eval-infrastructure.

1 Updated 5 days ago

Hoja-Solutions