eval-guide

Guide for running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison. Covers non-determinism baseline, recommended sample sizes, and result interpretation.
coder/agent-tty · ★ 4 · Code & Development · score 70

Install: claude install-skill coder/agent-tty

# Eval Guide

Use this guide when you are trying to answer **"did this skill or prompt change actually help?"** for `agent-tty` evals.

The short version: **do not trust a single run**. This eval stack now supports multi-trial sampling, parallel execution, trial aggregation, and paired baseline comparison because the underlying model behavior is noisy enough that one pass/fail result is not decision-grade.

## 1. What we learned about eval non-determinism

- Identical serial reruns showed a **~15-17% pass/fail flip rate** in practice, across both Codex and Claude runs.
- Scores moved even more often than hard pass/fail: **~30-39% of identical reruns changed score**.
- The movement was directionally balanced, which is the important point: this looked like **noise**, not a systematic drift up or down.
- Cross-provider checks reinforced that conclusion: in the parallel safety analysis, **Codex and Claude shared zero common regressions**, which is strong evidence that parallelism itself was not introducing consistent failures.
- Treat these findings as the baseline noise floor. If your "improvement" is smaller than that noise, it is not persuasive.

## 2. Run evals with enough statistical power

Always set `--trials` for real prompt or skill experiments.

Recommended trial counts:

- **Prompt lane:** `--trials 5` to `--trials 10`
- **Execution lane:** `--trials 3`
- **Dogfood lane:** `--trials 2` to `--trials 3`

Use concurrency to keep those sample sizes affordable:

- Start wit