Install: claude install-skill coder/agent-tty
# Eval Guide
Use this guide when you are trying to answer **"did this skill or prompt change actually help?"** for `agent-tty` evals.
The short version: **do not trust a single run**. This eval stack now supports multi-trial sampling, parallel execution, trial aggregation, and paired baseline comparison because the underlying model behavior is noisy enough that one pass/fail result is not decision-grade.
## 1. What we learned about eval non-determinism
- Identical serial reruns showed a **~15-17% pass/fail flip rate** in practice, across both Codex and Claude runs.
- Scores moved even more often than hard pass/fail: **~30-39% of identical reruns changed score**.
- The movement was directionally balanced, which is the important point: this looked like **noise**, not a systematic drift up or down.
- Cross-provider checks reinforced that conclusion: in the parallel safety analysis, **Codex and Claude shared zero common regressions**, which is strong evidence that parallelism itself was not introducing consistent failures.
- Treat these findings as the baseline noise floor. If your "improvement" is smaller than that noise, it is not persuasive.
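To measure the noise floor for your own suite, compare two identical reruns case by case. The sketch below is illustrative only: `RunResult`, `rerun_noise`, and the sample data are hypothetical names and values, not part of the `agent-tty` eval tooling, and it assumes each result carries a case id, a pass/fail flag, and a numeric score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One eval case outcome (hypothetical shape, not the real result schema)."""
    case_id: str
    passed: bool
    score: float


def rerun_noise(first, second):
    """Return (flip_rate, score_change_rate) for two identical reruns.

    Only cases present in both runs are compared.
    """
    second_by_id = {r.case_id: r for r in second}
    pairs = [(a, second_by_id[a.case_id]) for a in first if a.case_id in second_by_id]
    if not pairs:
        raise ValueError("the two runs share no cases")
    flips = sum(a.passed != b.passed for a, b in pairs)
    moved = sum(a.score != b.score for a, b in pairs)
    return flips / len(pairs), moved / len(pairs)


# Made-up data: same config, run twice.
run_a = [
    RunResult("a", True, 1.0),
    RunResult("b", True, 0.8),
    RunResult("c", False, 0.2),
    RunResult("d", True, 1.0),
]
run_b = [
    RunResult("a", True, 1.0),
    RunResult("b", False, 0.4),
    RunResult("c", False, 0.3),
    RunResult("d", True, 1.0),
]

flip_rate, score_change_rate = rerun_noise(run_a, run_b)
print(f"flip rate: {flip_rate:.0%}, score change rate: {score_change_rate:.0%}")
```

If the flip rate you measure this way is close to the effect you hope to demonstrate, a single before/after comparison cannot separate the two.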
## 2. Run evals with enough statistical power
Always set `--trials` for real prompt or skill experiments.
Recommended trial counts:
- **Prompt lane:** `--trials 5` to `--trials 10`
- **Execution lane:** `--trials 3`
- **Dogfood lane:** `--trials 2` to `--trials 3`
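To see why trial count matters, it can help to look at how uncertain a pass rate is for a given number of trials. The sketch below computes a Wilson score interval. It assumes each trial is an independent pass/fail draw, which is a simplification, and `wilson_interval` is a hypothetical helper, not something the eval stack provides.

```python
import math


def wilson_interval(passes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate.

    Assumes independent trials with a fixed underlying pass probability.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    p = passes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half, center + half


# Interval width for a case that passed every trial, at different trial counts.
for n in (1, 3, 5, 10):
    low, high = wilson_interval(n, n)
    print(f"{n}/{n} passes: plausible pass rate {low:.2f} to {high:.2f}")
```

Under these assumptions, a single passing run is compatible with a very wide range of true pass rates, and the range tightens as trials are added. Pooling trials across many cases narrows it further.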
Use concurrency to keep those sample sizes affordable:
- Start wit