ai-reliability-eval
Install: claude install-skill arcasilesgroup/ai-engineering
# Reliability Eval
## Purpose
Eval-Driven Development (EDD) treats evals as the unit tests of AI development. Define pass/fail criteria before writing code. Measure AI reliability with pass@k metrics. Track regressions across prompt, agent, and model changes. Evals answer the question: "Can the AI do this reliably?"
**Key distinction**: `ai-verify` checks current code quality (linting, coverage, security). `ai-reliability-eval` measures AI reliability over time (can the agent complete this task consistently?).
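As a rough illustration of the pass@k metric mentioned above, the sketch below uses the widely used unbiased estimator: given `n` sampled attempts of which `c` passed, it estimates the probability that at least one of `k` attempts succeeds. The function name `pass_at_k` and its signature are assumptions for illustration, not part of this skill's API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts of which c passed.

    Hypothetical helper: computes 1 - C(n - c, k) / C(n, k), i.e. one minus
    the probability that a random draw of k attempts contains no passes.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 whenever
    # any draw of k attempts must include at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k = 1` the estimate reduces to the plain pass rate `c / n`, which is the quantity a pass@1 gate compares.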
## When to Use
- `define`: defining pass/fail criteria before implementation (EDD principle)
- `check`: running current evals and reporting status mid-implementation
- `report`: generating full eval report after implementation
- `regression`: ensuring changes to prompts, agents, or models don't break existing capabilities
- `--skill-set`: skill-set mode. Runs the optimizer over each skill's eval corpus at `.ai-engineering/evals/<skill>.jsonl` and gates pass@1 against `.ai-engineering/evals/baseline.json`. Combine with `--regression` to fail when pass@1 drops by more than 5 percentage points (sub-007 M6, D-127-07). Wired into `.github/workflows/skill-evals.yml` on PRs touching `.agents/skills/**`.
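The regression gate described for `--skill-set` can be sketched as a comparison of per-skill pass@1 against a baseline. This is only an illustration under stated assumptions: the real schema of `baseline.json` is not shown here, so the sketch assumes a flat mapping of skill name to pass@1 as a fraction in [0, 1], and `find_regressions` is a hypothetical helper name.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop_pp: float = 5.0,
) -> dict[str, float]:
    """Return {skill: drop in percentage points} for skills that regressed.

    Assumption: both mappings hold pass@1 as fractions in [0, 1]. A skill is
    flagged only when its drop is strictly greater than max_drop_pp.
    """
    regressions: dict[str, float] = {}
    for skill, base_rate in baseline.items():
        if skill not in current:
            continue  # skill was not evaluated in this run
        # Round to absorb floating-point noise so an exact 5 pp drop
        # is not reported as exceeding the threshold.
        drop_pp = round((base_rate - current[skill]) * 100.0, 6)
        if drop_pp > max_drop_pp:
            regressions[skill] = drop_pp
    return regressions
```

A CI job could load both mappings with `json.load` and exit non-zero when the returned dictionary is non-empty. Skipping skills that are missing from the current run is a design choice in this sketch; a stricter gate might treat a missing skill as a failure.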
## Process
### Mode: define (Before Coding)
1. Identify the capability being built or changed
2. Write capability evals (can the AI do this new thing?)
3. Write regression evals (do existing things still work?)
4. Set success metrics (pass@k targets)
5. Store eval definition at `.ai-engineering/evals