
test-effectiveness-auditor

Quantitatively measure how effective a project's automated tests are at catching real bugs. Use this skill when: (1) the user asks 'how good are our tests?', 'do our tests actually catch bugs?', 'measure test effectiveness', or 'audit our test suite'; (2) a team has anecdotal impressions about test quality but no data; (3) before investing in more tests, to identify which gaps matter most; (4) after an incident slipped through CI, to understand whether the test suite should have caught it; (5) when evaluating whether a CI pipeline is paying for itself. Produces a report at ~/Documents/<project>_test_effectiveness_audit.md with per-incident catch rates, a classified gap list, and targeted recommendations. Read-only relative to project source — does not modify code or auto-write tests.
wan-huiyan/claude-ecosystem-hygiene · ★ 0 · Data & Documents · score 70
Install: claude install-skill wan-huiyan/claude-ecosystem-hygiene
# Test Effectiveness Auditor v1.0

Answers the question: **how helpful are our automated tests at catching bugs?** Not by proxy metrics like coverage percent or test count, but by replaying real bugs that already happened and checking whether the test suite, as it stood just before the fix, actually failed on the buggy commit.

The honest baseline for "are tests worth it" is historical: bugs that made it to production despite the tests are the direct evidence of gaps; CI failures that forced a code change before merge are the direct evidence of catches. Everything else is speculation.

## Why this matters

Most teams measure test health by coverage % (e.g. `pytest --cov`). Coverage tells you which lines executed, not whether any assertion would have *failed* when the behavior was wrong. A line can be 100% covered by a test that would pass under the bug. This audit inverts the question: take known bugs, rewind to the pre-fix commit, and observe whether the suite catches them.

Two methods, in priority order:

1. **Historical incident replay** (primary signal): for each documented bug, check out the pre-fix SHA in a worktree, run the suite, observe pass/fail, and classify.
2. **CI history analysis** (secondary signal): pull CI runs that forced a pre-merge change, classify by whether the failure represented a real logic/data/integration catch vs. noise (lint, formatting, flaky).

Mutation testing, test-layer ablation, and coverage-delta analysis are deliberately NOT in scope for