benchmarklisted

Measure recall / precision / false-proof rate of the pipeline against a ground-truth manifest. Scores either the bundled planted-vulnerability corpus (regression) or a live run's findings.json against a manifest you supply. Deterministic — no agent, no network. Use to prove a change to the producers helps rather than hurts.
allsmog/kuzushi-security-plugin · ★ 0 · AI & Automation · score 61

Install: claude install-skill allsmog/kuzushi-security-plugin

# Benchmark You can't call bug-finding "world-class" — or catch a regression in it — without a number. `/benchmark` scores findings against ground truth and reports the three metrics that matter: **recall** (are we missing bugs?), **precision** (do we cry wolf?), and **falseProofRate** (did we *prove* a non-bug? — the soundness failure differential testing guards). ## Run it - **Bundled corpus (regression):** `node "${CLAUDE_PLUGIN_ROOT}/scripts/cmd/benchmark.mjs"` scores every case under `bench/cases/` using its recorded `findings.json`. Add `--case <name>` for one case. - **A live run:** `node "${CLAUDE_PLUGIN_ROOT}/scripts/cmd/benchmark.mjs" --target "<repo>" --ground-truth "<manifest.json>"` scores `<repo>/.kuzushi/findings.json` after you've run the pipeline. Flags: `--strict` (an active finding matching no expectation counts as a false positive — only fair when the manifest is exhaustive), `--line-tolerance N` (default 5), `--no-match-cwe` (match on file+line only). ## Ground-truth manifest `{ "expectations": [ { "id", "kind": "vuln" | "safe", "cwe", "filePath", "line" } ] }`. A `vuln` is a real bug the tool **should** find; a `safe` is a decoy that looks like one and **must not** be flagged. A decoy that gets an active finding is a false positive; a decoy that gets a *proven* finding is a false proof. Author manifests from confirmed bugs (and their guarded siblings) so the corpus encodes both recall and precision pressure. ## Reading the result `corpu