calibrate

Codex-native calibration loop. Use to detect leaks or major gaps across mirrored skills and agents with fixed checks plus behavioral recall, precision, and confidence-accuracy scoring.
Borda/AI-Rig · ★ 19 · AI & Automation · score 71

Install: claude install-skill Borda/AI-Rig

# Calibrate

Run a linear calibration loop for codex workflow integrity and behavioral scoring.

## Input Schema

```json
{
  "scope": "skills|agents|routing|all",
  "pace": "fast|full",
  "mode": "ab-test|apply",
  "skip_gate": false,
  "done_when": "recall and bias scores emitted; proposals written if mode=apply; gate skipped if skip_gate=true"
}
```

## Workflow

01. Load calibration task set from `.codex/calibration/tasks.json`.
02. Load behavioral cases from `.codex/calibration/behavioral-cases.json`.
03. Load behavioral observations from `.codex/calibration/behavioral-observations.jsonl`.
    - Require `source`, `run_id`, and `observed_at` on each observation where available.
04. Run `.codex/calibration/run.sh`.
05. Inspect `checks_failed`, `leaks_found`, and `behavioral`.
06. Review behavioral metrics:
    - `recall`: expected finding IDs recovered from known cases.
    - `precision`: reported finding IDs that match expected finding IDs.
    - `confidence_accuracy`: `1 - mean(abs(confidence - per-case F1))`.
    - `mean_overconfidence`: average positive confidence bias over per-case F1.
    - `gate_metrics_raw`: unrounded overall values used for pass/fail thresholds.
    - `by_source`: recall, precision, and confidence calibration grouped by observation source.
    - `observation_freshness`: latest `observed_at`, missing timestamp count, and live-vs-fixture observation counts.
07. Classify gaps as blocking or non-blocking.
08. Emit measured recommendations for what sho