bt-tournament

Rank competing hypotheses with online Bradley-Terry updates and LUCB intervals. Replaces elo-select. Use whenever there are 3 or more candidate hypotheses competing for the next experiment, or when the user wants to understand which branch is currently leading.
whenpoem/aiscientist · ★ 6 · AI & Automation · score 73
Install: claude install-skill whenpoem/aiscientist
# BT Tournament

This skill is the V3.0 successor to `elo-select`. It runs an online Bradley-Terry tournament instead of one-shot Elo updates and exposes 95% confidence intervals so "low-confidence ties" stay visible.

## When to invoke

- Researcher subagent has just emitted >= 3 hypothesis nodes in one turn.
- The user explicitly typed `/bt-tournament`.
- The cockpit shows two candidates whose 95% intervals overlap and the user asks which to push first.

## Workflow

1. Gather the candidate hypothesis node ids and texts from `mcp__memory__get_active_frontier`.
2. For each pair you intend to compare, call `mcp__memory__judge_hypotheses` to fetch the canonical comparison prompt. Evaluate inline (do not spawn a sub-agent just to judge).
3. Decide a winner. Call `mcp__memory__record_judgement(a, b, winner, reason)`. Internally this dual-writes to the legacy Elo ledger AND the BT comparison ledger; you do not need to call `update_bt_rating` separately.
4. Pull the leaderboard via `mcp__memory__get_bt_leaderboard(top_k=10)`. Look at `strength`, `lcb`, `ucb`, `n_comparisons`, and `insufficient_samples`.
5. Decide whether to run another comparison. Stop when:
   - The top-2 LCB / UCB intervals no longer overlap (clear winner), OR
   - At least 3 comparisons have been logged for every candidate, OR
   - The user already approved a target.
6. Hand off the top-2 to the engineer. Quote each hypothesis's BT strength and the 95% interval. If `insufficient_samples` is true for any winner,