← ClaudeAtlas

model-benchmarkinglisted

Use when selecting a model for any production feature, or evaluating whether to switch models. Requires task-specific benchmarking — not leaderboard lookup. Blocks "GPT-4 is the best model" decisions.
RBraga01/builder-ai · ★ 2 · AI & Automation · score 68
Install: claude install-skill RBraga01/builder-ai
# Model Benchmarking ## The Law ``` A MODEL IS NOT CHOSEN UNTIL IT HAS BEEN TESTED ON YOUR TASK DATA. Leaderboard scores are averaged over tasks you're not building. "It's the best model" is a claim about benchmarks someone else ran. Task data + defined metric + three tested tiers IS a model selection. ``` ## When to Use Trigger when: - Choosing a model for any new production feature - Considering a switch to a newer, cheaper, or faster model - Validating that a smaller model can replace a larger one - Comparing providers (OpenAI, Anthropic, Mistral, local) ## When NOT to Use - Initial feasibility exploration (before task definition is stable) — benchmark when you know what you're measuring - Model compatibility checks (does this model support tool use, JSON mode, etc.) — that's a capability query, not a benchmark - Leaderboard research to narrow the candidate list — that's input to Stage 3, not a substitute for it ## The Process Four stages. Do not collapse them. ### Stage 1 — Define the Bar Before running any model, write down: | Decision | What to Define | |---|---| | Task | Exact input format, exact output format, edge cases | | Metric | Accuracy, faithfulness score, extraction F1, LLM-as-judge, task pass rate | | Pass threshold | The minimum score to go to production (e.g., ≥ 88%) | | Budget constraint | Max cost per 1k calls, max monthly spend | | Latency constraint | Max acceptable p95 (e.g., ≤ 2.0s) | If you cannot define the metric and threshold first, th