model-benchmarkinglisted
Install: claude install-skill RBraga01/builder-ai
# Model Benchmarking
## The Law
```
A MODEL IS NOT CHOSEN UNTIL IT HAS BEEN TESTED ON YOUR TASK DATA.
Leaderboard scores are averaged over tasks you're not building.
"It's the best model" is a claim about benchmarks someone else ran.
Task data + defined metric + three tested tiers IS a model selection.
```
## When to Use
Trigger when:
- Choosing a model for any new production feature
- Considering a switch to a newer, cheaper, or faster model
- Validating that a smaller model can replace a larger one
- Comparing providers (OpenAI, Anthropic, Mistral, local)
## When NOT to Use
- Initial feasibility exploration (before task definition is stable) — benchmark when you know what you're measuring
- Model compatibility checks (does this model support tool use, JSON mode, etc.) — that's a capability query, not a benchmark
- Leaderboard research to narrow the candidate list — that's input to Stage 3, not a substitute for it
## The Process
Four stages. Do not collapse them.
### Stage 1 — Define the Bar
Before running any model, write down:
| Decision | What to Define |
|---|---|
| Task | Exact input format, exact output format, edge cases |
| Metric | Accuracy, faithfulness score, extraction F1, LLM-as-judge, task pass rate |
| Pass threshold | The minimum score to go to production (e.g., ≥ 88%) |
| Budget constraint | Max cost per 1k calls, max monthly spend |
| Latency constraint | Max acceptable p95 (e.g., ≤ 2.0s) |
If you cannot define the metric and threshold first, th