auto-arena

Solid

Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.

AI & Automation 633 stars 54 forks Updated 3 days ago Apache-2.0

Quality Score: 92/100

Stars (20%): 93
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Auto Arena Skill

End-to-end automated model comparison using the OpenJudge `AutoArenaPipeline`:

1. **Generate queries**: LLM creates diverse test queries from task description
2. **Collect responses**: query all target endpoints concurrently
3. **Generate rubrics**: LLM produces evaluation criteria from task + sample queries
4. **Pairwise evaluation**: judge model compares every model pair (with position-bias swap)
5. **Analyze & rank**: compute win rates, win matrix, and rankings
6. **Report & charts**: Markdown report + win-rate bar chart + optional matrix heatmap

## Prerequisites

```bash
# Install OpenJudge
pip install py-openjudge

# Extra dependency for auto_arena (chart generation)
pip install matplotlib
```

## Gather from user before running

| Info | Required? | Notes |
|------|-----------|-------|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. `gpt-4`, `qwen-max`) |
| API keys | Yes | Env vars: `OPENAI_API_KEY`, `DASHSCOPE_API_KEY`, etc. |
| Number of queries | No | Default: `20` |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: `./evaluation_results` |
| Report language | No | `"zh"` (default) or `"en"` |

## Quick start

### CLI

```bash
# Run evaluation
python -m coo...
```
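
The pairwise evaluation and ranking steps (4 and 5 in the pipeline list) can be illustrated with a small standalone sketch. This is not the OpenJudge implementation and does not call `AutoArenaPipeline`. The names `compare_with_swap`, `rank_models`, and `length_judge` are hypothetical, the judge is a toy function standing in for a real judge model, and treating an order-dependent verdict as a tie is an assumption made here for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical judge signature: (query, first_response, second_response)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def length_judge(query: str, first: str, second: str) -> str:
    """Toy stand-in for a judge model: prefers the longer response."""
    if len(first) > len(second):
        return "first"
    if len(first) < len(second):
        return "second"
    return "tie"


def compare_with_swap(judge: Judge, query: str, resp_a: str, resp_b: str) -> float:
    """Score for A: 1.0 win, 0.0 loss, 0.5 for a tie or an inconsistent verdict.

    The judge is asked twice with the responses in swapped positions, so a
    judge that simply favors one position cannot produce a win.
    """
    verdict_ab = judge(query, resp_a, resp_b)  # A shown first
    verdict_ba = judge(query, resp_b, resp_a)  # B shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return 1.0
    if verdict_ab == "second" and verdict_ba == "first":
        return 0.0
    return 0.5  # assumption: disagreement between orders counts as a tie


def rank_models(
    responses: Dict[str, List[str]], queries: List[str], judge: Judge
) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float], List[str]]:
    """Return (win matrix, overall win rate per model, ranking)."""
    models = sorted(responses)
    wins = {a: {b: 0.0 for b in models if b != a} for a in models}
    for a, b in combinations(models, 2):
        for i, query in enumerate(queries):
            score = compare_with_swap(judge, query, responses[a][i], responses[b][i])
            wins[a][b] += score
            wins[b][a] += 1.0 - score
    n = len(queries)
    matrix = {a: {b: w / n for b, w in row.items()} for a, row in wins.items()}
    win_rate = {a: sum(row.values()) / len(row) for a, row in matrix.items()}
    ranking = sorted(win_rate, key=lambda m: win_rate[m], reverse=True)
    return matrix, win_rate, ranking


if __name__ == "__main__":
    queries = ["q1", "q2"]
    responses = {
        "model_a": ["long answer here", "ok"],
        "model_b": ["short", "ok"],
        "model_c": ["x", "y"],
    }
    matrix, win_rate, ranking = rank_models(responses, queries, length_judge)
    print(ranking)
    print(win_rate)
```

Each entry `matrix[a][b]` is the fraction of queries on which `a` beat `b` (ties count as half), and a model's overall win rate is the mean of its row. With real endpoints, the toy judge would be replaced by a call to the judge model.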

Details

Author: agentscope-ai
Repository: agentscope-ai/OpenJudge
Created: 10 months ago
Last Updated: 3 days ago
Language: Python
License: Apache-2.0

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

ref-hallucination-arena

Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode. Use when the user asks to evaluate, benchmark, or compare models on academic reference hallucination, literature recommendation quality, or citation accuracy.

633 Updated 3 days ago
agentscope-ai

AI & Automation Solid

openjudge

Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.

633 Updated 3 days ago
agentscope-ai

AI & Automation Solid

agent-arena

Use when complex AI agent work needs heterogeneous multi-agent debate, red teaming, evidence checking, judging, or synthesis across Codex, Claude Code, Hermes, OpenClaw, and other coding agents.

23 Updated yesterday
zhjai

AI & Automation Listed

auto-itera

Use when the user wants to autonomously search for the best AI/engineering approach across competing candidates (prompts, models, retrieval strategies, architectures, algorithms) — give it a goal + candidate arms + success threshold, it runs the experiment to a defensible ship-or-kill verdict. Autonomously handles sourcing real production data, scoring arms in parallel, diagnosing per-row, sprint-and-generalize iteration, and writing the conclusion doc. Built-in safeguards (held-out test discipline, variance-floor checks, generalization gates) keep the verdict trustworthy.

4 Updated yesterday
clfhaha1234

AI & Automation Listed

agent-evaluation

Evaluate LLM agents and tool-using workflows—task success, tool accuracy, latency/cost, safety, and regression suites. Use when shipping agent features, comparing prompts/models, or debugging agent failures.

15 Updated 2 days ago
charlieviettq