auto-arena
Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.
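The last step of that pipeline, turning pairwise judge verdicts into a win-rate ranking, can be sketched in plain Python. This is a minimal illustration, not the skill's or OpenJudge's actual code: the `win_rates` function, the `(model_a, model_b, winner)` tuple layout, the half-point tie rule, and the endpoint names are all assumptions made for the example.

```python
from collections import defaultdict


def win_rates(endpoints, judgments):
    """Rank endpoints by win rate from pairwise judge verdicts.

    `judgments` is a list of (model_a, model_b, winner) tuples, where
    `winner` is one of the two names, or None for a tie. A tie counts
    as half a win for each side (an assumption of this sketch).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in judgments:
        games[model_a] += 1
        games[model_b] += 1
        if winner is None:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
        else:
            wins[winner] += 1.0
    # Endpoints with no comparisons get a win rate of 0.0.
    rates = {
        name: (wins[name] / games[name] if games[name] else 0.0)
        for name in endpoints
    }
    # Highest win rate first.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical verdicts for three hypothetical endpoints.
judgments = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_c", "model_a"),
    ("model_b", "model_c", None),
    ("model_a", "model_b", "model_b"),
]
ranking = win_rates(["model_a", "model_b", "model_c"], judgments)
for name, rate in ranking:
    print(f"{name}: {rate:.2f}")
```

Because each verdict is an independent record, a new endpoint can be ranked by appending its comparisons to the list and recomputing, which is one plausible way to picture the incremental endpoint addition mentioned above.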
Quality Score: 92/100
Details
- Author: agentscope-ai
- Repository: agentscope-ai/OpenJudge
- Created: 10 months ago
- Last Updated: 3 days ago
- Language: Python
- License: Apache-2.0
Similar Skills
Semantically similar based on skill content — not just same category
ref-hallucination-arena
Benchmark LLM reference recommendation capabilities by verifying every cited paper against Crossref, PubMed, arXiv, and DBLP. Measures hallucination rate, per-field accuracy (title/author/year/DOI), discipline breakdown, and year constraint compliance. Supports tool-augmented (ReAct + web search) mode. Use when the user asks to evaluate, benchmark, or compare models on academic reference hallucination, literature recommendation quality, or citation accuracy.
openjudge
Build custom LLM evaluation pipelines using the OpenJudge framework. Covers selecting and configuring graders (LLM-based, function-based, agentic), running batch evaluations with GradingRunner, combining scores with aggregators, applying evaluation strategies (voting, average), auto-generating graders from data, and analyzing results (pairwise win rates, statistics, validation metrics). Use when the user wants to evaluate LLM outputs, compare multiple models, design scoring criteria, or build an automated evaluation system.
agent-arena
Use when complex AI agent work needs heterogeneous multi-agent debate, red teaming, evidence checking, judging, or synthesis across Codex, Claude Code, Hermes, OpenClaw, and other coding agents.
auto-itera
Use when the user wants to autonomously search for the best AI/engineering approach across competing candidates (prompts, models, retrieval strategies, architectures, algorithms) — give it a goal + candidate arms + success threshold, it runs the experiment to a defensible ship-or-kill verdict. Autonomously handles sourcing real production data, scoring arms in parallel, diagnosing per-row, sprint-and-generalize iteration, and writing the conclusion doc. Built-in safeguards (held-out test discipline, variance-floor checks, generalization gates) keep the verdict trustworthy.
agent-evaluation
Evaluate LLM agents and tool-using workflows—task success, tool accuracy, latency/cost, safety, and regression suites. Use when shipping agent features, comparing prompts/models, or debugging agent failures.