eval

Solid

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

AI & Automation · 16,782 stars · 2,310 forks · Updated 3 days ago · MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# /hub:eval - Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

## Usage

```
/hub:eval                    # Eval latest session using configured criteria
/hub:eval 20260317-143022    # Eval specific session
/hub:eval --judge            # Force LLM judge mode (ignore metric config)
```

## What It Does

### Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Output:

```
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1

Winner: agent-2 (142ms)
```

### LLM Judge Mode (no eval command, or --judge flag)

For each agent:

1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
   - **Correctness**: Does it solve the task?
   - **Simplicity**: Fewer lines changed is better (when equal correctness)
   - **Quality**: Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:

```
RANK AGENT VERDICT WORD COUNT
1 age...
```
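The metric-mode ranking described above can be sketched in a few lines of Python. This is not the code in `result_ranker.py`, which is not shown here; it only illustrates sorting by a metric in a chosen direction and reporting each agent's delta. The names `rank_results` and `RankedResult`, the direction spellings `"lower"` and `"higher"`, and the 180ms baseline (inferred from the deltas in the sample output) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RankedResult:
    rank: int
    agent: str
    metric: float
    delta: float  # metric minus baseline; negative means below baseline


def rank_results(metrics, baseline, direction="lower"):
    """Rank agents by metric value (hypothetical helper, not the skill's API).

    metrics:   mapping of agent name -> measured metric value
    baseline:  metric value of the base branch (assumed to be known)
    direction: "lower" or "higher" (assumed spellings) for which way is better
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    # Best result first: ascending when lower is better, descending otherwise.
    ordered = sorted(
        metrics.items(),
        key=lambda item: item[1],
        reverse=(direction == "higher"),
    )
    return [
        RankedResult(rank=i, agent=agent, metric=value, delta=value - baseline)
        for i, (agent, value) in enumerate(ordered, start=1)
    ]


# Values taken from the sample metric-mode output; baseline is an assumption.
ranking = rank_results(
    {"agent-1": 165, "agent-2": 142, "agent-3": 190},
    baseline=180,
)
for r in ranking:
    print(f"{r.rank}  {r.agent}  {r.metric}ms  {r.delta:+}ms")
print(f"Winner: {ranking[0].agent} ({ranking[0].metric}ms)")
```

Making the direction an explicit parameter lets the same sort serve latency-style metrics, where smaller is better, and score-style metrics, where larger is better.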

Details

Author: alirezarezvani
Repository: alirezarezvani/claude-skills
Created: 7 months ago
Last Updated: 3 days ago
Language: Python
License: MIT

Integrates with

OpenAI · AI
Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

eval-agent

Run evaluation tests against an agent to assess quality and archetype resistance

142 Updated yesterday

jmagly

AI & Automation Solid

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

201,447 Updated yesterday

affaan-m

AI & Automation Listed

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

0 Updated yesterday

Methasit-Pun

AI & Automation Listed

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

0 Updated 2 days ago

goharabbas321

AI & Automation Listed

agent-evaluation

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", "implement LLM-as-judge", "compare model outputs", "mitigate evaluation bias", or mentions multi-dimensional evaluation, agent testing, quality gates, direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment for LLM agent systems. NOT for testing code or applications (use testing-framework), NOT for agent coordination or multi-agent design (use multi-agent-patterns).

9 Updated 2 days ago

viktorbezdek