agent-eval

Solid

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

AI & Automation · 201,447 stars · 30,903 forks · Updated yesterday · MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%)
Documentation (15%): 100
Issue Health (10%)
License (10%): 100
Description (5%): 100

Skill Content

# Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes; this tool systematizes it.

## When to Activate

- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team

## Installation

> **Note:** Install agent-eval from its repository after reviewing the source.

## Core Concepts

### YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```

### Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. This isolates runs for reproducibility: agents cannot interfere with each other or corrupt the base repo.

### Metrics Collected

| Metric | What It Measures |
|-...
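
To make the worktree isolation and the `grep` judge described under Core Concepts more concrete, here is a minimal Python sketch. It is not agent-eval's actual implementation: the function names (`worktree_add_command`, `create_worktree`, `grep_judge`) and the destination path are hypothetical, and only the `git worktree add --detach` command and the task fields (`repo`, `commit`, `pattern`, `files`) come from git itself and from the YAML example.

```python
import re
import subprocess
from pathlib import Path


def worktree_add_command(repo: str, dest: str, commit: str) -> list[str]:
    """Build the git call that checks out `commit` into a fresh worktree at `dest`."""
    # --detach avoids creating or moving a branch for a throwaway checkout.
    return ["git", "-C", repo, "worktree", "add", "--detach", dest, commit]


def create_worktree(repo: str, dest: str, commit: str) -> Path:
    """Create the worktree on disk; raises CalledProcessError if git fails."""
    subprocess.run(worktree_add_command(repo, dest, commit), check=True)
    return Path(dest)


def grep_judge(worktree: Path, pattern: str, file: str) -> bool:
    """Hypothetical stand-in for a `grep` judge: True if `pattern` matches the file."""
    text = (worktree / file).read_text(encoding="utf-8")
    return re.search(pattern, text) is not None
```

Under these assumptions, a runner would call `create_worktree("./my-project", "/tmp/agent-eval/aider", "abc1234")` once per agent, let the agent edit files inside that directory, and then apply `grep_judge(worktree, "exponential_backoff|retry", "src/http_client.py")` alongside the pytest judge. Because each agent works in its own checkout of the pinned commit, one agent's edits never reach another agent's files or the base repo's working tree.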

Details

Author: affaan-m
Repository: affaan-m/everything-claude-code
Created: 4 months ago
Last Updated: yesterday
Language: JavaScript
License: MIT

Integrates with

Anthropic · AI
pytest · Testing

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

0 Updated yesterday

Methasit-Pun

AI & Automation Listed

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

0 Updated 2 days ago

goharabbas321

AI & Automation Listed

agent-evaluation

Evaluate LLM agents and tool-using workflows—task success, tool accuracy, latency/cost, safety, and regression suites. Use when shipping agent features, comparing prompts/models, or debugging agent failures.

15 Updated 2 days ago

charlieviettq

AI & Automation Solid

eval-agent

Run evaluation tests against an agent to assess quality and archetype resistance

142 Updated yesterday

jmagly

AI & Automation Listed

agent-eval

[Agent Evaluation] Evaluates the output quality of AI agents. Triggers when the user says "evaluate agent", "test agent quality", "agent eval", or "check agent output".

0 Updated 2 days ago

afine907