agent-eval

Run head-to-head comparisons of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, reporting pass rate, cost, time, and consistency metrics. USE WHEN choosing between coding agents or benchmarking agent performance on representative tasks.
Sheshiyer/skill-clusters · ★ 0 · AI & Automation · score 72

Install: claude install-skill Sheshiyer/skill-clusters

# Agent Eval Skill

A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes; this tool systematizes it.

## When to Activate

- Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase
- Measuring agent performance before adopting a new tool or model
- Running regression checks when an agent updates its model or tooling
- Producing data-backed agent selection decisions for a team

## Installation

> **Note:** Install agent-eval from its repository after reviewing the source.

## Core Concepts

### YAML Task Definitions

Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:

```yaml
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
  - src/http_client.py
prompt: |
  Add retry logic with exponential backoff to all HTTP requests.
  Max 3 retries. Initial delay 1s, max delay 30s.
judge:
  - type: pytest
    command: pytest tests/test_http_client.py -v
  - type: grep
    pattern: "exponential_backoff|retry"
    files: src/http_client.py
commit: "abc1234"  # pin to specific commit for reproducibility
```

### Git Worktree Isolation

Each agent run gets its own git worktree, so no Docker is required. This isolates runs and keeps them reproducible: agents cannot interfere with each other or corrupt the base repo.

### Metrics Collected

| Metric | What It Measures |
|-