agent-evallisted

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
goharabbas321/zeoel-framework · ★ 0 · AI & Automation · score 65

Install: claude install-skill goharabbas321/zeoel-framework

# Agent Eval Skill A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes — this tool systematizes it. ## When to Activate - Comparing coding agents (Claude Code, Aider, Codex, etc.) on your own codebase - Measuring agent performance before adopting a new tool or model - Running regression checks when an agent updates its model or tooling - Producing data-backed agent selection decisions for a team ## Installation > **Note:** Install agent-eval from its repository after reviewing the source. ## Core Concepts ### YAML Task Definitions Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success: ```yaml name: add-retry-logic description: Add exponential backoff retry to the HTTP client repo: ./my-project files: - src/http_client.py prompt: | Add retry logic with exponential backoff to all HTTP requests. Max 3 retries. Initial delay 1s, max delay 30s. judge: - type: pytest command: pytest tests/test_http_client.py -v - type: grep pattern: "exponential_backoff|retry" files: src/http_client.py commit: "abc1234" # pin to specific commit for reproducibility ``` ### Git Worktree Isolation Each agent run gets its own git worktree — no Docker required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo. ### Metrics Collected | Metric | What It Measures | |-