eval-agent

Solid

Run evaluation tests against an agent to assess its quality and its resistance to failure archetypes

AI & Automation · 142 stars · 21 forks · Updated yesterday · MIT

Quality Score: 90/100

Stars 20%
Recency 20%: 100
Frontmatter 20%
Documentation 15%: 100
Issue Health 10%
License 10%: 100
Description 5%: 100

Skill Content

# Agent Evaluation

Run automated evaluation tests against an agent.

## Research Foundation

- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for failure archetype detection

## Usage

```bash
/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| agent-name | Yes | Agent to evaluate |

## Options

| Option | Default | Description |
|--------|---------|-------------|
| --category | all | Test category: archetype, performance, quality |
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |

## Test Categories

### archetype

Tests for Roig (2025) failure archetypes:

- `grounding-test` - Archetype 1: Premature action
- `substitution-test` - Archetype 2: Over-helpfulness
- `distractor-test` - Archetype 3: Context pollution
- `recovery-test` - Archetype 4: Fragile execution

### performance

- `latency-test` - Response time benchmarks
- `token-test` - Token efficiency
- `parallel-test` - Concurrent execution correctness

### quality

- `output-format` - Output structure validation
- `tool-usage` - Appropriate tool selection
- `scope-adherence` - Stays within defined scope

## Process

1. **Load Ag...
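
One way to read the `--category` and `--scenario` options is as a two-stage filter over the scenario names listed under Test Categories. The sketch below illustrates that reading only. The `select_scenarios` helper, the `SCENARIOS` mapping, and the error handling for unknown names are all hypothetical and are not taken from the skill's implementation, which may resolve these options differently.

```python
from typing import Dict, List

# Scenario names per category, copied from the Test Categories section.
SCENARIOS: Dict[str, List[str]] = {
    "archetype": ["grounding-test", "substitution-test", "distractor-test", "recovery-test"],
    "performance": ["latency-test", "token-test", "parallel-test"],
    "quality": ["output-format", "tool-usage", "scope-adherence"],
}


def select_scenarios(category: str = "all", scenario: str = "all") -> List[str]:
    """Hypothetical helper: resolve --category and --scenario to a list of scenarios.

    Both parameters default to "all", mirroring the documented option defaults.
    Raising ValueError on unknown names is an assumption, not documented behavior.
    """
    if category == "all":
        categories = list(SCENARIOS)
    elif category in SCENARIOS:
        categories = [category]
    else:
        raise ValueError(f"unknown category: {category!r}")

    # Flatten the selected categories into one ordered candidate list.
    candidates = [name for cat in categories for name in SCENARIOS[cat]]

    if scenario == "all":
        return candidates
    if scenario not in candidates:
        raise ValueError(f"scenario {scenario!r} not found in category {category!r}")
    return [scenario]


# Mirrors `--category archetype` from the Usage examples.
print(select_scenarios(category="archetype"))
# Mirrors `--scenario grounding-test` with the default category.
print(select_scenarios(scenario="grounding-test"))
```

Under this reading, passing a scenario without a category searches every category, which matches the third Usage example, where `--scenario grounding-test` appears without `--category`.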

Details

Author: jmagly
Repository: jmagly/aiwg
Created: 9 months ago
Last Updated: yesterday
Language: TypeScript
License: MIT

Integrates with

Anthropic · AI

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

eval-report

Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations

142 Updated yesterday

jmagly

AI & Automation Solid

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

201,447 Updated yesterday

affaan-m

AI & Automation Listed

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

0 Updated yesterday

Methasit-Pun

AI & Automation Listed

agent-eval

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics

0 Updated 2 days ago

goharabbas321

AI & Automation Listed

agent-eval

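[Agent Evaluation] Evaluates the output quality of AI agents. Triggered when the user says "评估 agent" (evaluate agent), "测试 agent 质量" (test agent quality), "agent eval", or "检查 agent 输出" (check agent output).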

0 Updated 2 days ago

afine907