eval-agent

Solid

Run evaluation tests against an agent to assess quality and archetype resistance

AI & Automation 142 stars 21 forks Updated yesterday MIT


Quality Score: 90/100

Stars (weight 20%): 72
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# Agent Evaluation

Run automated evaluation tests against an agent.

## Research Foundation

- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for failure archetype detection

## Usage

```bash
/eval-agent security-architect
/eval-agent architecture-designer --category archetype
/eval-agent test-engineer --scenario grounding-test --verbose
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| agent-name | Yes | Agent to evaluate |

## Options

| Option | Default | Description |
|--------|---------|-------------|
| --category | all | Test category: archetype, performance, quality |
| --scenario | all | Specific scenario to run |
| --verbose | false | Show detailed test output |
| --output | stdout | Output file for results |
| --strict | false | Fail on any test failure |

## Test Categories

### archetype

Tests for Roig (2025) failure archetypes:

- `grounding-test` - Archetype 1: Premature action
- `substitution-test` - Archetype 2: Over-helpfulness
- `distractor-test` - Archetype 3: Context pollution
- `recovery-test` - Archetype 4: Fragile execution

### performance

- `latency-test` - Response time benchmarks
- `token-test` - Token efficiency
- `parallel-test` - Concurrent execution correctness

### quality

- `output-format` - Output structure validation
- `tool-usage` - Appropriate tool selection
- `scope-adherence` - Stays within defined scope

## Process

1. **Load Ag...
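
As a rough illustration of how the `--category` and `--scenario` options might narrow the test set, the sketch below maps each documented category to its scenarios and resolves a selection. The category and scenario names come from the Test Categories section; the `TEST_CATEGORIES` dictionary, the `select_scenarios` function, and the way the two options combine are assumptions made for illustration and are not taken from the skill's implementation.

```python
# Category and scenario names are copied from the skill's documentation.
# The selection logic below is a hypothetical sketch, not the tool's code.
TEST_CATEGORIES = {
    "archetype": [
        "grounding-test",
        "substitution-test",
        "distractor-test",
        "recovery-test",
    ],
    "performance": ["latency-test", "token-test", "parallel-test"],
    "quality": ["output-format", "tool-usage", "scope-adherence"],
}


def select_scenarios(category="all", scenario="all"):
    """Return the scenario names selected by a category/scenario pair.

    Both parameters default to "all", mirroring the documented defaults
    of --category and --scenario. How the real command combines them is
    an assumption here.
    """
    if category == "all":
        # Flatten every category's scenarios, preserving documented order.
        candidates = [
            name for names in TEST_CATEGORIES.values() for name in names
        ]
    elif category in TEST_CATEGORIES:
        candidates = list(TEST_CATEGORIES[category])
    else:
        raise ValueError(f"unknown category: {category!r}")

    if scenario == "all":
        return candidates
    if scenario not in candidates:
        raise ValueError(
            f"scenario {scenario!r} not found in category {category!r}"
        )
    return [scenario]


if __name__ == "__main__":
    print(select_scenarios(category="archetype"))
    print(select_scenarios(scenario="grounding-test"))
```

Under these assumptions, a request that names a scenario outside the chosen category raises an error instead of silently running nothing, which is one plausible way to surface a mistyped option.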

Details

Author: jmagly
Repository: jmagly/aiwg
Created: 9 months ago
Last Updated: yesterday
Language: TypeScript
License: MIT
