eval-report
Solid
Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations
AI & Automation 142 stars
21 forks Updated yesterday MIT
Quality Score: 90/100
Stars 20%
Recency 20%
Frontmatter 20%
Documentation 15%
Issue Health 10%
License 10%
Description 5%
Skill Content
# Evaluation Report
Generate a quality report from accumulated evaluation results.
## Research Foundation
- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation
## Usage
```bash
/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json
```
## Options
| Option | Default | Description |
|--------|---------|-------------|
| `--output` | stdout | Output file path |
| `--compare` | none | Previous report to diff against |
| `--mode` | all | Agent category: sdlc, marketing, forensics, all |
| `--format` | markdown | Output format: markdown, json |
| `--since` | none | Only include results after this date (ISO 8601) |
| `--threshold` | 0.85 | Score below this triggers a warning |
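
To make the defaults in the table concrete, here is a minimal Python sketch of a parser that mirrors them. It is illustrative only: `/eval-report` is a slash command and its real argument handling is not shown in this document, so `build_parser` and the use of `argparse` are assumptions.

```python
import argparse
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not the skill's own code)."""
    parser = argparse.ArgumentParser(prog="eval-report")
    # None stands in for "stdout" / "none" in the table.
    parser.add_argument("--output", default=None, help="Output file path")
    parser.add_argument("--compare", default=None, help="Previous report to diff against")
    parser.add_argument(
        "--mode",
        default="all",
        choices=["sdlc", "marketing", "forensics", "all"],
        help="Agent category",
    )
    parser.add_argument(
        "--format",
        default="markdown",
        choices=["markdown", "json"],
        help="Output format",
    )
    # datetime.fromisoformat accepts dates such as 2025-01-15; a trailing "Z"
    # is only accepted on Python 3.11 and later.
    parser.add_argument("--since", default=None, type=datetime.fromisoformat)
    parser.add_argument("--threshold", default=0.85, type=float)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--mode", "sdlc", "--format", "json"])
    print(args)
```

Using `choices` rejects an unknown mode or format at parse time, which matches the closed sets of values listed in the table.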
## Process
1. **Collect Results**: Read all `eval-*.json` files from `.aiwg/reports/`
2. **Aggregate Scores**: Compute per-agent and per-archetype scores
3. **Detect Regressions**: Compare against the `--compare` baseline, if provided
4. **Rank Agents**: Sort by overall score, flag below-threshold agents
5. **Build Recommendations**: Surface specific agents and archetypes needing attention
6. **Output Report**: Write markdown or JSON to `--output` or stdout
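
Steps 1 to 4 above can be sketched in Python as follows. This is a rough illustration, not the skill's implementation: the schema of the `eval-*.json` files is not documented here, so the `agent`, `archetype`, and `score` keys, the use of a plain mean for aggregation, and all function names are assumptions.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def collect_results(report_dir):
    """Step 1: load every eval-*.json file found in report_dir."""
    results = []
    for path in sorted(Path(report_dir).glob("eval-*.json")):
        with path.open(encoding="utf-8") as fh:
            results.append(json.load(fh))
    return results


def aggregate_scores(results):
    """Step 2: mean score per agent and per archetype (assumed record keys)."""
    by_agent = defaultdict(list)
    by_archetype = defaultdict(list)
    for record in results:
        by_agent[record["agent"]].append(record["score"])
        by_archetype[record["archetype"]].append(record["score"])
    agent_scores = {name: mean(scores) for name, scores in by_agent.items()}
    archetype_scores = {name: mean(scores) for name, scores in by_archetype.items()}
    return agent_scores, archetype_scores


def detect_regressions(current, baseline):
    """Step 3: map each agent that scored lower than the baseline to its delta."""
    return {
        agent: score - baseline[agent]
        for agent, score in current.items()
        if agent in baseline and score < baseline[agent]
    }


def rank_agents(agent_scores, threshold=0.85):
    """Step 4: sort by score, highest first, and flag below-threshold agents."""
    ranked = sorted(agent_scores.items(), key=lambda item: item[1], reverse=True)
    return [
        {"agent": agent, "score": score, "below_threshold": score < threshold}
        for agent, score in ranked
    ]
```

Agents that appear only in the current run are skipped by `detect_regressions`, since there is no baseline value to compare them against.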
## Report Sections
### Summary Dashboard
Overall health at a glance: total agents tested, aggregate score, and regression count.
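
As a rough illustration of how these three figures might be derived and rendered, the Python sketch below builds a summary from per-agent scores and a regression map. The input shapes, the unweighted mean used for the aggregate score, and the names `summary_dashboard` and `render_markdown` are all assumptions; the actual report layout is not specified in this document.

```python
from statistics import mean


def summary_dashboard(agent_scores, regressions, threshold=0.85):
    """Build the headline numbers from per-agent scores (hypothetical shape)."""
    scores = list(agent_scores.values())
    return {
        "total_agents": len(scores),
        # Assumption: the aggregate is an unweighted mean of per-agent scores.
        "aggregate_score": mean(scores) if scores else None,
        "regression_count": len(regressions),
        "below_threshold": sorted(
            agent for agent, score in agent_scores.items() if score < threshold
        ),
    }


def render_markdown(summary):
    """Render the summary as a small markdown table."""
    score = summary["aggregate_score"]
    score_text = "n/a" if score is None else f"{score:.2f}"
    lines = [
        "| Metric | Value |",
        "|--------|-------|",
        f"| Agents tested | {summary['total_agents']} |",
        f"| Aggregate score | {score_text} |",
        f"| Regressions | {summary['regression_count']} |",
    ]
    return "\n".join(lines)
```

Returning a plain dictionary first keeps the same summary usable for both the markdown and JSON output formats.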
### By Archetype
Pass rates per Roig (2025) failure archetyp...
Details
- Author: jmagly
- Repository: jmagly/aiwg
- Created: 9 months ago
- Last Updated: yesterday
- Language: TypeScript
- License: MIT
Similar Skills
Semantically similar based on skill content — not just same category
AI & Automation Solid
eval-agent
Run evaluation tests against an agent to assess quality and archetype resistance
142 Updated yesterday
jmagly AI & Automation Solid
eval
Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
16,782 Updated 3 days ago
alirezarezvani AI & Automation Listed
agent-eval
[Agent Evaluation] Evaluates the output quality of an AI agent. Triggers when the user says "evaluate agent", "test agent quality", "agent eval", or "check agent output".
0 Updated 2 days ago
afine907 AI & Automation Solid
agent-eval
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
201,447 Updated yesterday
affaan-m AI & Automation Listed
agent-eval
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
0 Updated yesterday
Methasit-Pun