eval-report

Solid

Generate an aggregate agent quality report from evaluation results, showing scores, regressions, and recommendations

AI & Automation 142 stars 21 forks Updated yesterday MIT

Quality Score: 90/100

Stars (20%): 72
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# Evaluation Report

Generate a quality report from accumulated evaluation results.

## Research Foundation

- **REF-001**: BP-9 - Continuous evaluation of agent performance
- **REF-002**: KAMI benchmark methodology for real agentic task evaluation

## Usage

```bash
/eval-report
/eval-report --output .aiwg/reports/quality-report.md
/eval-report --compare previous-report.json
/eval-report --mode sdlc --format json
```

## Options

| Option | Default | Description |
|--------|---------|-------------|
| --output | stdout | Output file path |
| --compare | none | Previous report to diff against |
| --mode | all | Agent category: sdlc, marketing, forensics, all |
| --format | markdown | Output format: markdown, json |
| --since | none | Only include results after this date (ISO 8601) |
| --threshold | 0.85 | Score below this triggers a warning |

## Process

1. **Collect Results**: Read all `eval-*.json` files from `.aiwg/reports/`
2. **Aggregate Scores**: Compute per-agent and per-archetype scores
3. **Detect Regressions**: Compare against --compare baseline if provided
4. **Rank Agents**: Sort by overall score, flag below-threshold agents
5. **Build Recommendations**: Surface specific agents and archetypes needing attention
6. **Output Report**: Write markdown or JSON to --output or stdout

## Report Sections

### Summary Dashboard

Overall health at a glance: total agents tested, aggregate score, regression count.

### By Archetype

Pass rates per Roig (2025) failure archetyp...
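The collect, aggregate, regression, and ranking steps listed under Process can be sketched roughly as follows. This is an illustrative Python sketch, not the skill's actual implementation: the record shape (a JSON list of objects with `agent` and `score` keys) and every function name are assumptions, since the listing does not show the schema of the `eval-*.json` files. Only the file pattern and the 0.85 default threshold come from the text above.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def collect(report_dir):
    """Read every eval-*.json file in report_dir.

    Assumption: each file holds a JSON list of {"agent": str, "score": float}.
    """
    results = []
    for path in sorted(Path(report_dir).glob("eval-*.json")):
        results.extend(json.loads(path.read_text()))
    return results


def aggregate(results):
    """Average the scores of each agent (hypothetical per-agent aggregation)."""
    by_agent = defaultdict(list)
    for record in results:
        by_agent[record["agent"]].append(record["score"])
    return {agent: mean(scores) for agent, scores in by_agent.items()}


def detect_regressions(current, baseline):
    """Return {agent: delta} for agents that score lower than in the baseline."""
    return {
        agent: current[agent] - baseline[agent]
        for agent in current
        if agent in baseline and current[agent] < baseline[agent]
    }


def rank(scores, threshold=0.85):
    """Sort agents by score, highest first, and flag those below threshold."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(agent, score, score < threshold) for agent, score in ordered]


# Hypothetical sample data standing in for collected results and a baseline.
sample = [
    {"agent": "planner", "score": 1.0},
    {"agent": "planner", "score": 0.875},
    {"agent": "reviewer", "score": 0.75},
    {"agent": "reviewer", "score": 0.75},
]
current = aggregate(sample)
previous = {"planner": 0.9, "reviewer": 0.8125}

for agent, score, below in rank(current):
    print(agent, score, "WARNING" if below else "ok")
print("regressions:", detect_regressions(current, previous))
```

With this sample data, `reviewer` averages 0.75, which falls below the default threshold and also below its baseline, so it is flagged in both the ranking and the regression output. Per-archetype scoring and the recommendation step are left out because the listing truncates before describing them.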

Details

Author: jmagly
Repository: jmagly/aiwg
Created: 9 months ago
Last Updated: yesterday
Language: TypeScript
License: MIT
