eval-harness
Solid
Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.
AI & Automation 1,160 stars
71 forks Updated today MIT
Quality Score: 94/100
Stars 20%
Recency 20%
Frontmatter 20%
Documentation 15%
Issue Health 10%
License 10%
Description 5%
Skill Content
# Eval Harness
## Overview
This skill describes an evaluation harness methodology adapted from the Everything Claude Code project. It provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.
## Evaluation Types
### 1. Agent Performance Benchmark
- Define test cases with known-correct outputs
- Run agent against each test case
- Score each output for accuracy, completeness, and relevance
- Compare against baseline performance
- Track performance over time
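
The benchmark steps above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual implementation: `BenchmarkCase`, `run_benchmark`, `accuracy`, the stand-in `echo_upper_agent`, and the baseline value are all hypothetical names and numbers chosen for the example, and exact string matching is only one possible notion of a "known-correct output".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    """One test case with a known-correct output (hypothetical structure)."""
    name: str
    prompt: str
    expected: str


def run_benchmark(agent: Callable[[str], str], cases: list[BenchmarkCase]) -> dict[str, bool]:
    """Run the agent on every case and record whether the output matches exactly."""
    return {c.name: agent(c.prompt).strip() == c.expected.strip() for c in cases}


def accuracy(results: dict[str, bool]) -> float:
    """Percentage of passing cases (0-100)."""
    return 100.0 * sum(results.values()) / len(results) if results else 0.0


def echo_upper_agent(prompt: str) -> str:
    """Stand-in for a real agent call; it simply upper-cases the prompt."""
    return prompt.upper()


cases = [
    BenchmarkCase("greet", "hello", "HELLO"),
    BenchmarkCase("mixed", "Hi", "HI"),
    BenchmarkCase("wrong", "abc", "abd"),  # deliberately failing case
]

results = run_benchmark(echo_upper_agent, cases)
score = accuracy(results)

baseline = 50.0           # assumed score from an earlier run
delta = score - baseline  # positive means improvement over the baseline
print(f"accuracy={score:.1f} baseline={baseline:.1f} delta={delta:+.1f}")
```

Storing `score` with a timestamp after each run would be one simple way to track performance over time.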
### 2. Skill Quality Testing
- Verify skill instructions produce expected outcomes
- Test edge cases and boundary conditions
- Measure consistency across multiple runs
- Check for harmful or incorrect outputs
- Validate against ground truth
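
One way to measure consistency across multiple runs is to call the skill repeatedly on the same input and report how often the outputs agree. The sketch below assumes that definition (share of runs matching the most common output); the function name `consistency` and the stand-in `flaky_skill` are hypothetical.

```python
import itertools
from collections import Counter
from typing import Callable


def consistency(skill: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Share of runs (0-100) whose output equals the most common output."""
    outputs = [skill(prompt) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return 100.0 * most_common_count / runs


# Stand-in for a non-deterministic skill: cycles through fixed answers so the
# example is reproducible. A real skill would call a model instead.
_answers = itertools.cycle(["4", "4", "4", "four", "4"])


def flaky_skill(prompt: str) -> str:
    return next(_answers)


consistency_score = consistency(flaky_skill, "2 + 2 = ?", runs=5)
print(consistency_score)  # 4 of the 5 runs agree
```

Exact string equality is a strict criterion; for free-form outputs a normalization step or a semantic comparison may be more appropriate.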
### 3. Regression Suite
- Collection of previously-passing test cases
- Run after any agent/skill modification
- Flag regressions with before/after comparison
- Maintain a pass rate of at least 95%
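
A regression check along these lines could compare pass/fail results recorded before and after a change. This is a sketch under assumptions: results are stored as a mapping from case name to pass/fail, and `find_regressions`, `pass_rate`, and `suite_ok` are hypothetical helper names. Only the 95% threshold comes from the list above.

```python
PASS_RATE_THRESHOLD = 0.95  # the ">= 95%" threshold stated above


def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Cases that passed before the change and now fail or are missing."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of passing cases (0.0-1.0)."""
    return sum(results.values()) / len(results) if results else 0.0


def suite_ok(results: dict[str, bool]) -> bool:
    """True when the suite meets the pass-rate threshold."""
    return pass_rate(results) >= PASS_RATE_THRESHOLD


# Example: 20 previously passing cases, two of which break after a change.
before = {f"case_{i}": True for i in range(20)}
after = dict(before)
after["case_7"] = False
after["case_12"] = False

regressions = find_regressions(before, after)
print(regressions, pass_rate(after), suite_ok(after))
```

Here the suite drops to a 90% pass rate, so it falls below the threshold and both broken cases are reported by name.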
### 4. Process Verification
- End-to-end process execution with known inputs
- Verify each phase produces expected outputs
- Check task ordering and dependency satisfaction
- Measure total execution time
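
As a rough illustration of process verification, the sketch below runs a list of phases in order, times the whole run, and checks the recorded order against declared dependencies. The phase names, the `deps` mapping, and the helpers `run_process` and `check_ordering` are all hypothetical; a real process would also validate each phase's output.

```python
import time
from typing import Callable


def run_process(phases: list[tuple[str, Callable[[], None]]]) -> tuple[list[str], float]:
    """Execute phases in order; return the execution order and elapsed seconds."""
    start = time.perf_counter()
    executed = []
    for name, fn in phases:
        fn()
        executed.append(name)
    return executed, time.perf_counter() - start


def check_ordering(executed: list[str], deps: dict[str, list[str]]) -> list[str]:
    """Return one message per task that ran without a dependency finishing first."""
    position = {task: i for i, task in enumerate(executed)}
    violations = []
    for task, requires in deps.items():
        if task not in position:
            continue  # task never ran; nothing to order-check
        for dep in requires:
            if dep not in position or position[dep] > position[task]:
                violations.append(f"{task} ran without {dep} finishing first")
    return violations


# Hypothetical three-phase process with no-op phase bodies.
phases = [("plan", lambda: None), ("build", lambda: None), ("test", lambda: None)]
deps = {"build": ["plan"], "test": ["build"]}

executed, elapsed = run_process(phases)
ok_violations = check_ordering(executed, deps)
bad_violations = check_ordering(["build", "plan", "test"], deps)
print(executed, f"{elapsed:.6f}s", ok_violations, bad_violations)
```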
## Quality Scoring
### Accuracy Score (0-100)
- Correctness of output vs expected
- Partial credit for partially correct outputs
- Penalty for hallucinated or fabricated content
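
The source does not give a formula for the accuracy score, so the following is only one plausible sketch. It assumes outputs can be broken into discrete facts, gives partial credit in proportion to the expected facts that appear, and subtracts a fixed penalty per fact that was not expected. The function name and the penalty of 10 points are illustrative assumptions.

```python
def accuracy_score(expected: set[str], produced: set[str],
                   penalty_per_fabrication: float = 10.0) -> float:
    """Score 0-100: partial credit for matched facts, minus a penalty per extra fact."""
    if not expected:
        return 0.0
    matched = len(expected & produced)      # correct facts present in the output
    fabricated = len(produced - expected)   # facts with no support in the expected set
    raw = 100.0 * matched / len(expected) - penalty_per_fabrication * fabricated
    return max(0.0, min(100.0, raw))        # clamp to the 0-100 range


expected_facts = {"paris", "france", "europe", "seine"}
produced_facts = {"paris", "france", "atlantis"}

acc = accuracy_score(expected_facts, produced_facts)
print(acc)  # 2 of 4 facts matched (50), one fabricated fact (-10)
```

Treating every unexpected fact as fabricated is deliberately strict; a real scorer would need a way to distinguish harmless extra detail from hallucinated content.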
### Completeness Score (0-100)
- Coverage of required output elements
- Missing sections flagged and scored
- Bonus for useful additional context
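
A completeness score of this kind might be computed as coverage of the required sections plus a small capped bonus for extras. The sketch below is an assumption about how that could work; the section names, the bonus of 2 points per extra section, and the cap of 10 are illustrative values, not taken from the source.

```python
def completeness_score(required: list[str], present: list[str],
                       bonus_per_extra: float = 2.0,
                       max_bonus: float = 10.0) -> tuple[float, list[str]]:
    """Return (score 0-100, missing sections) for an output's section list."""
    missing = [s for s in required if s not in present]
    covered = len(required) - len(missing)
    base = 100.0 * covered / len(required) if required else 100.0
    extras = [s for s in present if s not in required]
    bonus = min(max_bonus, bonus_per_extra * len(extras))  # capped bonus for extra context
    return min(100.0, base + bonus), missing


required_sections = ["summary", "steps", "risks", "tests"]
present_sections = ["summary", "steps", "tests", "references"]

comp, missing_sections = completeness_score(required_sections, present_sections)
print(comp, missing_sections)  # 3 of 4 required sections, plus one extra section
```

Returning the missing sections alongside the number keeps the flagging and the scoring in one place.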
### Consistency Score (0-100)
- Run same inp...
Details
- Author: a5c-ai
- Repository: a5c-ai/babysitter
- Created: 4 months ago
- Last Updated: today
- Language: JavaScript
- License: MIT
Similar Skills
Semantically similar based on skill content — not just same category
- eval-harness (affaan-m) · AI & Automation · Solid · 199,470 · Updated yesterday
  Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
- eval-harness (uzysjung) · AI & Automation · Listed · 0 · Updated yesterday
  Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
- eval-harness (vibeeval) · AI & Automation · Solid · 496 · Updated 1 month ago
  Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
- eval-harness (arabicapp) · AI & Automation · Solid · 54 · Updated today
  Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
- eval-harness (immacualate) · AI & Automation · Listed · 4 · Updated today
  Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles