eval-harness
Solid
Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
AI & Automation 199,470 stars
30,623 forks Updated yesterday MIT
Quality Score: 96/100
Stars 20%
Recency 20%
Frontmatter 20%
Documentation 15%
Issue Health 10%
License 10%
Description 5%
Skill Content
# Eval Harness Skill
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
## When to Activate
- Setting up eval-driven development (EDD) for AI-assisted workflows
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions
## Philosophy
Eval-Driven Development treats evals as the "unit tests of AI development":
- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement
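The list above names pass@k without defining it. The following is a minimal sketch, assuming the commonly used meaning (the probability that at least one of k sampled attempts passes) and the usual estimator 1 - C(n-c, k) / C(n, k) computed from n recorded attempts of which c passed. The function name `pass_at_k` and its parameters are illustrative and are not part of the skill itself.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which passed.

    Assumed definition: the chance that a random subset of k attempts
    (drawn from the n recorded ones) contains at least one pass.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the size-k subsets made only of failures;
    # it is 0 when fewer than k failures exist, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 10 attempts at one eval, 3 of them passed.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```

With k=1 the estimate reduces to the plain pass rate, which makes it a convenient sanity check when wiring the metric into a report.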
## Eval Types
### Capability Evals
Test if Claude can do something it couldn't before:
```markdown
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [ ] Criterion 1
- [ ] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
```
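One way a filled-in capability eval could be scored is sketched below. It assumes that a satisfied criterion is marked `- [x]` (the usual Markdown task-list convention, not something the template states) and that the eval passes only when every criterion is checked. The function name `grade_capability_eval` and the shape of the returned dictionary are hypothetical.

```python
import re

# Matches task-list lines such as "- [ ] Criterion 1" or "- [x] Criterion 1".
CRITERION = re.compile(r"^\s*-\s*\[( |x|X)\]\s*(.+)$")


def grade_capability_eval(text: str) -> dict:
    """Summarize the checklist found in a filled-in capability eval block."""
    results = []
    for line in text.splitlines():
        match = CRITERION.match(line)
        if match:
            checked = match.group(1).lower() == "x"
            results.append((match.group(2).strip(), checked))
    met = sum(1 for _, checked in results if checked)
    return {
        "criteria": results,
        "met": met,
        "total": len(results),
        # Assumption: an eval with no criteria never counts as passed.
        "passed": bool(results) and met == len(results),
    }


example = """\
[CAPABILITY EVAL: feature-name]
Task: Description of what Claude should accomplish
Success Criteria:
- [x] Criterion 1
- [x] Criterion 2
- [ ] Criterion 3
Expected Output: Description of expected result
"""

summary = grade_capability_eval(example)
print(f"{summary['met']}/{summary['total']} criteria met, passed={summary['passed']}")
# prints: 2/3 criteria met, passed=False
```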
### Regression Evals
Ensure changes don't break existing functionality:
```markdown
[REGRESSION EVAL: feature-name]
Baseline: SHA or checkpoint name
Tests:
- existing-test-1: PASS/FAIL
- existing-test-2: PASS/FAIL
- existing-test-3: PASS/FAIL
Result: X/Y passed (previously Y/Y)
```
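The regression template above can be filled in mechanically once baseline and current results are available. The sketch below assumes each run is represented as a mapping from test name to a pass/fail boolean, and that a test missing from the current run counts as a failure. Both the data shape and the `regression_report` helper are illustrative assumptions, not the skill's own implementation.

```python
def regression_report(
    baseline: dict[str, bool], current: dict[str, bool]
) -> tuple[str, list[str]]:
    """Compare current results with a baseline; return (report, regressions)."""
    # Assumption: a test absent from `current` is treated as failing.
    status = {name: current.get(name, False) for name in baseline}
    # A regression is a test that passed at the baseline and fails now.
    regressions = [name for name, ok in status.items() if baseline[name] and not ok]
    total = len(baseline)
    lines = ["Tests:"]
    lines += [f"- {name}: {'PASS' if ok else 'FAIL'}" for name, ok in status.items()]
    lines.append(
        f"Result: {sum(status.values())}/{total} passed "
        f"(previously {sum(baseline.values())}/{total})"
    )
    return "\n".join(lines), regressions


# Hypothetical results for the three placeholder tests in the template.
baseline = {"existing-test-1": True, "existing-test-2": True, "existing-test-3": True}
current = {"existing-test-1": True, "existing-test-2": False, "existing-test-3": True}

report, regressions = regression_report(baseline, current)
print(report)
print("Regressions:", regressions)
```

Returning the regressed test names separately from the formatted report keeps the pass/fail decision easy to act on in a script, for example by exiting non-zero when the list is non-empty.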
## Grader Types
### 1. Code-Based Grader
Deterministic checks using code:
```bash
# Check if file contains expected pattern
grep -q "export function handleAuth" src/auth....
```
Details
- Author: affaan-m
- Repository: affaan-m/ECC
- Created: 4 months ago
- Last Updated: yesterday
- Language: JavaScript
- License: MIT