Install: claude install-skill dzianisv/opencode-plugins
# Agent Evaluation Skill
Evaluate AI agent task execution using LLM-as-judge patterns drawn from the DeepEval, RAGAS, and G-Eval frameworks.
## Output Format
Evaluation results are saved to `evals/results/eval-${yyyy-mm-dd-hh-mm}-${commit_id}.md`, for example `evals/results/eval-2026-01-28-12-30-abc1234.md`.
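
The naming pattern above could be built roughly as follows. This is a minimal sketch, not the skill's actual code: `evalResultPath` is a hypothetical helper, and the use of local time with zero-padded fields is an assumption inferred from the pattern.

```typescript
// Hypothetical helper (not part of the skill): builds a result path that
// follows eval-${yyyy-mm-dd-hh-mm}-${commit_id}.md.
// Assumption: local time, with every field after the year zero-padded to 2 digits.
function evalResultPath(date: Date, commitId: string): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const stamp = [
    String(date.getFullYear()),
    pad(date.getMonth() + 1), // getMonth() is zero-based
    pad(date.getDate()),
    pad(date.getHours()),
    pad(date.getMinutes()),
  ].join("-");
  return `evals/results/eval-${stamp}-${commitId}.md`;
}

// Usage: 28 January 2026, 12:30 local time, short commit hash "abc1234".
console.log(evalResultPath(new Date(2026, 0, 28, 12, 30), "abc1234"));
```

Passing the date and commit id in as arguments, rather than reading the clock or calling git inside the helper, keeps the function deterministic and easy to test.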
### Results Table
| Task Input | Agent Output | Reflection Input | Reflection Output | Score | Verdict | Feedback |
|------------|--------------|------------------|-------------------|-------|---------|----------|
| Create hello.js... | I've created hello.js with... | Task: Create hello.js Agent Output: ... | Task complete | 5/5 | COMPLETE | Agent produced output; Found completion indicators |
| Fix the bug... | I found the issue and... | Task: Fix bug Agent Output: ... | (none) | 3/5 | PARTIAL | Agent produced output; Missing reflection |
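
One way to produce rows in this shape is sketched below. The `EvalRow` interface, its field names, and the `renderRow` and `escapeCell` helpers are illustrative assumptions rather than the skill's real types. The verdict names come from the rubric in this document.

```typescript
// Verdict labels as listed in the evaluation rubric.
type Verdict =
  | "COMPLETE"
  | "MOSTLY_COMPLETE"
  | "PARTIAL"
  | "ATTEMPTED"
  | "FAILED"
  | "NO_ATTEMPT";

// Hypothetical row shape mirroring the table columns (field names are assumptions).
interface EvalRow {
  taskInput: string;
  agentOutput: string;
  reflectionInput: string;
  reflectionOutput: string | null; // null is rendered as "(none)"
  score: number; // 0-5
  verdict: Verdict;
  feedback: string[]; // joined with "; " in the Feedback column
}

// Keep a cell on one table line: escape pipes and collapse line breaks to spaces.
function escapeCell(text: string): string {
  return text.replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ");
}

// Render one Markdown table row in the column order of the results table.
function renderRow(row: EvalRow): string {
  const cells = [
    row.taskInput,
    row.agentOutput,
    row.reflectionInput,
    row.reflectionOutput ?? "(none)",
    `${row.score}/5`,
    row.verdict,
    row.feedback.join("; "),
  ];
  return `| ${cells.map(escapeCell).join(" | ")} |`;
}

console.log(
  renderRow({
    taskInput: "Fix the bug...",
    agentOutput: "I found the issue and...",
    reflectionInput: "Task: Fix bug\nAgent Output: ...",
    reflectionOutput: null,
    score: 3,
    verdict: "PARTIAL",
    feedback: ["Agent produced output", "Missing reflection"],
  })
);
```

Escaping pipes and collapsing newlines matters because agent output often contains both, and either would break the table layout.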
### Run Evaluation
```bash
# Run E2E evaluation
npx tsx eval.ts
# Or via npm
npm run eval:e2e
# Output saved to: evals/results/eval-2026-01-28-12-30-abc1234.md
```
---
## Evaluation Rubric (0-5)
| Score | Verdict | Criteria |
|-------|---------|----------|
| **5** | COMPLETE | Task fully accomplished. All requirements met. Optimal execution. |
| **4** | MOSTLY_COMPLETE | Task done with minor issues. 1-2 suboptimal steps. |
| **3** | PARTIAL | Core objective achieved but significant gaps or errors. |
| **2** | ATTEMPTED | Progress made but failed to complete. Correct intent, wrong execution. |
| **1** | FAILED | Wrong approach or incorrect result. |
| **0** | NO_ATTEMPT | No meaningful