Install: claude install-skill viktorbezdek/skillstack
# Evaluating LLM Agent Systems
Agent evaluation requires fundamentally different approaches from traditional software testing. Agents make dynamic decisions, behave non-deterministically, and often work on tasks that have no single correct answer. Effective evaluation must account for these characteristics while still providing actionable feedback.
**Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
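As one illustration of bias mitigation, the sketch below runs a pairwise comparison twice with the response order swapped and accepts a winner only when both passes agree. It is a minimal sketch, not a definitive implementation: the `Judge` signature, the `"first"`/`"second"`/`"tie"` verdict strings, and the two toy judges are assumptions made for the example, and a real judge would wrap an LLM call.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# (prompt, first_response, second_response) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def pairwise_with_swap(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    # Pass 1: response A is shown in the first slot.
    verdict_ab = judge(prompt, a, b)
    # Pass 2: same pair, order swapped, so B is shown in the first slot.
    verdict_ba = judge(prompt, b, a)

    # Map slot-based verdicts back to the underlying responses.
    winner_ab = {"first": "A", "second": "B", "tie": "tie"}[verdict_ab]
    winner_ba = {"first": "B", "second": "A", "tie": "tie"}[verdict_ba]

    # Agreement across both orders is accepted; disagreement suggests the
    # verdict depended on position, so it is recorded as a tie.
    return winner_ab if winner_ab == winner_ba else "tie"


def length_judge(prompt: str, first: str, second: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def first_slot_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always picks the first slot."""
    return "first"


consistent = pairwise_with_swap(length_judge, "q", "a longer answer", "short")
biased = pairwise_with_swap(first_slot_judge, "q", "a longer answer", "short")
```

Here the position-insensitive toy judge yields a stable winner, while the judge that always favors the first slot collapses to a tie. Treating disagreement as a tie is one possible policy; logging the disagreement rate separately is another reasonable choice.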
## When to Activate
- Testing agent performance systematically
- Validating context engineering choices
- Measuring improvements or catching regressions over time
- Building quality gates for agent pipelines
- Comparing different agent configurations or model outputs
- Building automated evaluation pipelines for LLM outputs
- Designing A/B tests for prompt or model changes
- Debugging evaluation systems that show inconsistent results
- Analyzing correlation between automated and human judgments
## Decision Tree: Choosing an Evaluation Approach
```
What are you evaluating?
+-- Agent outputs against known correct answers?
|   +-- Yes --> Direct Scoring (factual accuracy, format compliance, instruction following)
|   +-- No --> Are you comparing two configurations?
|       +-- Yes --> Pairwise Comparison with position-swap protocol
|       |           Criteria: tone, style, persuasiveness, creativity
|       +-- No --> Do you have reference m