Install: claude install-skill dtsong/agentic-council
# AI Evaluation
## Purpose
Design an evaluation framework for AI/LLM features, including golden dataset creation, automated scoring rubrics, hallucination detection, and regression testing infrastructure.
## Scope Constraints
Reads feature specifications, evaluation requirements, and existing infrastructure details to inform the framework design. Does not execute model inference, create production datasets, or access live model endpoints directly.
## Inputs
- AI feature being evaluated (what it does, expected behavior)
- Input data examples and edge cases
- Quality requirements (accuracy thresholds, hallucination tolerance)
- Existing evaluation infrastructure (if any)
- Production monitoring requirements
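
One way to keep these inputs together is a small typed record that is filled in before the procedure starts. The sketch below is illustrative only: the class names `QualityRequirements` and `EvaluationInputs`, their field names, and the `missing_items` helper are assumptions made for this example, not part of the skill itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QualityRequirements:
    """Hypothetical container for quality thresholds, expressed as fractions."""
    min_accuracy: float
    max_hallucination_rate: float

    def __post_init__(self) -> None:
        # Both thresholds are rates, so they must lie in [0, 1].
        for name in ("min_accuracy", "max_hallucination_rate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


@dataclass
class EvaluationInputs:
    """Hypothetical record mirroring the inputs listed above."""
    feature_description: str
    expected_behavior: str
    quality: QualityRequirements
    input_examples: List[str] = field(default_factory=list)
    edge_cases: List[str] = field(default_factory=list)
    # Optional, because existing infrastructure may not exist.
    existing_infrastructure: Optional[str] = None
    monitoring_requirements: List[str] = field(default_factory=list)

    def missing_items(self) -> List[str]:
        """Return the names of required inputs that are still empty."""
        required = {
            "feature_description": self.feature_description,
            "expected_behavior": self.expected_behavior,
            "input_examples": self.input_examples,
            "edge_cases": self.edge_cases,
            "monitoring_requirements": self.monitoring_requirements,
        }
        return [name for name, value in required.items() if not value]
```

Calling `missing_items()` before Step 1 gives a quick list of what still needs to be gathered. Existing infrastructure is left out of that check because the input list marks it as optional.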
## Input Sanitization
No user-provided values are used in commands or file paths. All inputs are treated as read-only analysis targets.
## Procedure
### Progress Checklist
- [ ] Step 1: Define evaluation dimensions
- [ ] Step 2: Build golden dataset
- [ ] Step 3: Design automated scoring
- [ ] Step 4: Design hallucination detection
- [ ] Step 5: Design regression testing
- [ ] Step 6: Design production monitoring
### Step 1: Define Evaluation Dimensions
Identify what "good" means for this feature:
- **Correctness:** Does the output match the expected answer?
- **Faithfulness:** Does the output only use information from the provided context?
- **Relevance:** Does the output answer the actual question asked?
- **Completeness:** Does the output cover all aspects of the questi