ai-evaluation

Use when designing an evaluation framework for AI/LLM features. Covers golden dataset creation, automated scoring rubrics, hallucination detection, regression testing infrastructure, and production monitoring. Do not use for prompt design (use prompt-engineering) or RAG pipeline architecture (use rag-architecture).
dtsong/my-claude-setup · ★ 5 · AI & Automation · score 76

Install: claude install-skill dtsong/my-claude-setup

# AI Evaluation

## Purpose

Design an evaluation framework for AI/LLM features, including golden dataset creation, automated scoring rubrics, hallucination detection, and regression testing infrastructure.

## Scope Constraints

Reads feature specifications, evaluation requirements, and existing infrastructure details for framework design. Does not execute model inference, create production datasets, or access live model endpoints directly.

## Inputs

- AI feature being evaluated (what it does, expected behavior)
- Input data examples and edge cases
- Quality requirements (accuracy thresholds, hallucination tolerance)
- Existing evaluation infrastructure (if any)
- Production monitoring requirements

## Input Sanitization

No user-provided values are used in commands or file paths. All inputs are treated as read-only analysis targets.

## Procedure

### Progress Checklist

- [ ] Step 1: Define evaluation dimensions
- [ ] Step 2: Build golden dataset
- [ ] Step 3: Design automated scoring
- [ ] Step 4: Design hallucination detection
- [ ] Step 5: Design regression testing
- [ ] Step 6: Design production monitoring

### Step 1: Define Evaluation Dimensions

Identify what "good" means for this feature:

- **Correctness:** Does the output match the expected answer?
- **Faithfulness:** Does the output only use information from the provided context?
- **Relevance:** Does the output answer the actual question asked?
- **Completeness:** Does the output cover all aspects of the questi