eval-driven-dev
Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM.
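The loop described above (golden dataset, eval test, pass rate, iterate on failures) might look roughly like the following minimal sketch. Everything in it is an assumption made for illustration: `fake_app`, `GoldenCase`, `run_evals`, and the substring check are hypothetical and are not taken from the skill itself. A real setup would call the instrumented application instead of a canned stub.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GoldenCase:
    """One golden-dataset entry (hypothetical shape, for illustration only)."""
    prompt: str
    must_contain: str  # substring the answer is expected to include


def fake_app(prompt: str) -> str:
    """Hypothetical stand-in for the LLM-backed app under test."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
        "Who wrote Hamlet?": "I am not sure.",
    }
    return canned.get(prompt, "")


GOLDEN: List[GoldenCase] = [
    GoldenCase("What is the capital of France?", "Paris"),
    GoldenCase("What is 2 + 2?", "4"),
    GoldenCase("Who wrote Hamlet?", "Shakespeare"),
]


def run_evals(
    app: Callable[[str], str], cases: List[GoldenCase]
) -> Tuple[float, List[GoldenCase]]:
    """Run every golden case through the app; return pass rate and failures."""
    failures = [
        case for case in cases
        if case.must_contain.lower() not in app(case.prompt).lower()
    ]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures


pass_rate, failures = run_evals(fake_app, GOLDEN)
print(f"pass rate: {pass_rate:.2f}")
for case in failures:
    # Failures are the work list for the next iteration.
    print(f"FAILED: {case.prompt!r} (expected {case.must_contain!r})")
```

Reporting a pass rate alongside the list of failing cases, instead of a single pass/fail result, keeps the failures visible so they can guide the next change to the prompt or the app.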
Quality Score: 99/100
Details
- Author: github
- Repository: github/awesome-copilot
- Created: 11 months ago
- Last Updated: today
- Language: Python
- License: MIT
Similar Skills
Semantically similar based on skill content, not just the same category
eval-driven-development
Use when reasoning about building language-model-integrated systems by writing evaluations before and alongside the system: the statistical (not binary) nature of LLM evals, the five primitives (dataset, evaluation function, aggregation, iteration loop, regression budget), the judgment-mechanism taxonomy (programmatic, model-graded, human-graded, preference comparison), the difference between system-specific evals and canonical benchmarks (MMLU, HumanEval, BIG-bench, GAIA), how evals drive prompt/model/scaffolding/tooling changes, why Goodhart's Law means higher eval scores are not always improvements, and the offline-eval-vs-production-telemetry distinction. Do NOT use for deterministic unit testing (use testing-strategy), production monitoring (use evaluation or error-tracking), general-software TDD (use testing-strategy), or the construction of individual eval rubrics and task sets (use agent-eval-design — it owns construction; this skill owns the iteration discipline).
eval-driven-dev
Build the evaluation discipline that separates production agentic products from demos — error analysis on real traces, the three-level eval pyramid (code assertions / LLM-as-judge / human review), binary judge outputs calibrated against human labels, and CI gates that block regression. Based on the Husain/Shankar methodology. Use whenever the user mentions evals, evaluation, LLM-as-judge, hallucination testing, regression testing for AI, quality measurement, error analysis, "how do I know if my agent works," failure modes, or grading agent outputs.
ai-evals
Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.
phoenix-evals
Build and run evaluators for AI/LLM applications using Phoenix.
advanced-evaluation
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.