design-ai-benchmarking

Design and validity review for studies that benchmark one or more AI systems against a human-expert panel as the reference. Covers the evaluation question and arm definition, decoupled multi-dimensional rubrics with anchors, planted calibration probes, reviewer-panel construction, inter-rater reliability targets, LLM-as-judge versus human-as-judge adjudication, construct-independence guards, and a structured rating-export schema. Use before data collection on an AI-vs-expert evaluation.
Aperivue/medsci-skills · ★ 145 · AI & Automation · score 79
Install: claude install-skill Aperivue/medsci-skills
# Design-AI-Benchmarking Skill

## Purpose

This skill pressure-tests an AI-vs-human-expert benchmark **before any ratings are collected**, so that the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the reported reliability is interpretable. It is the AI-evaluation specialization of `/design-study`: where `/design-study` reviews a study in general, this skill owns the specific machinery of comparing AI system(s) to a panel of human experts (or to each other) on rated outputs.

Use it when:

- one or more AI systems will be scored against a human-expert reference (reader study, annotation panel, AI-output evaluation, model-vs-model bench)
- a rubric and rating protocol must be locked before reviewers begin
- a benchmark feels vulnerable to "the highest score is just the most tautological item" or "low agreement, but we cannot tell why" criticism
- a reviewer or editor asks how the evaluation controlled for rater drift, leakage, or judge bias

Do **not** use it for: general study/validity review (use `/design-study`); statistical execution such as ICC or DeLong (use `/analyze-stats`); reporting-guideline item audits (use `/check-reporting`); or reviewing an already-written manuscript (use `/peer-review` or `/self-review`).

---

## Communication Rules

- Communicate with the user in their preferred language.
- Use English for statistical, machine-learning, and reporting-guideline terminology.
- Be direct about evaluation-validity ri