Install: `claude install-skill Aperivue/medsci-skills`
# Design-AI-Benchmarking Skill
## Purpose
This skill pressure-tests an AI-vs-human-expert benchmark **before any ratings are collected**, so that
the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the
reported reliability is interpretable. It is the AI-evaluation specialization of `/design-study`: where
`/design-study` reviews a study in general, this skill owns the specific machinery of comparing one or
more AI systems to a panel of human experts (or to each other) on rated outputs.
Use it when:
- one or more AI systems will be scored against a human-expert reference (reader study, annotation
panel, AI-output evaluation, model-vs-model bench)
- a rubric and rating protocol must be locked before reviewers begin
- a benchmark feels vulnerable to "the highest score is just the most tautological item" or
"low agreement, but we cannot tell why" criticism
- a reviewer or editor asks how the evaluation controlled for rater drift, leakage, or judge bias
Do **not** use it for: general study/validity review (use `/design-study`); statistical execution such
as ICC or DeLong (use `/analyze-stats`); reporting-guideline item audits (use `/check-reporting`);
or reviewing an already-written manuscript (use `/peer-review` or `/self-review`).
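
One of the triggers above, locking a rubric and rating protocol before reviewers begin, can be made checkable with a content fingerprint. The following is a minimal sketch, not part of this skill's defined behavior: the function names `rubric_fingerprint` and `is_unchanged`, and the rubric layout (`scale`, `items`), are hypothetical and chosen only for illustration.

```python
import hashlib
import json


def rubric_fingerprint(rubric: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the rubric.

    Keys are sorted so that reordering fields does not change the digest;
    list order (for example, the order of rubric items) still matters.
    """
    canonical = json.dumps(
        rubric, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unchanged(rubric: dict, locked_fingerprint: str) -> bool:
    """Check a rubric against the fingerprint recorded at lock time."""
    return rubric_fingerprint(rubric) == locked_fingerprint


# Hypothetical rubric layout, for illustration only.
rubric = {
    "scale": {"min": 1, "max": 5},
    "items": [
        {"id": "accuracy", "prompt": "Is the output factually correct?"},
        {"id": "completeness", "prompt": "Does the output cover all findings?"},
    ],
}

# Record this value before any ratings are collected.
locked = rubric_fingerprint(rubric)

# Same content with a different key order: still matches the lock.
reordered = {"items": rubric["items"], "scale": {"max": 5, "min": 1}}

# A post-hoc change to the scale: no longer matches the lock.
edited = {"scale": {"min": 1, "max": 7}, "items": rubric["items"]}

print(is_unchanged(reordered, locked))  # True
print(is_unchanged(edited, locked))     # False
```

Recording the fingerprint alongside the protocol gives a reviewer or editor a simple way to confirm that the rubric used for rating is the one that was locked.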
---
## Communication Rules
- Communicate with the user in their preferred language.
- Use English for statistical, machine-learning, and reporting-guideline terminology.
- Be direct about evaluation-validity ri