design-ai-benchmarking

Design and validity review for studies that benchmark one or more AI systems against a human-expert panel as the reference. Covers the evaluation question and arm definition, decoupled multi-dimensional rubrics with anchors, planted calibration probes, reviewer-panel construction, inter-rater reliability targets, LLM-as-judge versus human-as-judge adjudication, construct-independence guards, and a structured rating-export schema. Use before data collection on an AI-vs-expert evaluation.
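
The description names a structured rating-export schema and planted calibration probes, but this listing does not show the skill's actual schema. As a rough sketch only, the idea might look like the following in Python. Every field name, the 1-5 scale, and the `probe_miss_rate` helper are hypothetical and are not taken from the skill itself.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


# Hypothetical record layout: one row per (item, rater, rubric dimension),
# so each dimension is scored and exported independently of the others.
@dataclass(frozen=True)
class Rating:
    item_id: str
    arm: str                # assumed labels, e.g. "ai" or "expert"
    rater_id: str
    dimension: str          # a single rubric dimension per row
    score: int              # assumed anchored ordinal scale, 1-5
    probe_expected: Optional[int] = None  # set only for planted probes


def to_jsonl(ratings):
    """Serialize ratings as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in ratings)


def probe_miss_rate(ratings, tolerance=1):
    """Per-rater share of planted probes scored more than `tolerance`
    points away from the expected value."""
    counts = {}
    for r in ratings:
        if r.probe_expected is None:
            continue  # ordinary item, not a calibration probe
        missed, total = counts.get(r.rater_id, (0, 0))
        missed += int(abs(r.score - r.probe_expected) > tolerance)
        counts[r.rater_id] = (missed, total + 1)
    return {rater: missed / total for rater, (missed, total) in counts.items()}


# Illustrative data only; not results from any real study.
ratings = [
    Rating("case-01", "ai", "r1", "accuracy", 4),
    Rating("probe-01", "ai", "r1", "accuracy", 2, probe_expected=1),
    Rating("probe-02", "ai", "r1", "accuracy", 5, probe_expected=2),
    Rating("probe-01", "ai", "r2", "accuracy", 1, probe_expected=1),
]

miss_rates = probe_miss_rate(ratings)
export = to_jsonl(ratings)
```

Keeping one row per rubric dimension is a design choice made for this sketch: it lets a later analysis step compute agreement per dimension without unpacking nested records. A rater with a high probe miss rate could be flagged for recalibration before the main ratings are interpreted.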

AI & Automation · 223 stars · 55 forks · Updated yesterday · MIT

Quality Score: 95/100

Score weights: Stars 20%, Recency 20%, Frontmatter 20%, Documentation 15%, Issue Health 10%, License 10%, Description 5%

Skill Content

# Design-AI-Benchmarking Skill

## Purpose

This skill pressure-tests an AI-vs-human-expert benchmark **before any ratings are collected**, so that the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the reported reliability is interpretable. It is the AI-evaluation specialization of `/design-study`: where `/design-study` reviews a study in general, this skill owns the specific machinery of comparing AI system(s) to a panel of human experts (or to each other) on rated outputs.

Use it when:

- one or more AI systems will be scored against a human-expert reference (reader study, annotation panel, AI-output evaluation, model-vs-model bench)
- a rubric and rating protocol must be locked before reviewers begin
- a benchmark feels vulnerable to "the highest score is just the most tautological item" or "low agreement, but we cannot tell why" criticism
- a reviewer or editor asks how the evaluation controlled for rater drift, leakage, or judge bias

Do **not** use it for: general study/validity review (use `/design-study`); statistical execution such as ICC or DeLong (use `/analyze-stats`); reporting-guideline item audits (use `/check-reporting`); or reviewing an already-written manuscript (use `/peer-review` or `/self-review`).

---

## Communication Rules

- Communicate with the user in their preferred language.
- Use English for statistical, machine-learning, and reporting-guideline terminology.
- Be direct about evaluation-validity ri...

Details

Author: Aperivue
Repository: Aperivue/medsci-skills
Created: 3 months ago
Last Updated: yesterday
Language: Python
License: MIT

Bundled in these plugins

medsci-skills