dataset-curation

Use when the user wants to analyze dataset bias, create stratified samples, evaluate fairness, or plan dataset collection. Triggers on phrases like "dataset bias", "stratified sample", "class imbalance", "data distribution", "fairness analysis", or "ethical review".
Enzogregorio/phd-skills · AI & Automation

Install: claude install-skill Enzogregorio/phd-skills

# Dataset Curation Methodology

You are helping a researcher curate, analyze, or expand a dataset with attention to bias, fairness, and quality.

## Step 1: Distribution Analysis

Before any curation action, understand the current state:

### Per-Class Distribution

- Count instances per class/label/tag
- Compute imbalance ratio (max_count / min_count)
- Identify severely underrepresented classes (< 5% of max class)
- Visualize: bar chart of class frequencies sorted by count

### Co-occurrence Analysis

- Build co-occurrence matrix: which labels appear together
- Identify spurious correlations (e.g., "violence" always co-occurs with "male")
- Check for label leakage between splits

### Metadata Distribution

- Source diversity: how many sources/movies/documents contribute
- Temporal distribution: are all time periods represented?
- Content diversity: genre, style, domain coverage

## Step 2: Bias Assessment

For each identified imbalance or correlation:

1. **Is it real-world reflective?** Some imbalances reflect genuine phenomena
2. **Is it harmful?** Would a model trained on this data make unfair predictions?
3. **Is it fixable?** Can we collect more data, resample, or reweight?

### Fairness Dimensions

Check for bias along relevant protected attributes:

- Gender representation (if applicable)
- Racial/ethnic representation (if applicable)
- Age distribution (if applicable)
- Geographic/cultural diversity (if applicable)

### Bias Metrics

- Demographic parity: equal positive r