dataset-curation
Install: claude install-skill Enzogregorio/phd-skills
# Dataset Curation Methodology
You are helping a researcher curate, analyze, or expand a dataset with attention to bias, fairness, and quality.
## Step 1: Distribution Analysis
Before any curation action, understand the current state:
### Per-Class Distribution
- Count instances per class/label/tag
- Compute imbalance ratio (max_count / min_count)
- Identify severely underrepresented classes (count below 5% of the largest class's count)
- Visualize: bar chart of class frequencies sorted by count
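The checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed implementation: the function name `class_distribution`, the `threshold` parameter, and the toy labels are all hypothetical, and it assumes one label per instance.

```python
from collections import Counter

def class_distribution(labels, threshold=0.05):
    """Summarize per-class counts, imbalance ratio, and underrepresented classes.

    `threshold` is the fraction of the largest class's count below which a
    class is flagged (0.05 mirrors the 5% rule of thumb above).
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    max_count = max(counts.values())
    min_count = min(counts.values())
    underrepresented = sorted(
        cls for cls, n in counts.items() if n < threshold * max_count
    )
    return {
        "counts": counts.most_common(),  # (class, count) pairs, largest first
        "imbalance_ratio": max_count / min_count,
        "underrepresented": underrepresented,
    }

# Toy data, invented for illustration only.
labels = ["cat"] * 100 + ["dog"] * 40 + ["fox"] * 3
summary = class_distribution(labels)
```

The `counts` list is already sorted by frequency, so it can be passed directly to whatever plotting tool is used for the bar chart.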
### Co-occurrence Analysis
- Build co-occurrence matrix: which labels appear together
- Identify spurious correlations (e.g., "violence" always co-occurs with "male")
- Check for label leakage between splits
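For multi-label data, the co-occurrence matrix can be approximated by counting unordered label pairs per sample. The sketch below assumes each sample is a list of label strings; `cooccurrence`, `conditional_rate`, and the toy samples are hypothetical names and data. A conditional rate near 1.0 is one possible signal of a spurious correlation worth inspecting, not proof of one.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(samples):
    """Count how often each unordered label pair appears on the same sample."""
    pair_counts = Counter()
    label_counts = Counter()
    for labels in samples:
        unique = sorted(set(labels))  # dedupe and fix pair ordering
        label_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return pair_counts, label_counts

def conditional_rate(pair_counts, label_counts, a, b):
    """Estimate P(b | a): share of samples with label a that also carry b."""
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / label_counts[a]

# Toy samples, invented for illustration only.
samples = [
    ["violence", "male"],
    ["violence", "male"],
    ["male", "dialogue"],
    ["female", "dialogue"],
]
pairs, singles = cooccurrence(samples)
rate = conditional_rate(pairs, singles, "violence", "male")
```

Here every sample tagged "violence" is also tagged "male", which is the kind of pattern the bullet above describes.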
### Metadata Distribution
- Source diversity: how many sources/movies/documents contribute
- Temporal distribution: are all time periods represented?
- Content diversity: genre, style, domain coverage
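Source diversity can be summarized with a few numbers. The following sketch assumes each instance carries a source identifier; `source_diversity` and the example source names are hypothetical. It reports the number of sources, the share contributed by the largest one, and a normalized Shannon entropy (1.0 means all sources contribute equally).

```python
import math
from collections import Counter

def source_diversity(sources):
    """Summarize how evenly instances are spread across sources."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sources is empty")
    shares = [n / total for n in counts.values()]
    entropy = -sum(p * math.log(p) for p in shares)
    # Normalize by the maximum possible entropy; one source means no diversity.
    normalized = entropy / math.log(len(counts)) if len(counts) > 1 else 0.0
    return {
        "n_sources": len(counts),
        "top_source_share": max(shares),
        "normalized_entropy": normalized,
    }

# Toy data, invented for illustration only.
sources = ["movie_a"] * 8 + ["movie_b"] + ["movie_c"]
report = source_diversity(sources)
```

The same function could be applied to any categorical metadata field, such as decade or genre, to check temporal or content coverage.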
## Step 2: Bias Assessment
For each identified imbalance or correlation:
1. **Does it reflect the real world?** Some imbalances reflect genuine phenomena rather than collection artifacts
2. **Is it harmful?** Would a model trained on this data make unfair predictions?
3. **Is it fixable?** Can we collect more data, resample, or reweight?
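When reweighting is the chosen fix, one common heuristic is inverse-frequency class weights, `n_samples / (n_classes * class_count)`. The sketch below is only one option among several (resampling or collecting more data may suit better), and the name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy data, invented for illustration only.
labels = ["cat"] * 6 + ["dog"] * 3 + ["fox"] * 1
weights = balanced_class_weights(labels)
```

With these weights, each class contributes the same total weight to a loss function, so rare classes are not drowned out by frequent ones.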
### Fairness Dimensions
Check for bias along relevant protected attributes:
- Gender representation (if applicable)
- Racial/ethnic representation (if applicable)
- Age distribution (if applicable)
- Geographic/cultural diversity (if applicable)
### Bias Metrics
- Demographic parity: equal positive r