clean-data

Featured

Interactive data profiling and cleaning assistant for medical research. Three-stage workflow (profile, flag, code-generate) with user approval gates at each step. Handles missing values, outliers, duplicates, and type mismatches in CSV/Excel clinical data. Does NOT auto-clean — all decisions require researcher confirmation.

Data & Documents · 220 stars · 55 forks · Updated today · MIT

Quality Score: 95/100

Components (weight, score where listed): Stars (20%), Recency (20%, 100), Frontmatter (20%), Documentation (15%, 100), Issue Health (10%), License (10%, 100), Description (5%, 100)

Skill Content

# Data Profiling and Cleaning Skill

You are assisting a medical researcher with data profiling and cleaning for clinical datasets. This is a three-stage interactive workflow. You generate code and reports -- you do NOT auto-clean data. Every cleaning decision requires explicit researcher confirmation.

## Philosophy

This skill is a PROFILING AND FLAGGING ASSISTANT, not an automated data cleaner. Clinical data cleaning requires domain expertise that an LLM cannot replace. Every cleaning decision must be confirmed by the researcher.

**DATA PRIVACY WARNING**

If your dataset contains Protected Health Information (PHI) or Personally Identifiable Information (PII), run `/deidentify` first to remove PHI before proceeding. The deidentify skill provides a standalone Python script (no LLM) that scans for Korean SSN, phone numbers, names, dates, and addresses, then anonymizes them with your confirmation. If `*_deidentified.*` files exist in the working directory, use those instead of raw data.

Alternatively:

1. Provide only the data dictionary / codebook for profiling guidance
2. Or use a local-only environment with no network access

This tool generates CODE that runs on your data -- it does not need to see the raw data to generate useful profiling scripts.

## Reference Files

- **Profiling template**: `${CLAUDE_SKILL_DIR}/references/profiling_template.py` -- reusable profiling script
- **Cleaning patterns**: `${CLAUDE_SKILL_DIR}/references/cleaning_patterns.md` -- common clinic...
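As a rough illustration of the profile-and-flag idea described above, the following standard-library sketch reports missing values, duplicate rows, type mismatches, and outliers without changing any data. It is not the bundled `profiling_template.py`, whose contents are not shown here. The names `profile_rows`, `MISSING_TOKENS`, `numeric_threshold`, and `iqr_factor` are hypothetical, the 1.5 × IQR outlier rule is an assumed choice (the skill text does not name a method), and the sample records are synthetic.

```python
import csv
import io
import statistics
from collections import Counter

# Assumed missing-value spellings; a real codebook may define others.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}

# Synthetic example records (not real patient data).
SAMPLE = """patient_id,age,sbp
P01,34,118
P02,41,122
P03,,125
P04,38,abc
P05,45,120
P06,39,119
P07,36,300
P08,40,121
P09,37,117
P09,37,117
"""


def is_missing(value):
    return value is None or value.strip().lower() in MISSING_TOKENS


def to_float(value):
    try:
        return float(value)
    except ValueError:
        return None


def profile_rows(rows, numeric_threshold=0.8, iqr_factor=1.5):
    """Return flags for researcher review; the input rows are never modified."""
    columns = list(rows[0].keys()) if rows else []
    report = {"n_rows": len(rows), "missing": {}, "type_mismatch": {}, "outliers": {}}

    # Duplicates: rows whose values are identical across every column.
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    report["duplicate_rows"] = sum(n - 1 for n in counts.values() if n > 1)

    for col in columns:
        present = [(i, row[col]) for i, row in enumerate(rows) if not is_missing(row[col])]
        report["missing"][col] = len(rows) - len(present)

        parsed = [(i, raw, to_float(raw)) for i, raw in present]
        numeric = [(i, num) for i, _, num in parsed if num is not None]
        # A column counts as numeric only if most non-missing values parse as numbers.
        if not present or len(numeric) / len(present) < numeric_threshold:
            continue

        # Type mismatches: non-numeric entries inside a mostly numeric column.
        bad = [(i, raw) for i, raw, num in parsed if num is None]
        if bad:
            report["type_mismatch"][col] = bad

        # Outliers: values outside the IQR fences (assumed rule, flag only).
        if len(numeric) >= 4:
            q1, _, q3 = statistics.quantiles([v for _, v in numeric], n=4)
            spread = q3 - q1
            low, high = q1 - iqr_factor * spread, q3 + iqr_factor * spread
            flagged = [(i, v) for i, v in numeric if v < low or v > high]
            if flagged:
                report["outliers"][col] = flagged
    return report


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = profile_rows(rows)
for section, findings in report.items():
    print(section, findings)
```

Each finding is a zero-based row index paired with the offending value, so the researcher can inspect the record and decide what to do with it. Nothing is dropped, imputed, or recoded by the sketch itself, in keeping with the approval-gate approach the skill describes.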

Details

Author: Aperivue
Repository: Aperivue/medsci-skills
Created: 3 months ago
Last Updated: today
Language: Python
License: MIT

Bundled in these plugins

medsci-skills

Similar Skills

Semantically similar based on skill content — not just same category

Code & Development Featured

deidentify

De-identify clinical research data before LLM-assisted analysis. Standalone Python CLI detects PHI via regex + heuristics with 10 country locale packs (kr, us, jp, cn, de, uk, fr, ca, au, in). Interactive terminal review. No LLM touches raw data — the script runs locally without any network or AI calls.

220 Updated today

Aperivue

Data & Documents Solid

data-cleaning-brief

Writes clear, step-by-step instructions for cleaning a messy or inconsistent dataset — specifying exactly what needs to be standardised, corrected, or removed to make the data ready for analysis and publication.

16 Updated today

ur-grue

Data & Documents Featured

analyze-stats

Statistical analysis for medical research papers. Generates reproducible Python/R code with publication-ready tables and figures. Supports diagnostic accuracy, inter-rater agreement, meta-analysis, survival analysis, survey data, group comparisons, regression, propensity score, and repeated measures.

220 Updated today

Aperivue