version-dataset

Dataset version control for research reproducibility. Builds a deterministic content-hash manifest of a dataset (file SHA-256 + tabular schema + per-column value hashes), verifies a later copy against it to detect drift (schema change, row-count change, value changes), and diffs two manifests. Use to prove an analysis ran on the intended data, lock a dataset version, or reproducibility-lock bundled demos.
Aperivue/medsci-skills · ★ 126 · Data & Documents · score 82
Install: claude install-skill Aperivue/medsci-skills
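The fingerprint-and-verify idea in the description above can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the skill's `version_dataset.py`: the function names `build_manifest` and `detect_drift` and the manifest keys `file_sha256`, `schema`, `row_count`, and `column_hashes` are assumptions made for the example, and the real manifest structure is whatever `manifest_schema.md` defines.

```python
import csv
import hashlib
import io


def build_manifest(raw: bytes) -> dict:
    """Fingerprint CSV bytes. Key names here are hypothetical, not the skill's schema."""
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))
    header, body = rows[0], rows[1:]
    column_hashes = {}
    for i, name in enumerate(header):
        h = hashlib.sha256()
        for row in body:
            value = row[i].encode("utf-8")
            # Length-prefix each value so ("ab", "c") and ("a", "bc") hash differently.
            h.update(len(value).to_bytes(8, "big"))
            h.update(value)
        column_hashes[name] = h.hexdigest()
    # No timestamps or other run-dependent fields: same bytes in, same manifest out.
    return {
        "file_sha256": hashlib.sha256(raw).hexdigest(),
        "schema": header,
        "row_count": len(body),
        "column_hashes": column_hashes,
    }


def detect_drift(old: dict, new: dict) -> list[str]:
    """Return a list of drift categories between two manifests (empty if none)."""
    drift = []
    if old["schema"] != new["schema"]:
        drift.append("schema")
    if old["row_count"] != new["row_count"]:
        drift.append("row_count")
    shared = set(old["column_hashes"]) & set(new["column_hashes"])
    for name in sorted(shared):
        if old["column_hashes"][name] != new["column_hashes"][name]:
            drift.append(f"values:{name}")
    return drift


locked = build_manifest(b"id,age\n1,30\n2,41\n")
later = build_manifest(b"id,age\n1,30\n2,42\n")
print(detect_drift(locked, later))
```

Hashing each column separately, in addition to the whole file, lets a drift report say which column changed rather than only that the file did.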
# Version Dataset Skill

You help a medical researcher put a dataset under version control: fingerprint it, detect when it changes, and lock a reproducible version. This guards the data-integrity rule (an analysis must run on the data it claims to, with a fixed seed) by making any drift between runs loud instead of silent.

## Communication Rules

- Communicate with the user in their preferred language.
- Manifest fields, drift reports, and provenance notes are in English.

## Philosophy

A dataset is an input to a result; if it changes silently, every downstream number is suspect. This skill records a deterministic fingerprint (file SHA-256 plus, for tabular files, schema and per-column value hashes) so a later run can *prove* the inputs are unchanged. It does not alter data, and it records nothing non-deterministic (no timestamps unless explicitly passed), so the same data always yields the same manifest.

## Reference Files

- **Manifest schema + workflow**: `${CLAUDE_SKILL_DIR}/references/manifest_schema.md` covers the manifest.json structure, what each drift category means, and the non-deterministic-artifact policy (PPTX/DOCX timestamps). Read before interpreting drift.

## Deterministic Script

```bash
# Build a manifest (record the analysis seed + provenance)
python "${CLAUDE_SKILL_DIR}/scripts/version_dataset.py" manifest data.csv \
  --out manifest.json --seed 42 --provenance "KNHANES 2018 extract v1"

# Verify a later copy against it (CI / pre-analysis gate)
python "