dvc-dataset-versioning

Solid

Dataset versioning skill using DVC for tracking data changes, managing data pipelines, and ensuring reproducibility.

Data & Documents · 1,160 stars · 71 forks · Updated today · MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# dvc-dataset-versioning

## Overview

Dataset versioning skill using DVC (Data Version Control) for tracking data changes, managing data pipelines, and ensuring reproducibility in ML workflows.

## Capabilities

- Dataset version tracking
- Data pipeline definition and execution
- Remote storage management (S3, GCS, Azure, etc.)
- Reproducibility enforcement
- Data lineage tracking
- Experiment comparison with data versions
- Cache management for large datasets

## Target Processes

- Data Collection and Validation Pipeline
- ML Model Retraining Pipeline
- Feature Store Implementation

## Tools and Libraries

- DVC
- Git
- Remote storage SDKs (boto3, google-cloud-storage, etc.)

## Input Schema

```json
{
  "type": "object",
  "required": ["action"],
  "properties": {
    "action": {
      "type": "string",
      "enum": ["init", "add", "push", "pull", "diff", "checkout", "run", "repro"],
      "description": "DVC action to perform"
    },
    "paths": {
      "type": "array",
      "items": { "type": "string" },
      "description": "File or directory paths to track"
    },
    "remote": {
      "type": "string",
      "description": "Remote storage name"
    },
    "revision": {
      "type": "string",
      "description": "Git revision for checkout/diff"
    },
    "pipeline": {
      "type": "object",
      "description": "Pipeline stage definition for run action"
    }
  }
}
```

## Output Schema

```json
{
  "type": "object",
  "required": ["status", "action"],
  "prop...
```
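To make the input schema more concrete, here is a minimal sketch of how a request with those fields might be translated into DVC and Git command lines. It is not taken from the skill's implementation: the function name `build_commands` and the mapping itself are assumptions made for illustration. The `run` action is left unhandled because the excerpt does not specify the shape of the `pipeline` object, and the `dvc run` command has been removed from recent DVC releases in favour of `dvc stage add`.

```python
from typing import Any

# Actions copied from the "action" enum in the input schema above.
ACTIONS = {"init", "add", "push", "pull", "diff", "checkout", "run", "repro"}


def build_commands(request: dict[str, Any]) -> list[list[str]]:
    """Map a request dict to a list of argv lists (hypothetical mapping)."""
    action = request.get("action")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")

    paths = list(request.get("paths", []))
    remote = request.get("remote")
    revision = request.get("revision")

    if action == "init":
        return [["dvc", "init"]]

    if action == "add":
        if not paths:
            raise ValueError("'add' needs at least one path")
        return [["dvc", "add", *paths]]

    if action in ("push", "pull"):
        cmd = ["dvc", action]
        if remote:
            cmd += ["-r", remote]  # -r/--remote selects a named remote
        return [cmd + paths]

    if action == "diff":
        # With one revision, `dvc diff` compares it against the workspace.
        return [["dvc", "diff", *([revision] if revision else [])]]

    if action == "checkout":
        cmds = []
        if revision:
            # Assumption: switch the .dvc metafiles with Git first,
            # then let DVC sync the data files to match them.
            cmds.append(["git", "checkout", revision])
        cmds.append(["dvc", "checkout", *paths])
        return cmds

    if action == "repro":
        return [["dvc", "repro", *paths]]

    # "run": the pipeline object's fields are not given in the excerpt.
    raise NotImplementedError("pipeline stage definition is unspecified")
```

Each returned argv list could then be passed to `subprocess.run(cmd, check=True)` from inside a Git repository with DVC installed. Returning the commands instead of executing them keeps the mapping easy to inspect and test.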

Details

Author: a5c-ai
Repository: a5c-ai/babysitter
Created: 4 months ago
Last Updated: today
Language: JavaScript
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

Data & Documents Listed

version-dataset

Dataset version control for research reproducibility. Builds a deterministic content-hash manifest of a dataset (file SHA-256 + tabular schema + per-column value hashes), verifies a later copy against it to detect drift (schema change, row-count change, value changes), and diffs two manifests. Use to prove an analysis ran on the intended data, lock a dataset version, or reproducibility-lock bundled demos.

126 Updated today
Aperivue
AI & Automation Solid

data-versioning-manager

Skill for managing data versions and provenance

1,160 Updated today
a5c-ai
Data & Documents Listed

dataset-curator

Use this skill when designing, cleaning, deduplicating, or documenting datasets for model training and evaluation including schema design, class imbalance handling, and train/val/test splits. Not for running model training or hyperparameter tuning. Not for real-time data pipeline engineering.

15 Updated 2 days ago
NickCrew
Data & Documents Solid

dataset-transformation

Generates a Jupyter notebook that transforms datasets between ML schemas for model training or evaluation. Use when the user says "transform", "convert", "reformat", "change the format", or when a dataset's schema needs to change to match the target format — always use this skill for format changes rather than writing inline transformation code. Supports OpenAI chat, SageMaker SFT/DPO/RLVR, HuggingFace preference, Bedrock Nova, VERL, and custom JSONL formats from local files or S3.

765 Updated 2 days ago
awslabs
Data & Documents Listed

hugging-face-datasets

Create and manage datasets on Hugging Face Hub. Supports initializing repos, defining configs/system prompts, streaming row updates, and SQL-based dataset querying/transformation. Designed to work alongside HF MCP server for comprehensive dataset workflows.

3 Updated today
tayyabexe