nemo-curator

GPU-accelerated data curation for LLM training. Supports text, image, video, and audio data. Features include fuzzy deduplication (16× faster than CPU), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use it for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.

AI & Automation · 27,984 stars · 2,901 forks · Updated today · MIT

Quality Score: 99/100

Stars (weight 20%): 100
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100
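
The page does not say how the overall score is derived from the components. As a rough check, the sketch below assumes a plain weighted average of the listed component scores (an assumption, not the site's documented formula). Under that assumption the components give 89, not the displayed 99, so the site presumably combines them differently.

```python
# Component weights (in percent) and scores, copied from the listing above.
# ASSUMPTION: the overall score is a plain weighted average; the page does
# not document its actual formula.
components = {
    "Stars": (20, 100),
    "Recency": (20, 100),
    "Frontmatter": (20, 70),
    "Documentation": (15, 100),
    "Issue Health": (10, 50),
    "License": (10, 100),
    "Description": (5, 100),
}

# Weights should cover the whole score.
total_weight = sum(weight for weight, _ in components.values())

# Weighted average on a 0-100 scale.
weighted_score = sum(weight * score for weight, score in components.values()) / total_weight

print(f"total weight: {total_weight}%")
print(f"weighted average: {weighted_score}")
```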

Skill Content

# NeMo Curator - GPU-Accelerated Data Curation

NVIDIA's toolkit for preparing high-quality training data for LLMs.

## When to use NeMo Curator

**Use NeMo Curator when:**
- Preparing LLM training data from web scrapes (Common Crawl)
- You need fast deduplication (16× faster than CPU)
- Curating multi-modal datasets (text, images, video, audio)
- Filtering low-quality or toxic content
- Scaling data processing across a GPU cluster

**Performance**:
- **16× faster** fuzzy deduplication (8TB RedPajama v2)
- **40% lower TCO** vs CPU alternatives
- **Near-linear scaling** across GPU nodes

**Use alternatives instead**:
- **datatrove**: CPU-based, open-source data processing
- **dolma**: Allen AI's data toolkit
- **Ray Data**: General ML data processing (no curation focus)

## Quick start

### Installation

```bash
# Text curation (CUDA 12)
uv pip install "nemo-curator[text_cuda12]"

# All modalities
uv pip install "nemo-curator[all_cuda12]"

# CPU-only (slower)
uv pip install "nemo-curator[cpu]"
```

### Basic text curation pipeline

```python
from nemo_curator import ScoreFilter, Modify
from nemo_curator.datasets import DocumentDataset
import pandas as pd

# Load data
df = pd.DataFrame({"text": ["Good document", "Bad doc", "Excellent text"]})
dataset = DocumentDataset(df)

# Quality filtering
def quality_score(doc):
    return len(doc["text"].split()) > 5  # Filter short docs

filtered = ScoreFilter(quality_score)(dataset)

# Deduplication
from nemo_curator.modules import ExactDuplic...
```
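
The pipeline excerpt above is cut off before the deduplication step. To show the idea behind its two steps (a word-count quality filter followed by exact deduplication), here is a minimal standard-library sketch. It does not use or reproduce the NeMo Curator API; `word_count_ok` and `exact_dedup` are hypothetical helper names, and hashing with SHA-256 is an illustrative choice rather than what the library is known to do.

```python
import hashlib


def word_count_ok(text: str, min_words: int = 5) -> bool:
    """Hypothetical quality heuristic: keep texts with more than `min_words` words."""
    return len(text.split()) > min_words


def exact_dedup(texts):
    """Hypothetical exact deduplication: keep the first copy of each distinct text.

    Texts are compared by the SHA-256 digest of their UTF-8 bytes.
    """
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


docs = [
    "Good document",
    "This is a longer document with enough words to pass",
    "This is a longer document with enough words to pass",
    "Bad doc",
]

# Step 1: drop short documents. Step 2: drop exact duplicates.
filtered = [d for d in docs if word_count_ok(d)]
deduped = exact_dedup(filtered)
print(deduped)
```

Note that with a threshold of more than five words, the three two-word sample documents in the excerpt above would all be filtered out, which is why this sketch adds a longer document.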

Details

Author: davila7
Repository: davila7/claude-code-templates
Created: 11 months ago
Last Updated: today
Language: Python
License: MIT