Nemotron-CC Data Curation#
The Nemotron-CC recipe curates high-quality pretraining data from Common Crawl, producing datasets similar to nvidia/Nemotron-CC-v2. It serves as a reference for running these curation steps on your own data.
Built on NeMo Curator and Ray, the pipeline scales from a single machine to large GPU clusters.
Pipeline#
The recipe is a four-step pipeline that progressively refines raw web data into curated text and synthetic training data:
Common Crawl → Extract & Clean → Deduplicate → Quality Classify → Synthetic Data Generation
Step |
Script |
Description |
Resources |
|---|---|---|---|
1 |
|
Download, extract, language ID, Unicode cleanup |
CPU-only |
2a |
|
GPU-accelerated exact deduplication |
GPU (identify), CPU (remove) |
2b |
|
MinHash + LSH fuzzy deduplication |
GPU (identify), CPU (remove) |
2c |
|
Exact substring deduplication using suffix arrays |
CPU-only |
3 |
|
Ensemble quality scoring into 20 buckets |
GPU (classify), CPU (ensemble) |
4 |
|
LLM-based synthetic data generation on top-quality data |
CPU + LLM endpoint |
Steps 1–3 progressively filter and annotate the data. Step 4 generates synthetic training data (diverse QA, distillation, knowledge extraction, knowledge lists) from the highest-quality documents (buckets 18–19).
Getting Started#
The recipe scripts live in:
src/nemotron/recipes/data_curation/nemotron-cc/
See the recipe README at src/nemotron/recipes/data_curation/nemotron-cc/README.md for detailed per-step documentation, resource recommendations, and usage examples.
Prerequisites#
NeMo Curator installed with Ray support
GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
Access to an OpenAI-compatible LLM endpoint for step 4 (NVIDIA NIM, vLLM, or cloud API)
After Curation#
Once curated, the output can be tokenized and used for downstream model training.
Further Reading#
Nemotron-CC paper — methodology and evaluation
nvidia/Nemotron-CC-v2 — released dataset on Hugging Face
NeMo Curator — the underlying data curation library
Data Preparation — last-mile processing for training