Nemotron-CC Data Curation#
The Nemotron-CC recipe curates high-quality pretraining data from Common Crawl, producing datasets similar to nvidia/Nemotron-CC-v2. It serves as a reference for running these curation steps on your own data.
Built on NeMo Curator and Ray, the pipeline scales from a single machine to large GPU clusters.
Pipeline#
The recipe is a four-step pipeline that progressively refines raw web data into curated text and synthetic training data:
Common Crawl → Extract & Clean → Deduplicate → Quality Classify → Synthetic Data Generation
Step |
Script |
Description |
Resources |
|---|---|---|---|
1 |
|
Download, extract, language ID, Unicode cleanup |
CPU-only |
2a |
|
GPU-accelerated exact deduplication |
GPU (identify), CPU (remove) |
2b |
|
MinHash + LSH fuzzy deduplication |
GPU (identify), CPU (remove) |
2c |
|
Exact substring deduplication using suffix arrays |
CPU-only |
3 |
|
Ensemble quality scoring into 20 buckets |
GPU (classify), CPU (ensemble) |
4 |
|
LLM-based synthetic data generation on top-quality data |
GPU (local inference server, default) or CPU + external LLM endpoint (with |
Steps 1–3 progressively filter and annotate the data. Step 4 generates synthetic training data (diverse QA, distillation, knowledge extraction, knowledge lists) from the highest-quality documents (buckets 18–19).
Getting Started#
The recipe scripts live in:
src/nemotron/recipes/data/curation/nemotron-cc/
See the recipe README at src/nemotron/recipes/data/curation/nemotron-cc/README.md for detailed per-step documentation, resource recommendations, and usage examples.
Prerequisites#
NeMo Curator 1.2.0 (26.04 release) or newer, installed with Ray support
GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud, via
--no-serve-model), or an NVIDIA Build API key
After Curation#
Once curated, the output can be tokenized and used for downstream model training.
Further Reading#
Nemotron-CC paper — methodology and evaluation
nvidia/Nemotron-CC-v2 — released dataset on Hugging Face
NeMo Curator — the underlying data curation library
Data Preparation — last-mile processing for training