Nemotron-CC Data Curation#

The Nemotron-CC recipe curates high-quality pretraining data from Common Crawl, producing datasets similar to nvidia/Nemotron-CC-v2. It serves as a reference for running these curation steps on your own data.

Built on NeMo Curator and Ray, the pipeline scales from a single machine to large GPU clusters.

Pipeline#

The recipe is a four-step pipeline that progressively refines raw web data into curated text and synthetic training data:

Common Crawl → Extract & Clean → Deduplicate → Quality Classify → Synthetic Data Generation

Step	Script	Description	Resources
1	`step_1-download_extract.py`	Download, extract, language ID, Unicode cleanup	CPU-only
2a	`step_2a-exact_dedup.py`	GPU-accelerated exact deduplication	GPU (identify), CPU (remove)
2b	`step_2b-fuzzy_dedup.py`	MinHash + LSH fuzzy deduplication	GPU (identify), CPU (remove)
2c	`step_2c-substring_dedup/`	Exact substring deduplication using suffix arrays	CPU-only
3	`step_3-quality_classification.py`	Ensemble quality scoring into 20 buckets	GPU (classify), CPU (ensemble)
4	`step_4-sdg.py`	LLM-based synthetic data generation on top-quality data	GPU (local inference server, default) or CPU + external LLM endpoint (with `--no-serve-model`)

Steps 1–3 progressively filter and annotate the data. Step 4 generates synthetic training data (diverse QA, distillation, knowledge extraction, knowledge lists) from the highest-quality documents (buckets 18–19).

Getting Started#

The recipe scripts live in:

src/nemotron/recipes/data/curation/nemotron-cc/

See the recipe README at src/nemotron/recipes/data/curation/nemotron-cc/README.md for detailed per-step documentation, resource recommendations, and usage examples.

Prerequisites#

NeMo Curator 1.2.0 (26.04 release) or newer, installed with Ray support
GPU(s) for steps 2a, 2b, and 3 (deduplication and classification)
For step 4, one of: GPU(s) to host a local inference server (default), an OpenAI-compatible endpoint (self-hosted vLLM/NIM or cloud, via --no-serve-model), or an NVIDIA Build API key

After Curation#

Once curated, the output can be tokenized and used for downstream model training.

Nemotron-CC Data Curation#

Pipeline#

Getting Started#

Prerequisites#

After Curation#

Further Reading#