About Text Curation
NeMo Curator provides text curation tools for preparing high-quality data for large language model (LLM) training. The toolkit includes processors for loading, filtering, formatting, and analyzing text data from sources such as web crawls, Wikipedia, arXiv, and your own datasets.
Use Cases
Clean and prepare web-scraped data from sources like Common Crawl, Wikipedia, and arXiv
Generate high-quality synthetic data using NVIDIA’s advanced LLMs
Create custom text curation pipelines for specific domain needs
Scale text processing across CPU and GPU clusters efficiently
Architecture
The following diagram provides a high-level outline of NeMo Curator’s text curation architecture.

Introduction
Master the fundamentals of NeMo Curator and set up your text processing environment.
Learn about DocumentDataset and other core data structures for efficient text curation
Learn prerequisites, setup instructions, and initial configuration for text curation
Curation Tasks
Load Data
Import your text data from various sources into NeMo Curator’s processing pipeline.
Extract and process scientific papers from arXiv
Load and preprocess text data from Common Crawl web archives
Load your own text datasets in various formats
Import and process Wikipedia articles for training datasets
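As a rough illustration of what loading your own text dataset involves, the sketch below reads JSON Lines records with plain Python. It does not use NeMo Curator's API: the `read_jsonl` helper, the `text` field name, and the sample records are all assumptions made for this example.

```python
import io
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str], text_field: str = "text") -> Iterator[dict]:
    """Yield one record per non-empty JSON Lines row that has a text field.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        record = json.loads(line)
        if text_field in record:
            yield record  # rows without the text field are dropped


# An in-memory stand-in for a .jsonl file (made-up sample data).
sample = io.StringIO(
    '{"id": 1, "text": "First document."}\n'
    "\n"
    '{"id": 2, "title": "no text field"}\n'
    '{"id": 3, "text": "Second document."}\n'
)
docs = list(read_jsonl(sample))
```

In practice the same function could be given an open file handle instead of the in-memory sample; a real pipeline would rely on the toolkit's own loaders to scale across workers.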
Process Data
Transform and enhance your text data through comprehensive processing and curation steps.
Score and remove low-quality content using heuristics and ML classifiers
Remove duplicate and near-duplicate documents efficiently
Clean, normalize, and transform text content
Handle multilingual content and language-specific processing
Domain-specific processing for code, bitext, and synthetic data
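To make two of these steps concrete, here is a minimal sketch of a heuristic quality filter followed by exact deduplication, written in plain Python rather than with NeMo Curator's modules. The word-count threshold, the `normalize` and `curate` helpers, and the sample texts are assumptions for illustration. Near-duplicate detection needs a fuzzy technique and is not shown here.

```python
import hashlib
import re


def word_count_ok(text: str, min_words: int = 5) -> bool:
    """Heuristic filter: keep documents with at least `min_words` words."""
    return len(text.split()) >= min_words


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def curate(texts, min_words=5):
    """Apply the heuristic filter, then drop exact duplicates by hash.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    seen = set()
    kept = []
    for text in texts:
        if not word_count_ok(text, min_words):
            continue  # too short to be useful
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical to an earlier document after normalization
        seen.add(digest)
        kept.append(text)
    return kept


# Made-up sample documents.
raw = [
    "The quick brown fox jumps over the lazy dog.",
    "Too short.",
    "the quick  brown fox jumps over the lazy dog.",
    "Curation pipelines usually chain several filters together.",
]
curated = curate(raw)
```

Here the second text fails the word-count heuristic and the third is dropped as a duplicate of the first once case and spacing are normalized, so two documents remain. Hashing normalized text keeps memory use proportional to the number of unique documents rather than to their total size.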
Generate Data
Create high-quality synthetic text data using advanced language models and customizable pipelines.
Learn how to connect to OpenAI-compatible APIs and self-hosted models
Generate synthetic prompts, dialogues, and entity classifications using the Nemotron-4 340B approach
Combine synthetic data generation with other NeMo Curator modules for filtering and processing