Text Data Curation Pipeline
This guide describes the architecture of NeMo Curator’s text curation pipeline, from data acquisition through final dataset preparation.
Architecture Overview
The following diagram provides a high-level outline of NeMo Curator’s text curation architecture:
 
Pipeline Stages
NeMo Curator’s text curation pipeline consists of six stages that transform raw data sources into high-quality datasets ready for LLM training:
1. Data Sources
Multiple input sources provide the foundation for text curation:
- Cloud storage (S3, GCS, Azure) 
- Internet sources (Common Crawl, ArXiv, Wikipedia) 
- Local workstation files 
2. Data Acquisition & Processing
Raw data is downloaded, extracted, and converted into standardized formats:
- Download & Extraction: Retrieve and process remote data sources 
- Cleaning & Pre-processing: Convert formats and normalize text 
- DocumentDataset Creation: Standardize data into NeMo Curator’s core data structure 
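The standardization step above can be illustrated with a minimal sketch. It uses only the Python standard library and is not NeMo Curator's API: the function names (`normalize_text`, `to_records`, `to_jsonl`) and the record fields (`id`, `text`, `source`) are assumptions chosen for illustration, and JSON Lines is assumed here as the standardized format.

```python
import json
import unicodedata


def normalize_text(raw: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", raw)
    return " ".join(text.split())


def to_records(raw_docs, source):
    """Turn raw strings into uniform records, skipping empty documents.

    The field names ("id", "text", "source") are hypothetical.
    """
    records = []
    for i, raw in enumerate(raw_docs):
        text = normalize_text(raw)
        if text:
            records.append({"id": f"{source}-{i}", "text": text, "source": source})
    return records


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one document per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    print(to_jsonl(to_records(["  Hello\n\tworld  ", "   "], "demo")))
```

Giving every document the same fields up front means that later stages (filtering, deduplication, blending) can operate on a single schema regardless of where the data came from.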
3. Quality Assessment & Filtering
Multiple filtering stages ensure data quality:
- Heuristic Quality Filtering: Rule-based filters for basic quality checks 
- Model-based Quality Filtering: Classifier models that score content quality 
- PII Removal: Privacy-preserving data cleaning 
- Task Decontamination: Remove potential test set contamination 
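To make the rule-based parts of this stage concrete, here is a small standard-library sketch of a heuristic filter, an email redactor, and an n-gram overlap check against benchmark text. It is an illustration only, not NeMo Curator's implementation: the function names, the thresholds (`min_words`, `max_symbol_ratio`, `n`), and the email pattern are all assumptions. Model-based filtering is not shown because it requires a trained classifier.

```python
import re

# Simplified email pattern, assumed for illustration; real PII detection
# covers many more entity types than email addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Rule-based check: enough words, and not dominated by symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def redact_emails(text, placeholder="<EMAIL>"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def ngrams(text, n):
    """Return the set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(text, benchmark_texts, n=5):
    """Flag a document that shares any word n-gram with a benchmark text."""
    doc_grams = ngrams(text, n)
    return any(doc_grams & ngrams(b, n) for b in benchmark_texts)


if __name__ == "__main__":
    doc = "Contact jane.doe@example.com for the full evaluation report today."
    if passes_heuristics(doc):
        print(redact_emails(doc))
```

Cheap rule-based checks like these are typically applied before model-based filtering so that expensive inference runs only on documents that survive the basic checks.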
4. Deduplication
Remove duplicate and near-duplicate content:
- Exact Deduplication: Remove identical documents 
- Fuzzy Deduplication: Remove near-duplicates based on textual similarity 
- Semantic Deduplication: Remove semantically similar content using embeddings 
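The difference between exact and fuzzy deduplication can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not NeMo Curator's implementation: exact matching is done by hashing document text, and fuzzy matching uses an all-pairs Jaccard similarity over word shingles, which is quadratic and only practical for small inputs. The function names and the `threshold` and `k` parameters are hypothetical. Semantic deduplication is not shown because it depends on an embedding model.

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document for each distinct content hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, k=3):
    """Return the set of k-word shingles (at least one, even for short text)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(docs, threshold=0.6, k=3):
    """Drop a document if it is too similar to one that was already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
    ]
    print(fuzzy_dedup(exact_dedup(corpus)))
```

Running exact deduplication first is the cheaper operation and shrinks the input that the more expensive similarity comparison has to handle.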
5. Synthetic Data Generation
Create high-quality synthetic content using LLMs:
- LLM-based Generation: Use large language models to create new content 
- Quality Control: Ensure synthetic data meets quality standards 
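The generate-then-check pattern described above might look like the following sketch. It makes no assumption about which LLM or client is used: `generate` is any callable that maps a prompt to text, and `echo_generator` is a hypothetical stand-in so that the example runs without a model. The quality check (a minimum word count) and the retry limit are likewise assumptions for illustration.

```python
def generate_with_quality_control(generate, prompts, min_words=5, max_attempts=3):
    """Call `generate` per prompt and keep the first output that passes the check.

    Prompts whose outputs fail the check on every attempt are dropped.
    """
    accepted = []
    for prompt in prompts:
        for _ in range(max_attempts):
            candidate = generate(prompt).strip()
            if len(candidate.split()) >= min_words:
                accepted.append({"prompt": prompt, "text": candidate})
                break
    return accepted


def echo_generator(prompt):
    """Hypothetical stand-in for a real LLM call; returns a templated string."""
    return f"Synthetic passage written in response to: {prompt}"


if __name__ == "__main__":
    for row in generate_with_quality_control(echo_generator, ["Explain deduplication"]):
        print(row["text"])
```

Keeping the generator behind a plain callable makes it easy to swap models, and the same filters used for collected data can serve as the quality check for synthetic data.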
6. Final Preparation
Prepare the curated dataset for training:
- Blending/Shuffling: Combine and randomize data sources 
- Format Standardization: Ensure consistent output format 
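As a rough illustration of blending and shuffling, the sketch below samples documents from several sources in proportion to given weights and then shuffles the combined result with a fixed seed. It is a standard-library example, not NeMo Curator's API: the function name, the `weights`, `total`, and `seed` parameters, and the output record fields are assumptions.

```python
import random


def blend_and_shuffle(sources, weights, total, seed=0):
    """Sample about `total` docs across sources by weight, then shuffle.

    `sources` maps a source name to a list of texts, and `weights` maps the
    same names to relative weights. A source with fewer documents than its
    target contributes everything it has.
    """
    rng = random.Random(seed)  # fixed seed keeps the blend reproducible
    weight_sum = sum(weights.values())
    blended = []
    for name, docs in sources.items():
        target = round(total * weights[name] / weight_sum)
        take = min(target, len(docs))
        blended.extend({"source": name, "text": t} for t in rng.sample(docs, take))
    rng.shuffle(blended)
    return blended


if __name__ == "__main__":
    data = {
        "web": [f"web doc {i}" for i in range(10)],
        "wiki": [f"wiki doc {i}" for i in range(10)],
    }
    for row in blend_and_shuffle(data, {"web": 3, "wiki": 1}, total=8):
        print(row)
```

Seeding the random generator makes the blend repeatable, which helps when comparing training runs on the same dataset.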
Infrastructure Foundation
The pipeline runs on a distributed, optionally GPU-accelerated infrastructure:
- Dask: Distributed computing framework for parallelization 
- RAPIDS: GPU-accelerated data processing (cuDF, cuGraph, cuML) 
- Flexible Deployment: Support for both CPU and GPU execution 
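The idea behind partitioned parallel processing can be shown with a standard-library analogy. The sketch below splits a dataset into partitions and applies a function to each partition concurrently using `concurrent.futures.ThreadPoolExecutor`. It deliberately does not use Dask or RAPIDS, and the helper names (`partition`, `clean_partition`, `map_partitions`) are hypothetical. A thread pool also will not speed up CPU-bound Python code, so this shows the structure of the computation only.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split a list into roughly equal contiguous partitions."""
    size = max(1, -(-len(items) // num_partitions))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def clean_partition(docs):
    """Example per-partition operation: collapse whitespace and lowercase."""
    return [" ".join(d.split()).lower() for d in docs]


def map_partitions(func, items, num_partitions=4, max_workers=4):
    """Apply `func` to each partition concurrently and flatten the results."""
    parts = partition(items, num_partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(func, parts))  # map preserves partition order
    return [doc for part in results for doc in part]


if __name__ == "__main__":
    print(map_partitions(clean_partition, ["  Hello   World ", "FOO\tbar", "a  b"]))
```

Because each partition is processed independently, the same per-partition function can in principle run on one machine or be distributed across many workers.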
Key Components
The pipeline leverages several core component types:
- Core concepts for loading and managing text datasets from local files 
- Components for downloading and extracting data from remote sources 
- Concepts for filtering, deduplication, and classification 
- Concepts for generating high-quality synthetic text 
Processing Modes
The pipeline supports different processing approaches:
GPU Acceleration: Leverage NVIDIA GPUs for:
- High-throughput data processing 
- ML model inference for classification 
- Embedding generation for semantic operations 
CPU Processing: Scale across multiple CPU cores for:
- Text parsing and cleaning 
- Rule-based filtering 
- Large-scale data transformations 
Hybrid Workflows: Combine CPU and GPU processing for optimal performance based on the specific operation.
Scalability & Deployment
The architecture scales from single machines to large clusters:
- Single Node: Process datasets on laptops or workstations 
- Multi-Node: Distribute processing across cluster resources 
- Cloud Native: Deploy on Kubernetes or cloud platforms 
- HPC Integration: Run on Slurm-managed supercomputing clusters 
For hands-on experience, see the Text Curation Getting Started Guide.