This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.
The following diagram provides a high-level outline of NeMo Curator’s text curation architecture:
NeMo Curator’s text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training:
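The idea of stages that feed one another can be sketched in plain Python. This is only an illustration of the staged-pipeline pattern, not NeMo Curator's API: the `run_pipeline` helper, the stage functions, and the `{"id", "text"}` record layout are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

# Hypothetical record layout for this sketch: one dict per document.
Document = Dict[str, str]
Stage = Callable[[List[Document]], List[Document]]


def run_pipeline(docs: List[Document], stages: List[Stage]) -> List[Document]:
    """Apply each stage in order; the output of one stage feeds the next."""
    for stage in stages:
        docs = stage(docs)
    return docs


def normalize_whitespace(docs: List[Document]) -> List[Document]:
    """Collapse runs of whitespace so later stages see canonical text."""
    return [{**d, "text": " ".join(d["text"].split())} for d in docs]


def drop_empty(docs: List[Document]) -> List[Document]:
    """Remove documents whose text is empty after normalization."""
    return [d for d in docs if d["text"]]


raw = [
    {"id": "a", "text": "  Hello   world "},
    {"id": "b", "text": "   "},
]
curated = run_pipeline(raw, [normalize_whitespace, drop_empty])
```

Keeping every stage behind the same list-in, list-out signature is what lets stages be reordered, removed, or swapped without touching the rest of the pipeline.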
Multiple input sources provide the foundation for text curation:
Raw data is downloaded, extracted, and converted into standardized formats:
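As a rough illustration of the extract-and-standardize step, the sketch below strips markup from an HTML page using only the Python standard library and emits a JSON Lines record. It assumes JSONL as the standardized format and a `{"url", "text"}` schema; both are assumptions for this example, as are the `TextExtractor` and `html_to_record` names. None of this is NeMo Curator code.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_record(url: str, html: str) -> dict:
    """Convert one raw HTML page into a flat record (hypothetical schema)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, "text": " ".join(parser.chunks)}


page = "<html><body><h1>Title</h1><script>var x=1;</script><p>Body text.</p></body></html>"
record = html_to_record("https://example.com/page", page)
jsonl_line = json.dumps(record, ensure_ascii=False)  # one line per document
```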
Multiple filtering stages ensure data quality:
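A minimal sketch of what chained quality filters can look like, assuming two simple heuristics: a word-count bound and a minimum share of alphabetic characters. The function names and thresholds are hypothetical values picked for the example, not NeMo Curator defaults.

```python
from typing import Callable, List


def word_count_ok(text: str, min_words: int = 5, max_words: int = 100_000) -> bool:
    """Reject documents that are too short or too long to be useful."""
    n = len(text.split())
    return min_words <= n <= max_words


def alpha_ratio_ok(text: str, min_ratio: float = 0.6) -> bool:
    """Reject documents dominated by digits, punctuation, or symbols."""
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    return sum(c.isalpha() for c in non_space) / len(non_space) >= min_ratio


# Filters run in order; a document must pass all of them to be kept.
FILTERS: List[Callable[[str], bool]] = [word_count_ok, alpha_ratio_ok]


def keep(text: str) -> bool:
    return all(f(text) for f in FILTERS)


docs = [
    "This is a clean sentence with enough words.",
    "$$$ ### 123 456 !!! ???",
    "too short",
]
kept = [d for d in docs if keep(d)]
```

Ordering cheap filters before expensive ones means most rejected documents never reach the costly checks, since `all` stops at the first failure.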
Remove duplicate and near-duplicate content:
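The difference between exact and near-duplicate removal can be shown with a small sketch: exact duplicates are caught by hashing normalized text, and near-duplicates by Jaccard similarity over word 3-grams. The function names, the shingle size, and the 0.7 threshold are assumptions for illustration only. The pairwise comparison here is quadratic in the number of documents; large-scale systems typically replace it with approximate schemes such as MinHash with locality-sensitive hashing.

```python
import hashlib
from typing import List, Set, Tuple


def exact_key(text: str) -> str:
    """Hash of case-folded, whitespace-normalized text for exact matching."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Set of overlapping word k-grams used to compare documents."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(texts: List[str], threshold: float = 0.7) -> List[str]:
    """Keep the first occurrence of each exact or near-duplicate group."""
    seen_hashes: Set[str] = set()
    kept: List[str] = []
    kept_shingles: List[Set[Tuple[str, ...]]] = []
    for text in texts:
        key = exact_key(text)
        if key in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a document already kept
        seen_hashes.add(key)
        kept.append(text)
        kept_shingles.append(sh)
    return kept


texts = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "The quick  brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river today",
    "a completely different sentence about data curation pipelines",
]
unique = deduplicate(texts)
```

In this example the second text is dropped as an exact duplicate once case and spacing are normalized, and the third is dropped as a near-duplicate because it differs from the first only in its final word.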
Prepare the curated dataset for training:
The entire pipeline runs on a robust, scalable infrastructure:
The pipeline leverages several core component types:
Core concepts for loading and managing text datasets from local files
Components for downloading and extracting data from remote sources
Concepts for filtering, deduplication, and classification
The pipeline supports different processing approaches:
GPU Acceleration: Leverage NVIDIA GPUs for:
CPU Processing: Scale across multiple CPU cores for:
Hybrid Workflows: Combine CPU and GPU processing for optimal performance based on the specific operation.
The architecture scales from single machines to large clusters:
For hands-on experience, refer to the Text Curation Getting Started Guide.