Text Data Curation Pipeline#
This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.
Architecture Overview#
The following diagram provides a high-level outline of NeMo Curator’s text curation architecture:
```mermaid
flowchart LR
A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
B --> C["Content Processing<br/>& Cleaning"]
C --> D["Quality Assessment<br/>& Filtering"]
D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
E --> F["Curated Dataset<br/>(JSONL/Parquet)"]
G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
G -.->|"Distributed Execution"| C
G -.->|"GPU Acceleration"| D
G -.->|"GPU Acceleration"| E
classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000
class A,B,C,D,E stage
class F output
class G infra
```
Pipeline Stages#
NeMo Curator’s text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training:
1. Data Sources#
Multiple input sources provide the foundation for text curation:
Cloud storage: Amazon S3, Azure
Local workstation: JSONL and Parquet files
Public corpora: Common Crawl, arXiv, Wikipedia
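As a point of reference, the snippet below shows what the local input formats look like when inspected with pandas. NeMo Curator provides its own readers; this sketch only illustrates the file formats, and the file names are placeholders.

```python
# Minimal sketch (not NeMo Curator's reader API): inspecting local inputs.
# File names are placeholders.
import pandas as pd

jsonl_df = pd.read_json("documents.jsonl", lines=True)  # one JSON object per line
parquet_df = pd.read_parquet("documents.parquet")       # columnar binary format

print(jsonl_df.columns.tolist())
print(len(parquet_df))
```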
2. Data Acquisition & Processing#
Raw data is downloaded, extracted, and converted into standardized formats:
Download & Extraction: Retrieve and process remote data sources
Cleaning & Pre-processing: Convert formats and normalize text
DocumentBatch Creation: Standardize data into NeMo Curator’s core data structure
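The sketch below illustrates the download, extract, and standardize flow for a gzipped JSONL source. The function names and record fields (`url`, `raw_content`) are hypothetical; the last step only mirrors the idea of producing a uniform DocumentBatch-style schema, not NeMo Curator's actual API.

```python
# Illustrative acquire -> extract -> standardize flow. Function names and the
# record fields ("url", "raw_content") are hypothetical, not NeMo Curator's API.
import gzip
import json
import urllib.request


def download(url: str, local_path: str) -> str:
    """Retrieve a remote archive to local disk."""
    urllib.request.urlretrieve(url, local_path)
    return local_path


def extract_records(local_path: str):
    """Yield raw records from a gzipped JSONL archive."""
    with gzip.open(local_path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def standardize(records):
    """Normalize raw records into a uniform id/text schema (DocumentBatch-like)."""
    return [
        {"id": rec.get("url", ""), "text": rec.get("raw_content", "").strip()}
        for rec in records
    ]
```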
3. Quality Assessment & Filtering#
Multiple filtering stages ensure data quality:
Heuristic Quality Filtering: Rule-based filters for basic quality checks
Model-based Quality Filtering: Classification models trained to distinguish high-quality from low-quality text
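To make the heuristic stage concrete, here is a self-contained rule-based filter. The rules and thresholds are illustrative, not NeMo Curator's built-in filters.

```python
# Sketch of a rule-based quality filter; rules and thresholds are illustrative.
def passes_heuristics(text: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    words = text.split()
    if len(words) < min_words:  # too short to be useful training text
        return False
    symbol_count = sum(text.count(sym) for sym in ("#", "...", "…"))
    if symbol_count / max(len(words), 1) > max_symbol_ratio:  # boilerplate-heavy
        return False
    return True


documents = [{"id": "doc-1", "text": "example text " * 60}]  # placeholder input
kept = [doc for doc in documents if passes_heuristics(doc["text"])]
```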
4. Deduplication#
Remove duplicate and near-duplicate content:
Exact Deduplication: Remove identical documents using MD5 hashing
Fuzzy Deduplication: Remove near-duplicates using MinHash and LSH similarity
Semantic Deduplication: Remove semantically similar content using embeddings
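The sketch below shows the idea behind the first two approaches: MD5 digests catch byte-identical documents, and MinHash signatures with locality-sensitive hashing catch near-duplicates. Semantic deduplication follows a similar pattern but compares document embeddings rather than hashes. The `datasketch` library is used purely for illustration; NeMo Curator runs these stages at scale on Ray and GPUs.

```python
# Exact + fuzzy deduplication sketch. `datasketch` is used only to illustrate
# MinHash/LSH; NeMo Curator implements these stages at scale.
import hashlib
from datasketch import MinHash, MinHashLSH

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",        # exact duplicate of "a"
    "c": "the quick brown fox jumps over the lazy dog today",  # near-duplicate
}

# Exact deduplication: identical documents share the same MD5 digest.
seen_digests, unique_docs = set(), {}
for doc_id, text in docs.items():
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    if digest not in seen_digests:
        seen_digests.add(digest)
        unique_docs[doc_id] = text

# Fuzzy deduplication: MinHash signatures + LSH group near-duplicates.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
deduplicated = {}
for doc_id, text in unique_docs.items():
    mh = MinHash(num_perm=128)
    for token in text.split():
        mh.update(token.encode("utf-8"))
    if not lsh.query(mh):  # nothing similar kept yet, so keep this document
        lsh.insert(doc_id, mh)
        deduplicated[doc_id] = text
```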
5. Final Preparation#
Prepare the curated dataset for training:
Format Standardization: Ensure a consistent output format (JSONL or Parquet) ready for downstream training
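A minimal sketch of this step, using pandas to emit the two output formats shown in the diagram; the file and column names are placeholders.

```python
# Sketch: writing curated records to the JSONL/Parquet output formats.
# File and column names are placeholders.
import pandas as pd

curated = pd.DataFrame([{"id": "doc-1", "text": "curated example text"}])
curated.to_json("curated.jsonl", orient="records", lines=True, force_ascii=False)
curated.to_parquet("curated.parquet", index=False)
```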
Infrastructure Foundation#
The entire pipeline runs on a robust, scalable infrastructure:
Ray: Distributed computing framework for parallelization
RAPIDS: GPU-accelerated data processing (cuDF, cuGraph, cuML)
Flexible Deployment: Runs on CPU-only hardware or with GPU acceleration
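The snippet below is generic Ray usage, included only to show how a per-document operation can be fanned out across workers; it is not NeMo Curator's internal scheduling code.

```python
# Generic Ray sketch: fanning a cleaning function out across workers.
import ray

ray.init()  # starts a local Ray instance using the machine's cores


@ray.remote
def clean(text: str) -> str:
    """Collapse whitespace as a stand-in for a real cleaning step."""
    return " ".join(text.split())


texts = ["  hello   world ", "another    document"]
cleaned = ray.get([clean.remote(t) for t in texts])
print(cleaned)  # ['hello world', 'another document']
```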
Key Components#
The pipeline leverages several core component types:
Data loading: Core concepts for loading and managing text datasets from local files
Data acquisition: Components for downloading and extracting data from remote sources
Data processing: Concepts for filtering, deduplication, and classification
Processing Modes#
The pipeline supports different processing approaches:
GPU Acceleration: Leverage NVIDIA GPUs for:
High-throughput data processing
ML model inference for classification
Embedding generation for semantic operations
CPU Processing: Scale across multiple CPU cores for:
Text parsing and cleaning
Rule-based filtering
Large-scale data transformations
Hybrid Workflows: Combine CPU and GPU processing, choosing the backend that best fits each operation.
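As a sketch of the hybrid idea, the helper below prefers a RAPIDS cuDF backend when the GPU stack is available and falls back to pandas otherwise; the function name and fallback policy are illustrative.

```python
# Illustrative hybrid CPU/GPU loader: prefer cuDF (RAPIDS) when available,
# otherwise fall back to pandas. The fallback policy is a simplification.
def load_table(path: str, use_gpu: bool = True):
    if use_gpu:
        try:
            import cudf  # RAPIDS GPU DataFrame library
            return cudf.read_parquet(path)
        except ImportError:
            pass  # no GPU stack installed; fall back to CPU
    import pandas as pd
    return pd.read_parquet(path)


df = load_table("documents.parquet", use_gpu=True)
```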
Scalability & Deployment#
The architecture scales from single machines to large clusters:
Single Node: Process datasets on laptops or workstations
Multi-Node: Distribute processing across cluster resources
Cloud Native: Deploy on cloud platforms
HPC Integration: Run on high-performance computing (HPC) clusters
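In practice, the same pipeline code can target any of these environments by changing how the Ray runtime is initialized. The sketch below is generic Ray, and the reported resource numbers are illustrative.

```python
# Generic Ray initialization for different deployment targets.
import ray

# Single node: start a local Ray instance.
# ray.init()

# Multi-node, cloud, or HPC: attach to an existing Ray cluster's head node.
ray.init(address="auto")
print(ray.cluster_resources())  # e.g. {'CPU': 256.0, 'GPU': 16.0, ...}
```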
For hands-on experience, refer to the Text Curation Getting Started Guide.