---
description: >-
  Comprehensive overview of NeMo Curator's text curation pipeline
  architecture including data acquisition and processing
categories:
  - concepts-architecture
tags:
  - pipeline
  - architecture
  - text-curation
  - distributed
  - gpu-accelerated
  - overview
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: concept
modality: text-only
---

# Text Data Curation Pipeline

This guide provides a comprehensive overview of NeMo Curator's text curation pipeline architecture, from data acquisition through final dataset preparation.

## Architecture Overview

The following diagram provides a high-level outline of NeMo Curator's text curation architecture:

```mermaid
flowchart LR
    A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
    B --> C["Content Processing<br/>& Cleaning"]
    C --> D["Quality Assessment<br/>& Filtering"]
    D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
    E --> F["Curated Dataset<br/>(JSONL/Parquet)"]
    G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
    G -.->|"Distributed Execution"| C
    G -.->|"GPU Acceleration"| D
    G -.->|"GPU Acceleration"| E

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D,E stage
    class F output
    class G infra
```

## Pipeline Stages

NeMo Curator's text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training:

### 1. Data Sources

Multiple input sources provide the foundation for text curation:

* **Cloud storage**: Amazon S3, Azure
* **Local workstation**: JSONL, Parquet

### 2. Data Acquisition & Processing

Raw data is downloaded, extracted, and converted into standardized formats:

* **Download & Extraction**: Retrieve and process remote data sources
* **Cleaning & Pre-processing**: Convert formats and normalize text
* **DocumentBatch Creation**: Standardize data into NeMo Curator's core data structure

### 3. Quality Assessment & Filtering

Multiple filtering stages ensure data quality:

* **Heuristic Quality Filtering**: Rule-based filters for basic quality checks
* **Model-based Quality Filtering**: Classification models trained to separate high-quality from low-quality text

### 4. Deduplication

Remove duplicate and near-duplicate content:

* **Exact Deduplication**: Remove identical documents using MD5 hashing
* **Fuzzy Deduplication**: Remove near-duplicates using MinHash and LSH similarity
* **Semantic Deduplication**: Remove semantically similar content using embeddings

### 5. Final Preparation

Prepare the curated dataset for training:

* **Format Standardization**: Ensure consistent output format

## Infrastructure Foundation

The entire pipeline runs on a robust, scalable infrastructure:

* **Ray**: Distributed computing framework for parallelization
* **RAPIDS**: GPU-accelerated data processing (cuDF, cuGraph, cuML)
* **Flexible Deployment**: CPU and GPU acceleration support

## Key Components

The pipeline leverages several core component types:

* Core concepts for loading and managing text datasets from local files
* Components for downloading and extracting data from remote sources
* Concepts for filtering, deduplication, and classification

## Processing Modes

The pipeline supports different processing approaches:

**GPU Acceleration**: Leverage NVIDIA GPUs for:

* High-throughput data processing
* ML model inference for classification
* Embedding generation for semantic operations

**CPU Processing**: Scale across multiple CPU cores for:

* Text parsing and cleaning
* Rule-based filtering
* Large-scale data transformations

**Hybrid Workflows**: Combine CPU and GPU processing, choosing the best fit for each operation.

## Scalability & Deployment

The architecture scales from single machines to large clusters:

* **Single Node**: Process datasets on laptops or workstations
* **Multi-Node**: Distribute processing across cluster resources
* **Cloud Native**: Deploy on cloud platforms
* **HPC Integration**: Run on high-performance computing clusters

---

For hands-on experience, refer to the [Text Curation Getting Started Guide](/get-started/text).