---
description: >-
  Comprehensive overview of NeMo Curator's text curation pipeline architecture
  including data acquisition and processing
categories:
  - concepts-architecture
tags:
  - pipeline
  - architecture
  - text-curation
  - distributed
  - gpu-accelerated
  - overview
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: concept
modality: text-only
---

# Text Data Curation Pipeline

This guide provides a comprehensive overview of NeMo Curator's text curation pipeline architecture, from data acquisition through final dataset preparation.

## Architecture Overview

The following diagram provides a high-level outline of NeMo Curator's text curation architecture:

```mermaid
flowchart LR
    A["Data Sources<br />(Cloud, Local,<br />Common Crawl, arXiv,<br />Wikipedia)"] --> B["Data Acquisition<br />& Loading"]
    B --> C["Content Processing<br />& Cleaning"]
    C --> D["Quality Assessment<br />& Filtering"]
    D --> E["Deduplication<br />(Exact, Fuzzy,<br />Semantic)"]
    E --> F["Curated Dataset<br />(JSONL/Parquet)"]
    
    G["Ray + RAPIDS<br />(GPU-accelerated)"] -.->|"Distributed Execution"| B
    G -.->|"Distributed Execution"| C
    G -.->|"GPU Acceleration"| D
    G -.->|"GPU Acceleration"| E

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D,E stage
    class F output
    class G infra
```

## Pipeline Stages

NeMo Curator's text curation pipeline consists of several key stages that work together to transform raw data sources into high-quality datasets ready for LLM training:

### 1. Data Sources

Multiple input sources provide the foundation for text curation:

* **Cloud storage**: Amazon S3, Azure Blob Storage
* **Local workstation**: JSONL and Parquet files (see the loading sketch below)
* **Public corpora**: Common Crawl, arXiv, Wikipedia
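
As a point of reference, a local JSONL shard can be inspected with plain pandas. This is an illustrative sketch, not the NeMo Curator loading API, and the file path and column names are assumptions:

```python
# Minimal sketch: peek at a local JSONL shard with pandas.
# Illustrative only; NeMo Curator ships its own readers for JSONL and Parquet.
import pandas as pd

df = pd.read_json("data/shard_000.jsonl", lines=True)  # assumed example path
print(df.columns.tolist())   # typically includes a "text" field plus metadata
print(len(df), "documents in this shard")
```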

### 2. Data Acquisition & Processing

Raw data is downloaded, extracted, and converted into standardized formats:

* **Download & Extraction**: Retrieve and process remote data sources
* **Cleaning & Pre-processing**: Convert formats and normalize text
* **DocumentBatch Creation**: Standardize data into NeMo Curator's core data structure (a conceptual sketch follows this list)
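
A conceptual sketch of this stage in plain Python might look like the following. The function names, the line-oriented archive format, and the record fields are illustrative assumptions rather than NeMo Curator's actual download-and-extract API:

```python
# Conceptual acquisition sketch: download, extract, standardize into records.
# Names and record fields are illustrative, not the NeMo Curator API.
import gzip
import json
import urllib.request


def download(url: str, local_path: str) -> str:
    """Fetch a remote archive (for example, a Common Crawl shard) to disk."""
    urllib.request.urlretrieve(url, local_path)
    return local_path


def extract_records(local_path: str):
    """Decompress the archive and yield one standardized record per document."""
    with gzip.open(local_path, "rt", encoding="utf-8", errors="ignore") as f:
        for i, line in enumerate(f):
            text = line.strip()
            if text:
                yield {"id": f"doc-{i}", "text": text}


def write_jsonl(records, out_path: str) -> None:
    """Persist standardized records as JSONL for the downstream stages."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```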

### 3. Quality Assessment & Filtering

Multiple filtering stages ensure data quality:

* **Heuristic Quality Filtering**: Rule-based filters for basic quality checks (sketched below)
* **Model-based Quality Filtering**: Classification models trained to distinguish high-quality from low-quality text
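
The heuristic stage can be pictured as a set of simple predicates applied to every document. The thresholds below are illustrative assumptions, not NeMo Curator's built-in filter defaults:

```python
# Sketch of a rule-based (heuristic) quality filter.
# Thresholds are illustrative assumptions, not NeMo Curator defaults.
def passes_heuristics(text: str) -> bool:
    words = text.split()
    if not 50 <= len(words) <= 100_000:                   # reject very short/long docs
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len > 15:                                # implausible mean word length
        return False
    symbol_ratio = sum(text.count(c) for c in "#{}<>") / len(words)
    return symbol_ratio < 0.1                             # limit markup/code artifacts


docs = [{"id": "a", "text": "word " * 200}, {"id": "b", "text": "## " * 10}]
kept = [d for d in docs if passes_heuristics(d["text"])]
print([d["id"] for d in kept])                            # -> ['a']
```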

### 4. Deduplication

Remove duplicate and near-duplicate content:

* **Exact Deduplication**: Remove identical documents using MD5 hashing (see the sketch after this list)
* **Fuzzy Deduplication**: Remove near-duplicates using MinHash and LSH similarity
* **Semantic Deduplication**: Remove semantically similar content using embeddings
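
Exact deduplication is conceptually the simplest of the three: hash each document and keep the first occurrence of every hash. The sketch below uses MD5 from the standard library; the normalization step is an assumption. Fuzzy and semantic deduplication follow the same keep/drop pattern but compare MinHash signatures or embedding similarity instead of exact hashes:

```python
# Sketch of exact deduplication via MD5 hashing.
# The normalization choices (lowercasing, whitespace collapsing) are assumptions.
import hashlib


def md5_key(text: str) -> str:
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def exact_dedup(docs):
    seen, unique = set(), []
    for doc in docs:
        key = md5_key(doc["text"])
        if key not in seen:            # keep only the first document with each hash
            seen.add(key)
            unique.append(doc)
    return unique


docs = [
    {"id": 1, "text": "The quick brown fox"},
    {"id": 2, "text": "the quick  brown fox"},   # identical after normalization
    {"id": 3, "text": "A different document"},
]
print([d["id"] for d in exact_dedup(docs)])       # -> [1, 3]
```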

### 5. Final Preparation

Prepare the curated dataset for training:

* **Format Standardization**: Ensure a consistent output format (JSONL or Parquet), as in the sketch below
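
As a small illustration of this step, the curated records can be written to the JSONL or Parquet formats shown in the diagram. pandas is used here for brevity, and the column names are assumptions; Parquet output additionally requires pyarrow or fastparquet:

```python
# Sketch: write curated documents to the output formats named in the diagram.
# pandas is used for illustration; Parquet output needs pyarrow or fastparquet installed.
import pandas as pd

curated = pd.DataFrame([{"id": 1, "text": "a clean, deduplicated document"}])
curated.to_json("curated.jsonl", orient="records", lines=True)
curated.to_parquet("curated.parquet", index=False)
```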

## Infrastructure Foundation

The entire pipeline runs on a robust, scalable infrastructure:

* **Ray**: Distributed computing framework for parallelization
* **RAPIDS**: GPU-accelerated data processing (cuDF, cuGraph, cuML)
* **Flexible Deployment**: Runs on CPU-only or GPU-accelerated hardware (a minimal Ray sketch follows)
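
As a minimal illustration of the execution model (not NeMo Curator's internal pipeline code), Ray lets a per-shard function fan out across whatever workers are available:

```python
# Minimal Ray sketch: fan a per-shard cleaning function out across workers.
# Illustrates the execution model only, not NeMo Curator's internal pipeline code.
import ray

ray.init()  # starts a local Ray instance, or connects to a configured cluster


@ray.remote
def clean_shard(shard):
    """Placeholder per-shard work; real stages do parsing, filtering, and more."""
    return [text.strip().lower() for text in shard]


shards = [["  Hello World  "], ["ANOTHER Doc"]]
results = ray.get([clean_shard.remote(s) for s in shards])
print(results)   # -> [['hello world'], ['another doc']]
```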

## Key Components

The pipeline leverages several core component types:

<Cards>
  <Card title="Data Loading" href="/about/concepts/text/data/loading">
    Core concepts for loading and managing text datasets from local files
  </Card>

  <Card title="Data Acquisition" href="/about/concepts/text/data/acquisition">
    Components for downloading and extracting data from remote sources
  </Card>

  <Card title="Data Processing" href="/about/concepts/text/data/processing">
    Concepts for filtering, deduplication, and classification
  </Card>
</Cards>

## Processing Modes

The pipeline supports different processing approaches:

**GPU Acceleration**: Leverage NVIDIA GPUs for:

* High-throughput data processing
* ML model inference for classification
* Embedding generation for semantic operations

**CPU Processing**: Scale across multiple CPU cores for:

* Text parsing and cleaning
* Rule-based filtering
* Large-scale data transformations

**Hybrid Workflows**: Combine CPU and GPU processing for optimal performance based on the specific operation.
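
A common hybrid pattern is to run the same DataFrame code on the GPU via cuDF when RAPIDS is installed and fall back to pandas otherwise. The sketch below is illustrative; the file path and the "text" column name are assumptions:

```python
# Hybrid CPU/GPU sketch: prefer cuDF (RAPIDS) if available, else fall back to pandas.
# The file path and the "text" column name are assumptions for illustration.
try:
    import cudf as xdf        # GPU-accelerated DataFrame library from RAPIDS
except ImportError:
    import pandas as xdf      # CPU fallback; the calls below exist in both libraries

df = xdf.read_parquet("curated.parquet")
df["n_chars"] = df["text"].str.len()      # a simple per-document statistic
df = df[df["n_chars"] > 0]                # drop empty documents
print(len(df), "documents retained")
```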

## Scalability & Deployment

The architecture scales from single machines to large clusters:

* **Single Node**: Process datasets on laptops or workstations
* **Multi-Node**: Distribute processing across cluster resources
* **Cloud Native**: Deploy on cloud platforms
* **HPC Integration**: Run on high-performance computing (HPC) clusters

***

For hands-on experience, refer to the [Text Curation Getting Started Guide](/get-started/text).
