Copyright © 2026, NVIDIA Corporation.

Text Data Curation Pipeline

This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.

Architecture Overview

The following diagram provides a high-level outline of NeMo Curator’s text curation architecture:

Pipeline Stages

NeMo Curator’s text curation pipeline consists of five stages that together transform raw data sources into high-quality datasets ready for LLM training:

1. Data Sources

Multiple input sources provide the foundation for text curation:

  • Cloud storage: Amazon S3, Azure
  • Local workstation: JSONL, Parquet
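
As a rough illustration of the local JSONL case, the sketch below reads a JSONL file with the Python standard library only. It does not use NeMo Curator's own readers; the `read_jsonl` helper, the `id` and `text` field names, and the sample records are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical sample data, written to a temporary file for the demo.
sample = [
    {"id": "doc-0", "text": "Hello world."},
    {"id": "doc-1", "text": "Second document."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_text(
        "\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8"
    )
    records = list(read_jsonl(path))
```

Each line of a JSONL file is an independent JSON object, which is what lets large files be split and read in parallel.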

2. Data Acquisition & Processing

Raw data is downloaded, extracted, and converted into standardized formats:

  • Download & Extraction: Retrieve and process remote data sources
  • Cleaning & Pre-processing: Convert formats and normalize text
  • DocumentBatch Creation: Standardize data into NeMo Curator’s core data structure
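
The cleaning and batching steps above can be sketched in plain Python. This is a minimal stand-in, not NeMo Curator's implementation: `normalize_text` and `make_batches` are hypothetical helpers, and a plain list of dicts stands in for the library's `DocumentBatch` structure, whose actual interface is not shown on this page.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def make_batches(records, batch_size):
    """Split a list of records into fixed-size batches (the last may be smaller)."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Hypothetical raw records with messy whitespace.
raw = [
    {"id": f"doc-{i}", "text": f"  Example\n\ttext   number {i} "}
    for i in range(5)
]

cleaned = [{**r, "text": normalize_text(r["text"])} for r in raw]
batches = list(make_batches(cleaned, batch_size=2))
```

Grouping documents into batches lets later stages operate on many documents at once rather than one at a time.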

3. Quality Assessment & Filtering

Multiple filtering stages ensure data quality:

  • Heuristic Quality Filtering: Rule-based filters for basic quality checks
  • Model-based Quality Filtering: Classification models trained to identify high vs. low quality text
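
To make the idea of a rule-based filter concrete, here is a small sketch of two heuristic checks: a word-count bound and a cap on the share of symbol characters. The function names and thresholds are illustrative assumptions, not NeMo Curator's built-in filters or defaults.

```python
def word_count_ok(text, min_words=5, max_words=100_000):
    """Keep documents whose whitespace-separated word count is within bounds."""
    n = len(text.split())
    return min_words <= n <= max_words


def symbol_ratio_ok(text, max_ratio=0.3):
    """Keep documents where at most max_ratio of characters are symbols."""
    if not text:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_ratio


def passes_heuristics(text):
    """A document is kept only if every heuristic check passes."""
    return word_count_ok(text) and symbol_ratio_ok(text)


# Hypothetical documents for the demo.
docs = [
    {"id": "good", "text": "The quick brown fox jumps over the lazy dog."},
    {"id": "short", "text": "Too short."},
    {"id": "noisy", "text": "$$$ !!! ### @@@ %%% ^^^"},
]

kept = [d for d in docs if passes_heuristics(d["text"])]
```

Model-based filtering follows the same keep-or-drop pattern, with a classifier score replacing the hand-written rule.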

4. Deduplication

Remove duplicate and near-duplicate content:

  • Exact Deduplication: Remove identical documents using MD5 hashing
  • Fuzzy Deduplication: Remove near-duplicates using MinHash signatures and locality-sensitive hashing (LSH)
  • Semantic Deduplication: Remove semantically similar content using embeddings
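
The first two techniques can be sketched with the standard library. The code below is a CPU-only illustration under stated assumptions, not NeMo Curator's GPU implementation: all helper names, the word 3-shingles, the 128 salted BLAKE2b hash functions, and the 32-band LSH layout are choices made for this example. Semantic deduplication needs an embedding model and is not covered.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def exact_dedup(docs):
    """Keep the first document for each distinct MD5 digest of the text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=128, k=3):
    """For each salted hash function, record the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Documents sharing an identical band of the signature become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical documents: "a2" duplicates "a"; "b" differs from "a" by one word.
docs = [
    {"id": "a", "text": "the quick brown fox jumps over the lazy dog near the river bank"},
    {"id": "a2", "text": "the quick brown fox jumps over the lazy dog near the river bank"},
    {"id": "b", "text": "the quick brown fox jumps over the lazy dog near the river shore"},
    {"id": "c", "text": "completely unrelated text about distributed gpu computing frameworks"},
]

unique_docs = exact_dedup(docs)
signatures = {d["id"]: minhash_signature(d["text"]) for d in unique_docs}
candidates = lsh_candidate_pairs(signatures)
```

LSH banding avoids comparing every pair of documents: only documents that collide in at least one band are checked further, for example with `estimated_jaccard` against a similarity threshold.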

5. Final Preparation

Prepare the curated dataset for training:

  • Format Standardization: Ensure consistent output format
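
One way to picture format standardization is to project every record onto a fixed schema before writing it out. The sketch below assumes a hypothetical three-field schema (`id`, `text`, `source`) and JSONL output; neither is prescribed by NeMo Curator.

```python
import json

# Hypothetical output schema: every record gets exactly these fields, in this order.
OUTPUT_FIELDS = ("id", "text", "source")


def standardize(record):
    """Drop extra keys and fill missing ones with an empty string."""
    return {field: record.get(field, "") for field in OUTPUT_FIELDS}


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "".join(
        json.dumps(standardize(r), ensure_ascii=False) + "\n" for r in records
    )


# Hypothetical curated records with inconsistent keys.
curated = [
    {"text": "First document.", "id": "doc-0", "score": 0.9},
    {"id": "doc-1", "text": "Second document.", "source": "local"},
]

output = to_jsonl(curated)
```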

Infrastructure Foundation

The entire pipeline runs on the following infrastructure:

  • Ray: Distributed computing framework for parallelization
  • RAPIDS: GPU-accelerated data processing (cuDF, cuGraph, cuML)
  • Flexible Deployment: CPU and GPU acceleration support

Key Components

The pipeline is built from three core component types:

  • Data Loading: Core concepts for loading and managing text datasets from local files
  • Data Acquisition: Components for downloading and extracting data from remote sources
  • Data Processing: Concepts for filtering, deduplication, and classification

Processing Modes

The pipeline supports different processing approaches:

GPU Acceleration: Leverage NVIDIA GPUs for:

  • High-throughput data processing
  • ML model inference for classification
  • Embedding generation for semantic operations

CPU Processing: Scale across multiple CPU cores for:

  • Text parsing and cleaning
  • Rule-based filtering
  • Large-scale data transformations

Hybrid Workflows: Combine CPU and GPU processing for optimal performance based on the specific operation.

Scalability & Deployment

The architecture scales from single machines to large clusters:

  • Single Node: Process datasets on laptops or workstations
  • Multi-Node: Distribute processing across cluster resources
  • Cloud Native: Deploy on cloud platforms
  • HPC Integration: Run on high-performance computing (HPC) clusters

For hands-on experience, refer to the Text Curation Getting Started Guide.