About Text Curation#

NeMo Curator provides comprehensive text curation capabilities to prepare high-quality data for large language model (LLM) training. The toolkit includes a collection of processors for loading, filtering, formatting, and analyzing text data from various sources, composed into end-to-end workflows through a pipeline-based architecture.

Use Cases#

  • Clean and prepare web-scraped data from sources like Common Crawl, Wikipedia, and arXiv

  • Create custom text curation pipelines for specific domain needs

  • Scale text processing across CPU and GPU clusters efficiently

Architecture#

The following diagram provides a high-level outline of NeMo Curator’s text curation architecture.

    flowchart LR
    A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
    B --> C["Content Processing<br/>& Cleaning"]
    C --> D["Quality Assessment<br/>& Filtering"]
    D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
    E --> F["Curated Dataset<br/>(JSONL/Parquet)"]
    
    G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
    G -.->|"Distributed Execution"| C
    G -.->|"GPU Acceleration"| D
    G -.->|"GPU Acceleration"| E

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D,E stage
    class F output
    class G infra
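
To make the staged flow above concrete, the sketch below expresses the same read → clean → filter → deduplicate sequence in plain Python. It only illustrates the pipeline pattern and is not NeMo Curator’s actual API: every name in it (the stage functions and run_pipeline) is a hypothetical stand-in for the real stages documented in the pages below.

    import hashlib

    # Hypothetical, simplified stand-ins for Curator-style pipeline stages.
    # Each stage maps a list of documents (dicts with a "text" field) to a
    # transformed list; a pipeline is just stages applied in order.

    def clean(docs):
        """Content processing: normalize whitespace in each document."""
        return [{**d, "text": " ".join(d["text"].split())} for d in docs]

    def quality_filter(docs, min_words=5):
        """Quality assessment: drop documents with too few words."""
        return [d for d in docs if len(d["text"].split()) >= min_words]

    def exact_dedup(docs):
        """Exact deduplication: keep one document per unique text hash."""
        seen, kept = set(), []
        for d in docs:
            digest = hashlib.md5(d["text"].encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(d)
        return kept

    def run_pipeline(docs, stages):
        """Apply each stage in sequence, as in the flowchart above."""
        for stage in stages:
            docs = stage(docs)
        return docs

    docs = [
        {"text": "NeMo   Curator prepares text data for LLM training."},
        {"text": "NeMo Curator prepares text data for LLM training."},
        {"text": "too short"},
    ]
    curated = run_pipeline(docs, [clean, quality_filter, exact_dedup])
    print(len(curated))  # 1: one duplicate and one low-quality doc removed

In the toolkit itself, the analogous stages run distributed over Ray, with RAPIDS providing GPU acceleration for filtering and deduplication; the toy above only shows how stages compose.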
    

Introduction#

Master the fundamentals of NeMo Curator and set up your text processing environment.

  • Text Curation Concepts: Learn about the pipeline architecture and core processing stages for efficient text curation.

  • Get Started with Text Curation: Review prerequisites, setup instructions, and initial configuration for text curation.

Curation Tasks#

Download Data#

Download text data from remote sources and import existing datasets into NeMo Curator’s processing pipeline.

  • Read Existing Data: Read existing JSONL and Parquet datasets using Curator’s reader stages (a minimal reading sketch follows this list).

  • arXiv: Download and extract scientific papers from arXiv.

  • Common Crawl: Download and extract web archive data from Common Crawl.

  • Wikipedia: Download and extract articles from Wikipedia dumps.

  • Custom Data Loading: Implement a download-and-extract pipeline for a custom data source.
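
Curator’s readers consume JSONL files with one JSON object per line, typically carrying a "text" field plus metadata. As a rough illustration of that record shape, the sketch below inspects such a file with only the standard library; the path and field names are assumptions for the example, and the Read Existing Data page documents the actual reader stages.

    import json

    # Assumed path; Curator-style JSONL datasets store one JSON document
    # per line, typically with a "text" field plus optional metadata.
    path = "data/example.jsonl"

    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # The field names here ("text", "url") are illustrative.
            print(len(doc.get("text", "")), doc.get("url", "<no url>"))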

Process Data#

Transform and enhance your text data through comprehensive processing and curation steps.

  • Language Management: Handle multilingual content and language-specific processing.

  • Content Processing & Cleaning: Clean, normalize, and transform text content.

  • Deduplication: Remove duplicate and near-duplicate documents efficiently.

  • Quality Assessment & Filtering: Score and remove low-quality content (a small scoring sketch follows this list).

  • Specialized Processing: Domain-specific processing for code and advanced curation tasks.
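
To give a flavor of what quality scoring looks like, the sketch below implements a toy heuristic scorer in plain Python. The heuristics and threshold are illustrative assumptions, not Curator’s built-in filters; the Quality Assessment & Filtering page documents the real heuristic- and classifier-based stages.

    def quality_score(text: str) -> float:
        """Toy heuristic score in [0, 1]: favor longer documents made
        mostly of alphabetic prose rather than symbols or digits."""
        words = text.split()
        if not words:
            return 0.0
        length_score = min(len(words) / 100.0, 1.0)
        alpha_chars = sum(c.isalpha() or c.isspace() for c in text)
        return length_score * (alpha_chars / len(text))

    docs = [
        "A well formed paragraph of plain prose " * 10,
        "$$$ ### !!! 123 @@@",
    ]
    THRESHOLD = 0.3  # assumed cutoff; tune per dataset
    kept = [d for d in docs if quality_score(d) >= THRESHOLD]
    print(len(kept))  # 1: the symbol-heavy document is dropped

Production pipelines typically combine several such signals (language identification, repetition statistics, learned classifiers) rather than a single heuristic, and run them at scale across CPU and GPU clusters.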