> Text curation capabilities for preparing high-quality large language model training data, covering loading, filtering, and quality assessment.

# About Text Curation

NeMo Curator provides comprehensive text curation capabilities to prepare high-quality data for large language model (LLM) training. The toolkit includes a collection of processors for loading, filtering, formatting, and analyzing text data from various sources using a [pipeline-based architecture](/about/concepts/text/data/data-curation-pipeline).

## Use Cases

* Clean and prepare web-scraped data from sources like Common Crawl, Wikipedia, and arXiv
* Create custom text curation pipelines for specific domain needs
* Scale text processing across CPU and GPU clusters efficiently

## Architecture

The following diagram provides a high-level outline of NeMo Curator's text curation architecture.

```mermaid
flowchart LR
    A["Data Sources<br />(Cloud, Local,<br />Common Crawl, arXiv,<br />Wikipedia)"] --> B["Data Acquisition<br />& Loading"]
    B --> C["Content Processing<br />& Cleaning"]
    C --> D["Quality Assessment<br />& Filtering"]
    D --> E["Deduplication<br />(Exact, Fuzzy,<br />Semantic)"]
    E --> F["Curated Dataset<br />(JSONL/Parquet)"]
    
    G["Ray + RAPIDS<br />(GPU-accelerated)"] -.->|"Distributed Execution"| B
    G -.->|"Distributed Execution"| C
    G -.->|"GPU Acceleration"| D
    G -.->|"GPU Acceleration"| E

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D,E stage
    class F output
    class G infra
```
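
To make the flow in the diagram concrete, here is a minimal sketch in plain Python that chains a cleaning stage, a quality filter, and exact deduplication, then serializes the result as JSONL. It does not use NeMo Curator's API: the function names, the `text` field, and the `min_words` threshold are all assumptions chosen for illustration, and it runs on a single process rather than on Ray or GPUs.

```python
import hashlib
import json
import re
from typing import Dict, Iterable, Iterator

Doc = Dict[str, str]


def clean(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Content processing: collapse runs of whitespace and trim the ends."""
    for doc in docs:
        yield {**doc, "text": re.sub(r"\s+", " ", doc["text"]).strip()}


def quality_filter(docs: Iterable[Doc], min_words: int = 3) -> Iterator[Doc]:
    """Quality filtering: keep documents with at least `min_words` words."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Exact deduplication: keep the first document seen for each text hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc]) -> str:
    """Chain the stages in diagram order and serialize the result as JSONL."""
    curated = exact_dedup(quality_filter(clean(docs)))
    return "\n".join(json.dumps(doc) for doc in curated)


# Hypothetical input: "a" and "b" differ only in whitespace, "c" is too short.
raw_docs = [
    {"id": "a", "text": "NeMo  Curator prepares   text data."},
    {"id": "b", "text": "NeMo Curator prepares text data."},
    {"id": "c", "text": "Too short"},
]
curated_jsonl = run_pipeline(raw_docs)
print(curated_jsonl)
```

Because each stage is a generator, documents stream through one at a time. Cleaning runs before deduplication so that documents differing only in whitespace hash to the same value.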

***

## Introduction

Master the fundamentals of NeMo Curator and set up your text processing environment.

* Learn about pipeline architecture and core processing stages for efficient text curation. Tags: `data-structures`, `distributed`, `architecture`
* Learn prerequisites, setup instructions, and initial configuration for text curation. Tags: `setup`, `configuration`, `quickstart`

## Curation Tasks

### Download Data

Download text data from remote sources and import existing datasets into NeMo Curator's processing pipeline.

* Read existing JSONL and Parquet datasets using Curator's reader stages. Tags: `jsonl`, `parquet`
* Download and extract scientific papers from arXiv. Tags: `academic`, `pdf`, `latex`
* Download and extract web archive data from Common Crawl. Tags: `web-data`, `warc`, `distributed`
* Download and extract Wikipedia articles from Wikipedia dumps. Tags: `articles`, `multilingual`, `dumps`
* Implement a download and extract pipeline for a custom data source. Tags: `jsonl`, `parquet`, `custom-formats`
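
As a rough illustration of what loading an existing JSONL dataset involves, the sketch below reads records line by line with Python's standard library. It is not Curator's reader stage: the `read_jsonl` helper and the assumption that each record stores its content under a `text` key are hypothetical. Parquet is not shown because reading it requires a third-party library.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def read_jsonl(handle: TextIO, text_field: str = "text") -> Iterator[Dict]:
    """Yield each line that parses as a JSON object containing `text_field`."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the read
        if isinstance(record, dict) and text_field in record:
            yield record


# Hypothetical in-memory file standing in for a JSONL dataset on disk.
sample = io.StringIO(
    '{"id": 1, "text": "first document"}\n'
    "\n"
    "not valid json\n"
    '{"id": 2, "title": "no text field"}\n'
    '{"id": 3, "text": "second document"}\n'
)
records = list(read_jsonl(sample))
print([r["id"] for r in records])
```

Skipping bad lines silently is a design choice for this sketch. A production loader would more likely count or log rejected lines so that data loss is visible.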

### Process Data

Transform and enhance your text data through comprehensive processing and curation steps.

* Handle multilingual content and language-specific processing. Tags: `language-detection`, `stopwords`, `multilingual`
* Clean, normalize, and transform text content. Tags: `cleaning`, `normalization`, `formatting`
* Remove duplicate and near-duplicate documents efficiently. Tags: `fuzzy-dedup`, `semantic-dedup`, `exact-dedup`
* Score and remove low-quality content. Tags: `heuristics`, `classifiers`, `quality-scoring`
* Domain-specific processing for code and advanced curation tasks. Tags: `code-processing`
* Generate and augment training data using LLMs. Tags: `llm`, `augmentation`, `multilingual`, `nemotron-cc`
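
To show what near-duplicate removal means in practice, the sketch below compares documents by the Jaccard similarity of their word shingles and drops any document too similar to one already kept. This page does not describe how Curator implements fuzzy deduplication, so treat this as a stand-in technique: the pairwise comparison here is quadratic and only suits small examples, and the shingle size and `threshold` value are assumptions.

```python
from typing import List, Set


def shingles(text: str, size: int = 3) -> Set[tuple]:
    """Return the set of `size`-word shingles for a lowercased text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: Set[tuple], b: Set[tuple]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_near_duplicates(texts: List[str], threshold: float = 0.5) -> List[str]:
    """Keep a text only if it is below `threshold` similarity to every kept text."""
    kept: List[str] = []
    kept_shingles: List[Set[tuple]] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


# Hypothetical documents: the first two differ only in their final word.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "large language models need clean training data",
]
unique_docs = drop_near_duplicates(docs)
print(len(unique_docs))
```

The first two documents share six of eight distinct shingles, giving a similarity of 0.75, so the second is dropped at the assumed threshold of 0.5.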