
Copyright © 2026, NVIDIA Corporation.

About Text Curation

NeMo Curator prepares high-quality text data for large language model (LLM) training. The toolkit provides a collection of processors for loading, filtering, formatting, and analyzing text from various sources, composed through a pipeline-based architecture.

Use Cases

  • Clean and prepare web-scraped data from sources like Common Crawl, Wikipedia, and arXiv
  • Create custom text curation pipelines for specific domain needs
  • Scale text processing across CPU and GPU clusters efficiently

Architecture

The following diagram provides a high-level outline of NeMo Curator’s text curation architecture.


Introduction

Master the fundamentals of NeMo Curator and set up your text processing environment.

Concepts

Learn about pipeline architecture and core processing stages for efficient text curation

Get Started

Learn prerequisites, setup instructions, and initial configuration for text curation
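
To make the idea of a pipeline built from processing stages concrete, here is a minimal sketch in plain Python. It does not use NeMo Curator's API: `run_pipeline`, `strip_whitespace`, and `drop_empty` are hypothetical names chosen for illustration, and documents are assumed to be dictionaries with a `text` field.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical types for illustration only; not NeMo Curator classes.
Document = dict
Stage = Callable[[Iterable[Document]], Iterable[Document]]


def run_pipeline(docs: Iterable[Document], stages: List[Stage]) -> List[Document]:
    """Pass the documents through each stage in order and collect the result."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


def strip_whitespace(docs: Iterable[Document]) -> Iterator[Document]:
    """Formatting stage: trim leading and trailing whitespace from each text."""
    for doc in docs:
        yield {**doc, "text": doc["text"].strip()}


def drop_empty(docs: Iterable[Document]) -> Iterator[Document]:
    """Filtering stage: discard documents whose text is empty."""
    return (doc for doc in docs if doc["text"])


result = run_pipeline(
    [{"text": "  hello  "}, {"text": "   "}],
    [strip_whitespace, drop_empty],
)
print(result)
```

Each stage takes an iterable of documents and returns another, so stages can be reordered or swapped without changing the runner. The Concepts page describes how Curator itself structures pipelines and stages.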

Curation Tasks

Download Data

Download text data from remote sources and import existing datasets into NeMo Curator’s processing pipeline.

Read Existing Data

Read existing JSONL and Parquet datasets using Curator’s reader stages

arXiv

Download and extract scientific papers from arXiv

Common Crawl

Download and extract web archive data from Common Crawl

Wikipedia

Download and extract Wikipedia articles from Wikipedia dumps

Custom Data Sources

Implement a download and extract pipeline for a custom data source
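
As a rough picture of what reading an existing JSONL dataset involves, the sketch below parses JSON Lines with only the Python standard library. It is not Curator's reader stage: `read_jsonl` is a hypothetical helper, and the `id` and `text` fields in the sample are assumed for illustration.

```python
import io
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-blank line of a JSON Lines source."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# An in-memory stand-in for a .jsonl file; an open file object works the same way.
sample = io.StringIO(
    '{"id": 1, "text": "first"}\n'
    "\n"
    '{"id": 2, "text": "second"}\n'
)
records = list(read_jsonl(sample))
print(records)
```

Because the helper is a generator, records are parsed one at a time rather than loaded all at once, which is the usual reason JSONL is preferred for large text corpora.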

Process Data

Transform and enhance your text data through comprehensive processing and curation steps.

Language Management

Handle multilingual content and language-specific processing

Content Processing & Cleaning

Clean, normalize, and transform text content

Deduplication

Remove duplicate and near-duplicate documents efficiently

Quality Assessment & Filtering

Score and remove low-quality content

Specialized Processing

Domain-specific processing for code and advanced curation tasks
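
Two of the processing tasks above, deduplication and quality filtering, can be illustrated with a small standalone sketch. It covers exact duplicates only (near-duplicate detection needs a different technique and is not shown), and the word-count threshold is an arbitrary stand-in for a real quality score. `exact_dedup` and `min_words_filter` are hypothetical names, not Curator functions.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(docs: Iterable[dict]) -> Iterator[dict]:
    """Keep the first document for each distinct text, compared by SHA-256 digest."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def min_words_filter(docs: Iterable[dict], min_words: int = 3) -> Iterator[dict]:
    """Drop documents with fewer than min_words whitespace-separated words."""
    return (doc for doc in docs if len(doc["text"].split()) >= min_words)


docs = [
    {"text": "the quick brown fox"},
    {"text": "the quick brown fox"},  # exact duplicate of the first document
    {"text": "too short"},            # only two words
]
kept = list(min_words_filter(exact_dedup(docs), min_words=3))
print(kept)
```

Storing digests instead of full texts keeps the memory used for the `seen` set proportional to the number of distinct documents rather than to their total length.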