About Text Curation

NeMo Curator provides text curation capabilities for preparing high-quality data for large language model (LLM) training. The toolkit includes processors for loading, filtering, formatting, and analyzing text data from various sources, composed through a pipeline-based architecture.
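To make the pipeline idea concrete, here is a minimal sketch in plain Python. It does not use NeMo Curator's API: the stage functions, the `run_pipeline` helper, and the `text` field name are all hypothetical and exist only to show how loading, formatting, and filtering steps can be chained.

```python
from typing import Callable, Iterable, Iterator

# A document is a plain dict here; the "text" field name is an assumption.
Document = dict
Stage = Callable[[Iterable[Document]], Iterator[Document]]


def normalize_whitespace(docs: Iterable[Document]) -> Iterator[Document]:
    """Formatting stage: collapse runs of whitespace in each document."""
    for doc in docs:
        yield {**doc, "text": " ".join(doc["text"].split())}


def drop_short(docs: Iterable[Document], min_words: int = 3) -> Iterator[Document]:
    """Filtering stage: keep documents with at least `min_words` words."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


def run_pipeline(docs: Iterable[Document], stages: list[Stage]) -> list[Document]:
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


raw = [
    {"id": 1, "text": "  Hello   world,  this is   a document. "},
    {"id": 2, "text": "too short"},
]
curated = run_pipeline(raw, [normalize_whitespace, drop_short])
print(curated)
```

Each stage takes an iterable of documents and yields documents, so stages can be reordered or swapped without touching the others. A real curation pipeline would distribute this work across workers rather than run it in a single process.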

Use Cases

  • Clean and prepare web-scraped data from sources like Common Crawl, Wikipedia, and arXiv

  • Create custom text curation pipelines for specific domain needs

  • Scale text processing across CPU and GPU clusters efficiently

Architecture

The following diagram provides a high-level outline of NeMo Curator’s text curation architecture.

High-level outline of NeMo Curator's text curation architecture

Introduction

Master the fundamentals of NeMo Curator and set up your text processing environment.

Concepts

Learn about pipeline architecture and core processing stages for efficient text curation

Get Started

Learn prerequisites, setup instructions, and initial configuration for text curation


Curation Tasks

Download Data

Download text data from remote sources and import existing datasets into NeMo Curator’s processing pipeline.

Read Existing Data

Read existing JSONL and Parquet datasets using Curator’s reader stages

arXiv

Extract and process scientific papers from arXiv

Common Crawl

Load and preprocess text data from Common Crawl web archives

Wikipedia

Import and process Wikipedia articles for training datasets

Custom Data

Load your own text datasets in various formats


Process Data

Transform and enhance your text data through comprehensive processing and curation steps.

Quality Assessment & Filtering

Score and remove low-quality content using heuristics and ML classifiers

Deduplication

Remove duplicate and near-duplicate documents efficiently

Content Processing & Cleaning

Clean, normalize, and transform text content

Language Management

Handle multilingual content and language-specific processing

Specialized Processing

Domain-specific processing for code and advanced curation tasks

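As a rough illustration of two of the processing steps listed above, the sketch below applies a word-count heuristic filter and then exact deduplication using only the Python standard library. It is not NeMo Curator's implementation: `word_count_ok`, `exact_dedup`, and the `min_words` threshold are hypothetical, and near-duplicate detection is not covered here.

```python
import hashlib


def word_count_ok(text: str, min_words: int = 5) -> bool:
    """Heuristic quality check: require at least `min_words` words."""
    return len(text.split()) >= min_words


def exact_dedup(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, comparing on a hash of the
    lowercased, whitespace-normalized content."""
    seen = set()
    kept = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Click here",  # too short to pass the heuristic
    "A second, different document with enough words.",
]
filtered = [d for d in docs if word_count_ok(d)]
deduped = exact_dedup(filtered)
print(deduped)
```

Filtering before deduplication keeps the set of hashes smaller, since low-quality documents are discarded before any hash is computed.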