Text Curation
- Downloading and Extracting Text
Downloading a massive public dataset is usually the first step in data curation, and it can be cumbersome due to the dataset's size and hosting method. This section describes how to download and extract large corpora efficiently. NeMo Curator supports multiple content extraction methods, including jusText, Resiliparse, and Trafilatura, to cleanly extract text from web content.
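To illustrate the kind of work these extractors do, here is a minimal sketch of pulling visible text out of HTML using only Python's standard library. This is not how jusText, Resiliparse, or Trafilatura are implemented; they additionally handle boilerplate detection, encoding issues, and malformed markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style and non-empty.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)

extractor = TextExtractor()
extractor.feed("<html><head><script>var x=1;</script></head>"
               "<body><p>Hello <b>world</b>.</p></body></html>")
extractor.text()  # → "Hello world ."
```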
- Working with DocumentDataset
DocumentDataset is the standard format for datasets in NeMo Curator. This section describes how to get datasets in and out of this format, as well as how DocumentDataset interacts with the modules.
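DocumentDataset is typically built from JSONL or Parquet files with one record per document. As a rough, stdlib-only sketch of that round trip (the actual class reads and writes through Dask dataframes, which this deliberately omits):

```python
import json
import os
import tempfile

def write_jsonl(records, path):
    # One JSON object per line, the on-disk layout JSONL-based
    # dataset loaders expect.
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

docs = [{"id": "doc-0", "text": "First document."},
        {"id": "doc-1", "text": "Second document."}]
path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")
write_jsonl(docs, path)
assert read_jsonl(path) == docs
```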
- CPU and GPU Modules with Dask
NeMo Curator provides both CPU based modules and GPU based modules and supports methods for creating compatible Dask clusters and managing the dataset transfer between CPU and GPU.
- Document Filtering
This section describes how to use the 30+ heuristic and classifier filters available within NeMo Curator, and how to implement custom filters to apply to the documents in your corpora.
- Language Identification
Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages.
- Text Cleaning
Much of the text on the Internet is malformed or poorly formatted. NeMo Curator can fix many of these issues.
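A small taste of what such cleaning involves, using only the standard library: Unicode compatibility normalization plus whitespace collapsing. Full cleaning pipelines handle far more (mojibake repair, control characters, broken markup), so treat this as a sketch of the idea.

```python
import unicodedata

def clean_text(text):
    # NFKC folds compatibility characters (ligatures, full-width forms,
    # non-breaking spaces) into plain equivalents; split/join collapses
    # runs of whitespace.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

clean_text("ﬁle  name\u00a0here")  # → "file name here"
```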
- Stop Words in Text Processing
Stop words are common words that are often filtered out in NLP tasks because they typically don’t carry significant meaning. NeMo Curator provides built-in stop word lists for various languages to support text analysis and extraction.
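The basic operation is simple enough to sketch in a few lines. The word list below is a tiny illustrative sample, not one of NeMo Curator's built-in lists.

```python
# A tiny illustrative English stop-word set; real lists are much larger
# and language-specific.
ENGLISH_STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def remove_stop_words(text, stop_words=ENGLISH_STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

remove_stop_words("the cat is in the garden")  # → "cat garden"
```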
- GPU Accelerated Exact and Fuzzy Deduplication
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
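Exact deduplication boils down to hashing each document and keeping the first occurrence of each hash. A minimal single-process sketch of that idea (the GPU implementation distributes this over cuDF partitions):

```python
import hashlib

def exact_dedup(docs):
    """Keep the first copy of each document; drop later copies whose
    normalized text hashes to an already-seen value."""
    seen, unique = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

exact_dedup(["Hello world", "hello  WORLD", "something else"])
# → ["Hello world", "something else"]
```

Fuzzy deduplication replaces the single hash with MinHash signatures and locality-sensitive hashing so that near-duplicates also collide.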
- GPU Accelerated Semantic Deduplication
NeMo Curator provides scalable and GPU accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, crossfit and PyTorch.
- Distributed Data Classification
NeMo Curator provides a scalable and GPU accelerated module to help users run inference with pre-trained models on large volumes of text documents.
- Synthetic Data Generation
Synthetic data generation tools and example pipelines are available within NeMo Curator.
- Downstream Task Decontamination
After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is a potential for leakage of this test data into the model’s training dataset. NeMo Curator allows you to remove sections of documents in your dataset that are present in downstream tasks.
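A common way to detect such leakage is n-gram overlap between training documents and benchmark test data. The sketch below only flags a contaminated document; NeMo Curator goes further and removes the overlapping spans. The n-gram size here is an arbitrary example.

```python
def ngrams(text, n=8):
    """All n-grams of whitespace tokens, lowercased."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(doc, test_ngrams, n=8):
    """True if the document shares any n-gram with downstream test data."""
    return bool(ngrams(doc, n) & test_ngrams)

test_ngrams = ngrams("what is the capital of france", n=3)
is_contaminated("someone asked what is the capital of france yesterday",
                test_ngrams, n=3)  # → True
```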
- Personally Identifiable Information Identification and Removal
The purpose of the personally identifiable information (PII) redaction tool is to help scrub sensitive data out of training datasets.
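For a flavor of what redaction looks like, here is a toy regex-based sketch covering two entity types. Production PII tooling relies on NER models plus many more pattern types (names, addresses, credit cards, and so on), so these patterns are illustrative only.

```python
import re

# Illustrative patterns only; deliberately simplistic.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace matched emails and US-style phone numbers with tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

redact_pii("Mail jane@example.com or call 555-123-4567 today.")
# → "Mail [EMAIL] or call [PHONE] today."
```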