Text Processing Concepts#
This guide covers the most common text processing workflows in NVIDIA NeMo Curator, based on real-world usage patterns from production data curation pipelines.
Most Common Workflows#
The majority of NeMo Curator users follow these core workflows, typically in this order:
1. Quality Filtering#
Most users start with basic quality filtering using heuristic filters to remove low-quality content:
Essential Quality Filters:
- `WordCountFilter`: Remove too short/long documents
- `NonAlphaNumericFilter`: Remove symbol-heavy content
- `RepeatedLinesFilter`: Remove content that is too repetitive
- `PunctuationFilter`: Ensure proper sentence structure
- `BoilerPlateStringFilter`: Remove content that contains too much template/boilerplate text
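As a quick preview of the pattern used throughout this guide, each heuristic filter is wrapped in a `ScoreFilter` stage before being added to a pipeline; the complete pipeline under Implementation Examples below chains all five filters this way. The threshold values here are illustrative only.

```python
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import WordCountFilter

# Wrap a heuristic filter in a ScoreFilter stage so it can be added to a Pipeline.
# The min_words/max_words values are illustrative; tune them for your corpus.
word_count_filter = ScoreFilter(
    filter_obj=WordCountFilter(min_words=50, max_words=100000),
    text_field="text",
)
```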
2. Content Cleaning and Modification#
Basic text normalization and cleaning operations:
Common Cleaning Steps:
- `UnicodeReformatter`: Normalize Unicode characters
- `NewlineNormalizer`: Standardize line breaks
- Basic HTML/markup removal
3. Deduplication#
Remove duplicate and near-duplicate content. For comprehensive coverage of all deduplication approaches, refer to Curator’s Deduplication Concepts.
Exact Deduplication#
Remove identical documents, especially useful for smaller datasets:
Implementation: MD5 or SHA-256 hashing for document identification
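As a conceptual illustration (plain Python, not the NeMo Curator workflow shown later), exact deduplication amounts to hashing each document and keeping only the first occurrence of each digest:

```python
import hashlib

def exact_duplicate_ids(documents: list[str]) -> set[int]:
    """Return indices of documents whose text exactly duplicates an earlier document."""
    seen: dict[str, int] = {}
    duplicates: set[int] = set()
    for idx, text in enumerate(documents):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()  # or hashlib.sha256
        if digest in seen:
            duplicates.add(idx)
        else:
            seen[digest] = idx
    return duplicates

docs = ["hello world", "goodbye", "hello world"]
print(exact_duplicate_ids(docs))  # {2}
```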
Fuzzy Deduplication#
For production datasets, fuzzy deduplication is essential to remove near-duplicate content across sources:
Key Components:
- Ray distributed computing framework for scalability
- Connected components clustering for duplicate identification
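The fuzzy deduplication workflow shown later in this guide uses 20 LSH bands of 13 MinHashes each, which corresponds to a similarity threshold of roughly 80%. Using the standard MinHash LSH approximation, the threshold is about (1/bands)^(1/minhashes_per_band); a quick check:

```python
# Approximate Jaccard similarity at which a document pair becomes likely
# to collide in at least one LSH band: t ~ (1/b)^(1/r)
num_bands = 20           # b: number of LSH bands
minhashes_per_band = 13  # r: MinHashes per band

threshold = (1 / num_bands) ** (1 / minhashes_per_band)
print(f"approximate similarity threshold: {threshold:.2f}")  # ~0.79, i.e. ~80%
```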
Semantic Deduplication#
Remove semantically similar content using embeddings for more sophisticated duplicate detection.
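Refer to Curator's Deduplication Concepts (linked above) for the supported semantic deduplication workflow. As a purely conceptual sketch (not the NeMo Curator API), the idea is to flag document pairs whose embeddings exceed a cosine-similarity threshold; the embeddings below are assumed to be precomputed:

```python
import numpy as np

def semantic_duplicate_pairs(embeddings: np.ndarray, threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs whose cosine similarity exceeds the threshold.

    `embeddings` is an (n_docs, dim) array of precomputed document embeddings.
    The O(n^2) comparison is for illustration only; production systems cluster first.
    """
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normalized @ normalized.T
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if similarity[i, j] >= threshold:
                pairs.append((i, j))
    return pairs
```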
Core Processing Architecture#
NeMo Curator uses these fundamental building blocks that users combine into pipelines:
| Component | Purpose | Usage Pattern |
|---|---|---|
| `Pipeline` | Orchestrate processing stages | Add processing stages, typically starting with a read and completing with a write |
| `ScoreFilter` | Apply filters with optional scoring | Chain multiple quality filters |
| `Modify` | Transform document content | Clean and normalize text |
| Reader/Writer Stages | Load and save text data | Input/output for pipelines |
| Processing Stages | Transform DocumentBatch tasks | Core processing components |
Implementation Examples#
Complete Quality Filtering Pipeline#
This is the most common starting workflow, used in 90% of production pipelines:
Quality Filtering Pipeline Code Example
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import (
    WordCountFilter,
    NonAlphaNumericFilter,
    RepeatedLinesFilter,
    PunctuationFilter,
    BoilerPlateStringFilter,
)
# Start Ray client
ray_client = RayClient()
ray_client.start()
# Create processing pipeline
pipeline = Pipeline(name="quality_filtering")
# Load dataset - the starting point for all workflows
reader = JsonlReader(file_paths="input_data/")
pipeline.add_stage(reader)
# Standard quality filtering pipeline (most common)
# Remove too short/long documents (essential)
# and save the word_count field
word_count_filter = ScoreFilter(
    filter_obj=WordCountFilter(min_words=50, max_words=100000),
    text_field="text",
    score_field="word_count"
)
pipeline.add_stage(word_count_filter)
# Remove symbol-heavy content
alpha_numeric_filter = ScoreFilter(
    filter_obj=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
    text_field="text"
)
pipeline.add_stage(alpha_numeric_filter)
# Remove repetitive content
repeated_lines_filter = ScoreFilter(
    filter_obj=RepeatedLinesFilter(max_repeated_line_fraction=0.7),
    text_field="text"
)
pipeline.add_stage(repeated_lines_filter)
# Ensure proper sentence structure
punctuation_filter = ScoreFilter(
    filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85),
    text_field="text"
)
pipeline.add_stage(punctuation_filter)
# Remove template/boilerplate text
boilerplate_filter = ScoreFilter(
    filter_obj=BoilerPlateStringFilter(),
    text_field="text"
)
pipeline.add_stage(boilerplate_filter)
# Add writer stage
writer = JsonlWriter(path="filtered_data/")
pipeline.add_stage(writer)
# Execute pipeline
results = pipeline.run()
# Cleanup Ray when done
ray_client.stop()
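Passing `score_field` (as on the word count stage above) keeps the computed metric as a new field in the output rather than discarding it, which is useful for later analysis or threshold tuning; the remaining stages omit it and simply filter.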
Content Cleaning Pipeline#
Basic text normalization:
Content Cleaning Pipeline Code Example
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import Modify
from nemo_curator.stages.text.modifiers import UnicodeReformatter
# Start Ray client
ray_client = RayClient()
ray_client.start()
# Create cleaning pipeline
pipeline = Pipeline(name="content_cleaning")
# Read input data
reader = JsonlReader(file_paths="input_data/")
pipeline.add_stage(reader)
# Essential cleaning steps
# Normalize unicode characters (very common)
unicode_modifier = Modify(
    modifier_fn=UnicodeReformatter(),
    input_fields="text"
)
pipeline.add_stage(unicode_modifier)
# Additional processing steps can be added as needed
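# (Assumption) If your NeMo Curator build also ships the NewlineNormalizer modifier
# mentioned above, it can be chained the same way to standardize line breaks:
#   newline_modifier = Modify(modifier_fn=NewlineNormalizer(), input_fields="text")
#   pipeline.add_stage(newline_modifier)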
# Write cleaned data
writer = JsonlWriter(path="cleaned_data/")
pipeline.add_stage(writer)
# Execute pipeline
results = pipeline.run()
# Cleanup Ray when done
ray_client.stop()
Exact Deduplication Workflow#
Exact deduplication for any dataset size (requires Ray and at least 1 GPU):
Exact Deduplication Code Example
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow
# Initialize Ray cluster with GPU support (required for exact deduplication)
ray_client = RayClient(num_gpus=4)
ray_client.start()
# Configure exact deduplication workflow
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    assign_id=True,  # Automatically assign unique IDs
    input_filetype="parquet",
)
# Run exact deduplication workflow
exact_workflow.run()
# Cleanup Ray when done
ray_client.stop()
Fuzzy Deduplication Workflow#
Critical for production datasets (requires Ray and at least 1 GPU):
Fuzzy Deduplication Code Example
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
# Initialize Ray cluster with GPU support (required for fuzzy deduplication)
ray_client = RayClient(num_gpus=4)
ray_client.start()
# Configure fuzzy deduplication workflow (production settings)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    input_filetype="parquet",
    input_blocksize="1.5GiB",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    # LSH parameters for ~80% similarity threshold
    num_bands=20,  # Number of LSH bands
    minhashes_per_band=13,  # Hashes per band
    char_ngrams=24,  # Character n-gram size
    seed=42
)
# Run fuzzy deduplication workflow
fuzzy_workflow.run()
# Cleanup Ray when done
ray_client.stop()
Removing Identified Duplicates#
The identified duplicates can be removed using a separate workflow:
Duplicate Removal Code Example
from nemo_curator.core.client import RayClient
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
# Start Ray client
ray_client = RayClient()
ray_client.start()
# Configure workflow with input dataset and output duplicate IDs
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated/output",
    input_filetype="parquet",  # Same as identification workflow
    input_blocksize="1.5GiB",  # Same as identification workflow
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",
)
# Run removal workflow
removal_workflow.run()
# Cleanup Ray when done
ray_client.stop()