Text Processing Concepts#
This guide covers the most common text processing workflows in NVIDIA NeMo Curator, based on real-world usage patterns from production data curation pipelines.
Most Common Workflows#
The majority of NeMo Curator users follow these core workflows, typically in this order:
1. Quality Filtering#
Most users start with basic quality filtering using heuristic filters to remove low-quality content:
Essential Quality Filters:
- WordCountFilter: Remove too short/long documents
- NonAlphaNumericFilter: Remove symbol-heavy content
- RepeatedLinesFilter: Remove repetitive content
- PunctuationFilter: Ensure proper sentence structure
- BoilerPlateStringFilter: Remove template/boilerplate text
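Each of these filters is applied by wrapping it in a ScoreFilter, as the implementation examples later in this guide show. As a quick orientation, here is a minimal sketch of that pattern for a single filter (it assumes a Dask client has already been started with get_client()):
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter
# Keep only documents with a reasonable word count
dataset = DocumentDataset.read_json("data/*.jsonl")
word_count = nc.ScoreFilter(
    WordCountFilter(min_words=50, max_words=10000),
    text_field="text",
)
filtered = word_count(dataset)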
2. Fuzzy Deduplication#
For production datasets, fuzzy deduplication is essential to remove near-duplicate content across sources:
Key Components:
- FuzzyDuplicates: Main deduplication engine
- FuzzyDuplicatesConfig: Configuration for LSH parameters
- Connected components clustering for duplicate identification
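The LSH parameters used in the configuration examples below (buckets and hashes per bucket) set an approximate similarity cutoff: a pair of documents becomes likely to share at least one LSH bucket once its Jaccard similarity exceeds roughly (1/b)^(1/r), where b is the number of buckets and r is the number of hashes per bucket. A small illustrative calculation in plain Python (not NeMo Curator code):
# Approximate LSH banding threshold: t ≈ (1 / num_buckets) ** (1 / hashes_per_bucket)
def lsh_threshold(num_buckets: int, hashes_per_bucket: int) -> float:
    return (1.0 / num_buckets) ** (1.0 / hashes_per_bucket)
# For example, 8 buckets of 13 hashes each targets pairs above ~85% Jaccard similarity
print(round(lsh_threshold(8, 13), 2))  # ~0.85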
3. Content Cleaning#
Basic text normalization and cleaning operations:
Common Cleaning Steps:
- UnicodeReformatter: Normalize Unicode characters
- PiiModifier: Remove or redact personal information
- NewlineNormalizer: Standardize line breaks
- Basic HTML/markup removal
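The first three items are built-in modifiers; basic HTML/markup removal typically uses a small custom modifier. The sketch below assumes the DocumentModifier subclassing pattern and uses a deliberately simple regex, so treat it as an illustration rather than a robust HTML parser:
import re
from nemo_curator.modifiers import DocumentModifier

class SimpleMarkupRemover(DocumentModifier):
    """Illustrative modifier that strips HTML/XML-like tags from each document."""

    def modify_document(self, text: str) -> str:
        # Replace anything that looks like a markup tag with a space
        return re.sub(r"<[^>]+>", " ", text)

# Applied like any other modifier, for example: nc.Modify(SimpleMarkupRemover())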
4. Exact Deduplication#
Remove identical documents, especially useful for smaller datasets:
Implementation:
- ExactDuplicates: Hash-based exact matching
- MD5 or SHA-256 hashing for document identification
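Conceptually, exact deduplication hashes each document and keeps only the first occurrence of every hash value. A plain-Python illustration of the idea (not the NeMo Curator implementation, which runs the same comparison distributed over Dask):
import hashlib

def dedupe_exact(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document, identified by MD5 hash."""
    seen: set[str] = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

print(dedupe_exact(["a b c", "a b c", "x y z"]))  # ['a b c', 'x y z']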
Core Processing Architecture#
NeMo Curator uses these fundamental building blocks that users combine into pipelines:
| Component | Purpose | Usage Pattern |
|---|---|---|
| DocumentDataset | Load, process, and save text data | Every workflow starts here |
| get_client | Initialize distributed processing | Required for all workflows |
| ScoreFilter | Apply filters with optional scoring | Chain multiple quality filters |
| Sequential | Combine processing steps | Build multi-stage pipelines |
| Modify | Transform document content | Clean and normalize text |
Implementation Examples#
Complete Quality Filtering Pipeline#
This is the most common starting workflow, used in 90% of production pipelines:
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
WordCountFilter,
NonAlphaNumericFilter,
RepeatedLinesFilter,
PunctuationFilter,
BoilerPlateStringFilter
)
from nemo_curator.utils.distributed_utils import get_client
# Initialize distributed processing (required for all workflows)
client = get_client() # Defaults to CPU cluster - use cluster_type="gpu" for acceleration
# Load dataset - the starting point for all workflows
dataset = DocumentDataset.read_json("data/*.jsonl")
# Standard quality filtering pipeline (most common)
quality_filters = nc.Sequential([
# Remove too short/long documents (essential)
nc.ScoreFilter(
WordCountFilter(min_words=50, max_words=10000),
text_field="text",
score_field="word_count"
),
# Remove symbol-heavy content
nc.ScoreFilter(
NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
text_field="text"
),
# Remove repetitive content
nc.ScoreFilter(
RepeatedLinesFilter(max_repeated_line_fraction=0.7),
text_field="text"
),
# Ensure proper sentence structure
nc.ScoreFilter(
PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85),
text_field="text"
),
# Remove template/boilerplate text
nc.ScoreFilter(
BoilerPlateStringFilter(),
text_field="text"
)
])
# Apply filtering
filtered_dataset = quality_filters(dataset)
filtered_dataset.to_json("filtered_data/")
Content Cleaning Pipeline#
Basic text normalization:
import nemo_curator as nc
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.modifiers.pii_modifier import PiiModifier
from nemo_curator.utils.distributed_utils import get_client
# Initialize distributed processing
client = get_client() # Use cluster_type="gpu" for faster processing when available
# Essential cleaning steps
cleaning_pipeline = nc.Sequential([
# Normalize unicode characters (very common)
nc.Modify(UnicodeReformatter()),
# Remove/redact PII (important for production)
nc.Modify(PiiModifier(
supported_entities=["PERSON", "EMAIL", "PHONE_NUMBER"],
anonymize_action="replace"
))
])
cleaned_dataset = cleaning_pipeline(dataset)
Large-Scale Fuzzy Deduplication#
Critical for production datasets (requires GPU):
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.utils.distributed_utils import get_client
# Initialize GPU processing (required for fuzzy deduplication)
client = get_client(cluster_type="gpu")
# Configure fuzzy deduplication (production settings)
fuzzy_config = FuzzyDuplicatesConfig(
    cache_dir="./cache",
    hashes_per_bucket=13,  # MinHash hashes per LSH bucket
    num_buckets=8          # LSH buckets (bands); targets an ~85% Jaccard similarity threshold
)
# Total MinHash signature length = num_buckets * hashes_per_bucket
# Identify near-duplicate groups, then remove them
fuzzy_dedup = FuzzyDuplicates(fuzzy_config)
duplicates = fuzzy_dedup(dataset)
deduplicated_dataset = fuzzy_dedup.remove(dataset, duplicates)
Exact Deduplication (All dataset sizes)#
Quick deduplication for any dataset size:
from nemo_curator.modules import ExactDuplicates
from nemo_curator.utils.distributed_utils import get_client
# Initialize distributed processing (works on CPU or GPU)
client = get_client() # Use cluster_type="gpu" for faster hashing when available
# Remove exact duplicates using MD5 hashing
exact_dedup = ExactDuplicates(
id_field="id",
text_field="text",
hash_method="md5"
)
# Find duplicates
duplicates = exact_dedup(dataset)
# Remove them
deduped_dataset = exact_dedup.remove(dataset, duplicates)
Complete End-to-End Pipeline#
Most users combine these steps into a comprehensive workflow:
import nemo_curator as nc
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    NonAlphaNumericFilter,
    RepeatedLinesFilter,
    BoilerPlateStringFilter
)
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.modifiers.pii_modifier import PiiModifier
from nemo_curator.modules import ExactDuplicates
from nemo_curator.utils.distributed_utils import get_client
# Initialize distributed processing
client = get_client() # Defaults to CPU - add cluster_type="gpu" for acceleration
# Complete production pipeline (most common pattern)
def build_production_pipeline():
return nc.Sequential([
# 1. Content cleaning first
nc.Modify(UnicodeReformatter()),
nc.Modify(PiiModifier(supported_entities=["PERSON"], anonymize_action="replace")),
# 2. Quality filtering
nc.ScoreFilter(WordCountFilter(min_words=50, max_words=10000), text_field="text"),
nc.ScoreFilter(NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25), text_field="text"),
nc.ScoreFilter(RepeatedLinesFilter(max_repeated_line_fraction=0.7), text_field="text"),
nc.ScoreFilter(BoilerPlateStringFilter(), text_field="text"),
# 3. Deduplication (fuzzy or exact depending on scale)
])
# Apply the complete pipeline
complete_pipeline = build_production_pipeline()
processed_dataset = complete_pipeline(dataset)
# Then apply deduplication separately for large datasets
if len(dataset.df) > 1_000_000:  # Large dataset: fuzzy dedup (GPU required)
    fuzzy_dedup = FuzzyDuplicates(FuzzyDuplicatesConfig(cache_dir="./cache"))
    duplicates = fuzzy_dedup(processed_dataset)
    final_dataset = fuzzy_dedup.remove(processed_dataset, duplicates)
else:  # Smaller dataset: exact dedup
    exact_dedup = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
    duplicates = exact_dedup(processed_dataset)
    final_dataset = exact_dedup.remove(processed_dataset, duplicates)
Advanced Usage Patterns#
GPU-Accelerated Processing#
For faster processing when GPUs are available (some operations require GPU):
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client
# Initialize GPU cluster for acceleration
client = get_client(
cluster_type="gpu",
rmm_pool_size="4GB",
enable_spilling=True
)
# Process dataset with GPU acceleration
dataset = DocumentDataset.read_json("data/*.jsonl", backend="cudf")
# Apply processing with GPU acceleration
processed_dataset = complete_pipeline(dataset)
GPU acceleration benefits:
- Required for fuzzy deduplication operations
- Faster processing for classification and embedding operations
- More efficient memory usage with RMM for large datasets
- Significant speedup for MinHash and LSH operations (16x faster for fuzzy deduplication)
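If a dataset was loaded on the CPU, it can usually be moved to the GPU backend before GPU-only stages such as fuzzy deduplication. This sketch assumes the underlying Dask dataframe supports to_backend (available in recent Dask releases):
from nemo_curator.datasets import DocumentDataset
# Read on the CPU (pandas-backed Dask dataframe) ...
cpu_dataset = DocumentDataset.read_json("data/*.jsonl")
# ... then rewrap it around a cuDF-backed dataframe before GPU-only stages
gpu_dataset = DocumentDataset(cpu_dataset.df.to_backend("cudf"))
# Move back to the CPU backend for stages that do not need the GPU
cpu_again = DocumentDataset(gpu_dataset.df.to_backend("pandas"))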
Multi-Node Distributed Processing#
For production-scale data processing across multiple machines:
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client
# Connect to existing multi-node cluster
client = get_client(
scheduler_address="tcp://scheduler-node:8786"
)
# Process large dataset across multiple nodes
large_dataset = DocumentDataset.read_json("large_data/*.jsonl", backend="cudf")
# Apply fuzzy deduplication at scale (most common large-scale operation)
fuzzy_config = FuzzyDuplicatesConfig(
    cache_dir="./cache",
    hashes_per_bucket=13,
    num_buckets=8
)
fuzzy_dedup = FuzzyDuplicates(fuzzy_config)
duplicates = fuzzy_dedup(large_dataset)
deduplicated_large = fuzzy_dedup.remove(large_dataset, duplicates)
# Save results with partitioning for efficient storage
deduplicated_large.to_json("output/", write_to_filename=True)
Domain-Specific Processing#
Common patterns for specialized content:
import nemo_curator as nc
from nemo_curator.filters import (
    WordCountFilter,
    NonAlphaNumericFilter,
    BoilerPlateStringFilter,
    UrlsFilter,
    AlphaFilter,
    TokenCountFilter,
    PythonCommentToCodeFilter,
    FastTextQualityFilter
)
from nemo_curator.utils.distributed_utils import get_client
# Initialize distributed processing
client = get_client() # Add cluster_type="gpu" for acceleration when available
# Web crawl data processing (very common)
web_pipeline = nc.Sequential([
nc.ScoreFilter(WordCountFilter(min_words=100)), # Web pages are longer
    nc.ScoreFilter(NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.3)),  # More lenient for web
nc.ScoreFilter(BoilerPlateStringFilter()), # Remove navigation/footers
    nc.ScoreFilter(UrlsFilter(max_url_to_text_ratio=0.2)),  # Limit URL-heavy content
])
# Code dataset processing
code_pipeline = nc.Sequential([
nc.ScoreFilter(AlphaFilter(min_alpha_ratio=0.25)), # Code has symbols
nc.ScoreFilter(TokenCountFilter(min_tokens=20)), # Reasonable file sizes
nc.ScoreFilter(PythonCommentToCodeFilter()), # Code quality metrics
])
# Academic/research content
academic_pipeline = nc.Sequential([
nc.ScoreFilter(WordCountFilter(min_words=500)), # Academic papers are longer
    nc.ScoreFilter(FastTextQualityFilter(model_path="academic_quality.bin")),  # Domain-specific quality model (example path)
])
Configuration-Driven Processing#
For reproducible production pipelines:
from nemo_curator.utils.config_utils import build_filter_pipeline
from nemo_curator.utils.distributed_utils import get_client
# Initialize distributed processing
client = get_client() # Add cluster_type="gpu" for acceleration when available
# Most production users define pipelines in configuration
def build_config_pipeline(config_file):
"""Build pipeline from YAML configuration"""
    # Load and parse the YAML filter configuration into a pipeline
    filter_pipeline = build_filter_pipeline(config_file)
    return filter_pipeline
# Use configuration for consistent processing
config_pipeline = build_config_pipeline("production_filters.yaml")
processed_data = config_pipeline(dataset)
Performance Best Practices#
Scale-Based Approach Selection#
| Dataset Size | Recommended Approach | Key Considerations |
|---|---|---|
| Small (<1GB) | Single node, exact deduplication | CPU cluster suitable, GPU optional for speed |
| Medium (1-100GB) | Single node, fuzzy deduplication | GPU required for fuzzy deduplication operations |
| Large (>100GB) | Multi-node cluster, optimized fuzzy dedup | Distributed processing with GPU acceleration |
Hardware-Based Recommendations#
| Available Hardware | Recommended Setup | Performance Benefits |
|---|---|---|
| GPU Available | get_client(cluster_type="gpu") | Required for fuzzy deduplication, faster classification and embeddings |
| CPU Only | get_client(cluster_type="cpu") | Good performance for filtering and exact deduplication |
| Multi-Node Cluster | get_client(scheduler_address=...) | Scales to massive datasets, distributes compute across nodes |
Production Optimization Guidelines#
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    NonAlphaNumericFilter,
    RepeatedLinesFilter,
    FastTextQualityFilter
)
from nemo_curator.utils.distributed_utils import get_client
# Initialize distributed processing (choose based on operations needed)
client = get_client() # CPU default - reliable for all basic operations
# 1. Order operations by computational cost (most important optimization)
production_pipeline = nc.Sequential([
# Cheapest operations first (filter out bad data early)
nc.ScoreFilter(WordCountFilter(min_words=10)), # Very fast
nc.ScoreFilter(NonAlphaNumericFilter()), # Fast
nc.ScoreFilter(RepeatedLinesFilter()), # Medium cost
# More expensive operations on remaining data
    nc.ScoreFilter(FastTextQualityFilter(model_path="quality_classifier.bin")),  # Model-based filter (example path), most expensive here
# Deduplication separate and last (most expensive)
])
# 2. Use appropriate backend for your operations
dataset = DocumentDataset.read_json("data/*.jsonl") # pandas backend (CPU)
# For GPU operations, convert: dataset.df.to_backend("cudf")
# 3. Batch processing for memory efficiency
processed = production_pipeline(dataset)
processed.to_json("output/", files_per_partition=1) # Control output partitioning
Advanced Client Configuration#
For specialized use cases, configure the client with specific parameters:
from nemo_curator.utils.distributed_utils import get_client

# GPU acceleration for operations that support or require it
client = get_client(
cluster_type="gpu",
rmm_pool_size="8GB",
enable_spilling=True,
set_torch_to_use_rmm=True
)
# Multi-node production cluster
client = get_client(
scheduler_address="tcp://scheduler-node:8786"
)
# Custom CPU cluster configuration
client = get_client(
cluster_type="cpu",
n_workers=16,
threads_per_worker=2,
memory_limit="8GB"
)
Command Line Usage#
Most production users prefer command-line tools for automation. All NeMo Curator scripts automatically set up distributed processing:
# Most common: Basic quality filtering (uses get_client internally)
filter_documents \
--input-data-dir=input/ \
--filter-config-file=heuristic_filters.yaml \
--output-retained-document-dir=output/ \
--device=cpu \
--num-workers=8
# GPU acceleration for faster processing
filter_documents \
--input-data-dir=input/ \
--filter-config-file=heuristic_filters.yaml \
--output-retained-document-dir=output/ \
--device=gpu
# Large-scale: Fuzzy deduplication (4-step process)
# Step 1: Compute minhashes
gpu_compute_minhashes \
--input-data-dir=input/ \
--output-minhash-dir=minhashes/ \
--cache-dir=cache/ \
--device=gpu
# Step 2: LSH bucketing
minhash_buckets \
--input-minhash-dir=minhashes/ \
--output-bucket-dir=buckets/ \
--cache-dir=cache/
# Step 3: Find duplicate pairs
buckets_to_edges \
--input-bucket-dir=buckets/ \
--output-dir=edges/ \
--cache-dir=cache/
# Step 4: Cluster duplicates with connected components
gpu_connected_component \
--input-edges-dir=edges/ \
--output-dir=deduplicated/ \
--cache-dir=cache/
# Multi-node processing using scheduler
filter_documents \
--input-data-dir=input/ \
--filter-config-file=heuristic_filters.yaml \
--output-retained-document-dir=output/ \
--scheduler-address=tcp://scheduler-node:8786
Common Command Line Options#
All NeMo Curator scripts support these distributed processing options:
- --device: Choose cpu or gpu for processing (default: cpu)
- --num-workers: Number of workers for local processing (default: CPU count)
- --scheduler-address: Connect to an existing distributed cluster
- --scheduler-file: Path to the Dask scheduler file
- --threads-per-worker: Threads per worker (default: 1)
These options automatically configure get_client() with the appropriate parameters.
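For reference, these flags correspond roughly to the get_client() parameters used in the Python examples above; the mapping below is a sketch based on those examples rather than an exhaustive list:
from nemo_curator.utils.distributed_utils import get_client
# Roughly what --device=gpu does inside a script
gpu_client = get_client(cluster_type="gpu")
# Roughly what --device=cpu --num-workers=8 --threads-per-worker=2 does
cpu_client = get_client(cluster_type="cpu", n_workers=8, threads_per_worker=2)
# Roughly what --scheduler-address=tcp://scheduler-node:8786 does
cluster_client = get_client(scheduler_address="tcp://scheduler-node:8786")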