Text Processing Concepts
This guide covers the most common text processing workflows in NVIDIA NeMo Curator, based on real-world usage patterns from production data curation pipelines.
Most Common Workflows
The majority of NeMo Curator users follow these core workflows, typically in this order:
1. Quality Filtering
Most users start with basic quality filtering using heuristic filters to remove low-quality content:
Essential Quality Filters:
- WordCountFilter - Remove documents that are too short or too long
- NonAlphaNumericFilter - Remove symbol-heavy content
- RepeatedLinesFilter - Remove documents whose content is too repetitive
- PunctuationFilter - Ensure proper sentence structure
- BoilerPlateStringFilter - Remove documents containing too much template/boilerplate text
2. Content Cleaning and Modification
Basic text normalization and cleaning operations:
Common Cleaning Steps:
- UnicodeReformatter - Normalize Unicode characters
- NewlineNormalizer - Standardize line breaks
- Basic HTML/markup removal
3. Deduplication
Remove duplicate and near-duplicate content. For comprehensive coverage of all deduplication approaches, refer to Curator’s Deduplication Concepts.
Exact Deduplication
Remove identical documents, especially useful for smaller datasets:
Implementation: MD5 or SHA-256 hashing for document identification
Fuzzy Deduplication
For production datasets, fuzzy deduplication is essential to remove near-duplicate content across sources:
Key Components:
- Ray distributed computing framework for scalability
- Connected components clustering for duplicate identification
Semantic Deduplication
Remove semantically similar content using embeddings for more sophisticated duplicate detection.
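In Curator this step computes document embeddings with a GPU model and clusters them at scale; the pure-Python sketch below only illustrates the core idea, assuming embeddings are already available and using brute-force pairwise cosine similarity. The function names, the toy vectors, and the 0.95 threshold are illustrative assumptions, not Curator defaults.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_duplicates(embeddings, threshold=0.95):
    """Return pairs of document ids whose embeddings point in
    nearly the same direction, i.e. likely semantic duplicates."""
    ids = list(embeddings)
    return [
        (ids[i], ids[j])
        for i in range(len(ids))
        for j in range(i + 1, len(ids))
        if cosine(embeddings[ids[i]], embeddings[ids[j]]) >= threshold
    ]

embeddings = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.89, 0.11, 0.01],  # nearly the same direction as "a"
    "c": [0.0, 0.0, 1.0],     # unrelated document
}
```

Production systems replace the brute-force pair loop with clustering or approximate nearest-neighbor search, since pairwise comparison is quadratic in the number of documents.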
Core Processing Architecture
NeMo Curator provides fundamental building blocks that users combine into pipelines:
Implementation Examples
Complete Quality Filtering Pipeline
This is the most common starting workflow for production pipelines:
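Since the Curator filter classes' exact signatures vary by release, the sketch below reproduces the logic of the essential filters in plain Python. The function name and all thresholds (`min_words`, `max_symbol_ratio`, `max_repeated_line_ratio`) are illustrative assumptions, not Curator defaults.

```python
def passes_quality_filters(text, min_words=50, max_words=100_000,
                           max_symbol_ratio=0.25, max_repeated_line_ratio=0.7):
    """Apply heuristic quality filters; return True if the document survives."""
    # WordCountFilter: drop documents that are too short or too long.
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # NonAlphaNumericFilter: drop symbol-heavy content.
    non_alnum = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    if len(text) > 0 and non_alnum / len(text) > max_symbol_ratio:
        return False
    # RepeatedLinesFilter: drop documents whose lines are mostly duplicates.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and len(set(lines)) / len(lines) < (1 - max_repeated_line_ratio):
        return False
    return True

docs = [
    " ".join(["valid content here"] * 40),  # long enough, clean text
    "!!! ### $$$ %%%",                      # symbol heavy and too short
]
kept = [d for d in docs if passes_quality_filters(d, min_words=10)]
```

Each check mirrors one of the filters listed above; in Curator these run as filter stages over a distributed dataset rather than a Python list.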
Content Cleaning Pipeline
Basic text normalization:
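A minimal stand-in for the cleaning steps named above, using only the standard library: NFKC Unicode normalization in place of UnicodeReformatter, newline standardization in place of NewlineNormalizer, and a crude regex for basic markup removal. The function name and regexes are illustrative assumptions, not Curator internals.

```python
import re
import unicodedata

def clean_text(text):
    """Minimal cleaning pass: Unicode normalization, newline
    standardization, and crude HTML tag removal."""
    # Normalize Unicode characters (stand-in for UnicodeReformatter).
    text = unicodedata.normalize("NFKC", text)
    # Standardize line breaks (stand-in for NewlineNormalizer).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Basic HTML/markup removal.
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of spaces/tabs left behind by tag removal.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()
```

Note the tag-stripping regex is deliberately naive; real pipelines use a proper HTML parser for anything beyond trivial markup.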
Exact Deduplication Workflow
Exact deduplication for any dataset size (requires Ray and at least 1 GPU):
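Curator runs this step distributed over Ray with GPU acceleration; the single-process sketch below shows the underlying technique at toy scale: hash each document's text with MD5 and keep the first occurrence of each digest (swap in `hashlib.sha256` for SHA-256). The function name and dict-based document shape are illustrative assumptions.

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each unique document,
    identified by an MD5 hash of its normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc["text"].strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    {"id": 1, "text": "Hello world"},
    {"id": 2, "text": "Hello world"},   # exact duplicate of id 1
    {"id": 3, "text": "Something else"},
]
deduped = exact_dedup(docs)
```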
Fuzzy Deduplication Workflow
Critical for production datasets (requires Ray and at least 1 GPU):
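Curator's fuzzy deduplication runs MinHash-based similarity detection distributed over Ray with GPUs; the pure-Python sketch below illustrates the two key components named above at toy scale: MinHash signatures to estimate Jaccard similarity between documents, then connected-components grouping (union-find) to cluster duplicate pairs. All names, the shingle size, the signature length, and the 0.8 threshold are illustrative assumptions.

```python
import hashlib
from itertools import combinations

def shingles(text, n=3):
    """Set of n-word shingles for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(shingle_set, num_perm=64):
    """MinHash signature: one minimum salted-hash value per permutation."""
    return [
        min(hashlib.md5(f"{seed}:{s}".encode()).hexdigest() for s in shingle_set)
        for seed in range(num_perm)
    ]

def fuzzy_duplicate_pairs(docs, threshold=0.8):
    """Pairs of ids whose signature agreement (estimated Jaccard
    similarity) exceeds the threshold."""
    sigs = {d["id"]: minhash(shingles(d["text"])) for d in docs}
    pairs = []
    for a, b in combinations(docs, 2):
        sa, sb = sigs[a["id"]], sigs[b["id"]]
        agreement = sum(x == y for x, y in zip(sa, sb)) / len(sa)
        if agreement >= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs

def group_duplicates(pairs):
    """Connected-components grouping of duplicate pairs via union-find."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return [g for g in groups.values() if len(g) > 1]
```

At scale, the all-pairs comparison is replaced by locality-sensitive hashing, which bands the signatures so only likely duplicates are ever compared.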
Removing Identified Duplicates
The identified duplicates can be removed using a separate workflow:
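A sketch of that removal step, assuming the previous stage produced groups of duplicate document ids: keep one deterministic representative per group and drop the rest. The function name and document shape are illustrative assumptions.

```python
def remove_duplicates(docs, duplicate_groups):
    """Drop all but one representative from each duplicate group."""
    to_remove = set()
    for group in duplicate_groups:
        keep = min(group)              # deterministic representative
        to_remove |= set(group) - {keep}
    return [d for d in docs if d["id"] not in to_remove]

docs = [{"id": i, "text": f"doc {i}"} for i in range(5)]
# ids 0 and 3 are duplicates of each other, as are ids 1 and 4
cleaned = remove_duplicates(docs, [{0, 3}, {1, 4}])
```

Keeping this as a separate pass lets you inspect or audit the identified duplicates before anything is actually dropped from the dataset.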