This guide covers the most common text processing workflows in NVIDIA NeMo Curator, based on real-world usage patterns from production data curation pipelines.
The majority of NeMo Curator users follow these core workflows, typically in this order:
Most users start with basic quality filtering using heuristic filters to remove low-quality content:
Essential Quality Filters:
WordCountFilter - Remove too short/long documents
NonAlphaNumericFilter - Remove symbol-heavy content
RepeatedLinesFilter - Remove if content is too repetitive
PunctuationFilter - Ensure proper sentence structure
BoilerPlateStringFilter - Remove if content contains too much template/boilerplate text

Basic text normalization and cleaning operations:
Common Cleaning Steps:
UnicodeReformatter - Normalize Unicode characters
NewlineNormalizer - Standardize line breaks

Remove duplicate and near-duplicate content. For comprehensive coverage of all deduplication approaches, refer to Curator’s Deduplication Concepts.
Remove identical documents, especially useful for smaller datasets:
Implementation: MD5 or SHA-256 hashing for document identification
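To make the hashing idea concrete, here is a minimal standard-library sketch. It is not NeMo Curator's API: the function name `find_exact_duplicates` and the in-memory `docs` dictionary are assumptions made for illustration, and it runs on a single CPU rather than on Ray.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs, algorithm="sha256"):
    """Group document IDs whose text hashes to the same digest.

    `docs` maps a document ID to its text (a hypothetical layout).
    `algorithm` can be any name accepted by hashlib.new, e.g. "md5".
    Returns only the groups that contain more than one ID.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.new(algorithm, text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


docs = {
    "a": "The quick brown fox.",
    "b": "A different document.",
    "c": "The quick brown fox.",
}
duplicate_groups = find_exact_duplicates(docs)
```

Because only byte-identical text produces the same digest, this approach misses documents that differ by even a single character, which is the gap fuzzy deduplication fills.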
For production datasets, fuzzy deduplication is essential to remove near-duplicate content across sources:
Key Components:
Remove semantically similar content using embeddings for more sophisticated duplicate detection.
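The core comparison behind embedding-based detection can be sketched as a cosine-similarity check. This is an illustration only: the embeddings are assumed to be precomputed by some model, the `threshold` value is hypothetical, and the all-pairs loop is a simplification that would not scale to production data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_duplicates(embeddings, threshold=0.95):
    """Return ID pairs whose embeddings are at least `threshold` similar.

    `embeddings` maps a document ID to its vector (hypothetical layout).
    """
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
                pairs.append((a, b))
    return pairs


# Toy 2-D vectors standing in for real model embeddings.
embeddings = {"a": [1.0, 0.0], "b": [0.99, 0.05], "c": [0.0, 1.0]}
similar_pairs = semantic_duplicates(embeddings)
```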
NeMo Curator uses these fundamental building blocks that users combine into pipelines:
This is the most common starting workflow, used in 90% of production pipelines:
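As a rough illustration of what heuristic filters of this kind check, the sketch below implements simplified stand-ins in plain Python. The function names, thresholds, and exact rules are assumptions chosen for the example and do not reproduce NeMo Curator's filter classes or defaults.

```python
def word_count_ok(text, min_words=50, max_words=100_000):
    """Reject documents that are too short or too long."""
    return min_words <= len(text.split()) <= max_words


def non_alnum_ok(text, max_ratio=0.25):
    """Reject documents dominated by symbols (whitespace is ignored)."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False
    symbols = sum(1 for ch in visible if not ch.isalnum())
    return symbols / len(visible) <= max_ratio


def repeated_lines_ok(text, min_unique_ratio=0.7):
    """Reject documents where too few of the non-empty lines are unique."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    return len(set(lines)) / len(lines) >= min_unique_ratio


def punctuation_ok(text, max_unpunctuated_ratio=0.85):
    """Reject documents where most lines lack end-of-sentence punctuation."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    unpunctuated = sum(1 for ln in lines if not ln.endswith((".", "!", "?", '"')))
    return unpunctuated / len(lines) <= max_unpunctuated_ratio


def passes_quality_filters(text, min_words=50):
    """A document is kept only if every heuristic accepts it."""
    return all((
        word_count_ok(text, min_words=min_words),
        non_alnum_ok(text),
        repeated_lines_ok(text),
        punctuation_ok(text),
    ))


docs = [
    "This is a clean sentence about data curation.\nIt has a second line that ends properly.",
    "too short",
    "#### $$$$ %%%% ^^^^ &&&& **** (((( ))))",
    "Buy now.\nBuy now.\nBuy now.\nBuy now.\nBuy now.",
]
# min_words is lowered here only so the toy documents can pass.
kept = [d for d in docs if passes_quality_filters(d, min_words=5)]
```

Chaining the checks with `all` mirrors the usual pattern of applying cheap filters in sequence, so a document is dropped as soon as any one heuristic rejects it.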
Basic text normalization:
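The two cleaning steps can be approximated with the Python standard library, as in the sketch below. The helper names are hypothetical, and the specific choices (NFC normalization, collapsing runs of three or more newlines to two) are assumptions for illustration rather than a description of how UnicodeReformatter or NewlineNormalizer behave.

```python
import re
import unicodedata


def normalize_unicode(text):
    """Compose characters into a canonical form (NFC)."""
    return unicodedata.normalize("NFC", text)


def normalize_newlines(text):
    """Convert CRLF/CR to LF, then collapse 3+ newlines into exactly 2."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r"\n{3,}", "\n\n", text)


def clean_text(text):
    """Apply Unicode normalization first, then newline normalization."""
    return normalize_newlines(normalize_unicode(text))


# "e" followed by a combining acute accent, plus Windows-style blank lines.
raw = "cafe\u0301\r\n\r\n\r\n\r\nend"
cleaned = clean_text(raw)
```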
Exact deduplication for any dataset size (requires Ray and at least 1 GPU):
Fuzzy deduplication, critical for production datasets (requires Ray and at least 1 GPU):
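One common technique for near-duplicate detection is MinHash over character n-grams, and the CPU-only sketch below assumes that approach for illustration. All names, the shingle size, the number of hash functions, and the threshold are hypothetical, and the all-pairs comparison is a simplification of what a distributed, GPU-backed pipeline would do.

```python
import hashlib
from itertools import combinations


def shingles(text, n=5):
    """Character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def find_near_duplicates(docs, threshold=0.8):
    """Return ID pairs whose estimated similarity reaches the threshold."""
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(sigs), 2)
        if estimated_similarity(sigs[a], sigs[b]) >= threshold
    ]


docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "the quick  brown fox jumps over the lazy dog",
    "c": "Completely unrelated content regarding zebras",
}
near_duplicate_pairs = find_near_duplicates(docs)
```

At scale, the pairwise loop is typically replaced by locality-sensitive hashing, which buckets signatures so that only likely matches are compared.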
The identified duplicates can be removed using a separate workflow:
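Conceptually, removal is a filter over document IDs once the duplicate groups are known. The sketch below assumes a hypothetical layout in which each document is a dictionary with an `id` field and duplicates arrive as groups of IDs. It keeps one representative per group and is not the library's removal workflow.

```python
def ids_to_remove(duplicate_groups):
    """Keep the smallest ID in each group and mark the rest for removal."""
    remove = set()
    for group in duplicate_groups:
        remove.update(sorted(group)[1:])
    return remove


def remove_duplicates(docs, duplicate_groups):
    """Return the documents whose IDs are not marked for removal."""
    drop = ids_to_remove(duplicate_groups)
    return [doc for doc in docs if doc["id"] not in drop]


docs = [
    {"id": "doc-1", "text": "The quick brown fox."},
    {"id": "doc-2", "text": "A different document."},
    {"id": "doc-3", "text": "The quick brown fox."},
]
duplicate_groups = [["doc-3", "doc-1"]]
deduplicated = remove_duplicates(docs, duplicate_groups)
```

Separating identification from removal lets the list of duplicate IDs be inspected or reused before any data is dropped.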