Process Data for Text Curation
Process text data you’ve loaded through NeMo Curator’s pipeline architecture.
NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.
How it Works
NeMo Curator’s text processing capabilities are organized into five main categories:
- Language Management: Handle multilingual content and language-specific processing
- Content Processing & Cleaning: Clean, normalize, and transform text content
- Deduplication: Remove duplicate and near-duplicate documents efficiently
- Quality Assessment & Filtering: Score and remove low-quality content using heuristics and ML classifiers
- Specialized Processing: Domain-specific processing for code and advanced curation tasks
Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.
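The pipeline idea can be sketched in plain Python: each stage is a callable that transforms a collection of documents, and a pipeline simply applies the stages in order. This is a minimal illustration of the concept only; the stage names and `run_pipeline` helper below are hypothetical and do not reflect NeMo Curator's actual API.

```python
# Illustrative sketch of a stage-based curation pipeline.
# All names here are hypothetical, NOT NeMo Curator's API.

def lowercase_stage(docs):
    """Normalize case (a trivial 'content processing' stage)."""
    return [d.lower() for d in docs]

def min_length_stage(docs, min_words=2):
    """Drop documents shorter than min_words (a 'filtering' stage)."""
    return [d for d in docs if len(d.split()) >= min_words]

def dedup_stage(docs):
    """Remove exact duplicates, preserving order (a 'deduplication' stage)."""
    seen, kept = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            kept.append(d)
    return kept

def run_pipeline(docs, stages):
    """Apply each stage to the output of the previous one."""
    for stage in stages:
        docs = stage(docs)
    return docs

docs = ["Hello World", "hello world", "short", "a much longer document here"]
curated = run_pipeline(docs, [lowercase_stage, min_length_stage, dedup_stage])
# "Hello World" and "hello world" collapse into one after case-normalization
```

Note that stage order matters: normalizing before deduplicating lets the exact-duplicate stage catch documents that differ only in case.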
Language Management
Handle multilingual content and language-specific processing requirements.
- Language Identification: identify document languages and separate multilingual datasets (fastText, 176-language detection)
- Stop Words: manage high-frequency words to enhance text extraction and content detection (preprocessing, filtering, language-specific)
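The two ideas above are related: high-frequency stop words are also a cheap signal for separating a multilingual dataset by language. The sketch below buckets documents by counting stop-word hits per language; it is a standalone illustration with toy word lists, not NeMo Curator's implementation (production pipelines use a trained model such as fastText's 176-language identifier).

```python
from collections import Counter

# Toy stop-word lists for illustration only; real language identification
# uses a trained model (e.g. fastText lid.176).
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "a"},
    "es": {"el", "la", "y", "de", "que", "un"},
}

def detect_language(text):
    """Pick the language whose stop words appear most often in the text."""
    tokens = text.lower().split()
    hits = Counter()
    for lang, words in STOP_WORDS.items():
        hits[lang] = sum(1 for t in tokens if t in words)
    lang, count = hits.most_common(1)[0]
    return lang if count > 0 else "unknown"

def split_by_language(docs):
    """Separate a multilingual dataset into per-language buckets."""
    buckets = {}
    for doc in docs:
        buckets.setdefault(detect_language(doc), []).append(doc)
    return buckets

buckets = split_by_language([
    "the cat is on the roof",
    "el gato que vive en la casa",
])
```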
Content Processing & Cleaning
Clean, normalize, and transform text content for high-quality training data.
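Typical cleaning steps include Unicode normalization, stripping control characters, and collapsing whitespace. The function below is a standalone sketch of those three steps using only the standard library; NeMo Curator ships its own modifiers for this, so treat the code as illustrative rather than the library's behavior.

```python
import re
import unicodedata

def clean_text(text):
    """Minimal cleaning sketch: normalize, strip controls, collapse spaces."""
    # Normalize to NFC so visually identical strings compare equal
    # (e.g. "e" + combining accent becomes a single precomposed character).
    text = unicodedata.normalize("NFC", text)
    # Drop non-printable control characters (Unicode category "Cc"),
    # keeping newlines and tabs for the whitespace pass below.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("Cafe\u0301  \x00 menu\n")  # -> "Café menu"
```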
Deduplication
Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.
- Exact Deduplication: identify and remove character-for-character duplicates using MD5 hashing (fast, GPU-accelerated)
- Fuzzy Deduplication: identify and remove near-duplicates using MinHash and LSH similarity (GPU-accelerated)
- Semantic Deduplication: identify and remove semantically similar documents using embeddings and clustering (meaning-based, GPU-accelerated)
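The core logic behind the first two approaches can be sketched with the standard library: exact deduplication keys documents by an MD5 digest, while fuzzy deduplication compares compact MinHash signatures whose fraction of matching slots estimates Jaccard similarity. This is a single-machine illustration of the ideas only; NeMo Curator's implementations run GPU-accelerated over distributed DataFrames.

```python
import hashlib
import random

def exact_dedup(docs):
    """Keep one copy of each character-for-character duplicate (MD5)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, k=3):
    """Word k-grams used as the document's feature set."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(features, num_hashes=64, seed=0):
    """Summarize a feature set as num_hashes minimum hash values."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, f)) for f in features) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching MinHash slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over a lazy dog"
similarity = estimate_jaccard(
    minhash_signature(shingles(a)),
    minhash_signature(shingles(b)),
)  # roughly the Jaccard similarity of the two shingle sets (~0.4 here)
```

In a full fuzzy-dedup pipeline, LSH then bands these signatures so that likely near-duplicates land in the same bucket and only bucket members need pairwise comparison.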
Quality Assessment & Filtering
Score and remove low-quality content using heuristics and ML classifiers.
- Heuristic Filtering: filter text using configurable rules and metrics (fast)
- Classifier Filtering: filter text using trained quality classifiers (ML models, quality scoring)
- Distributed Data Classification: GPU-accelerated classification with pre-trained models (distributed, scalable)
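Heuristic filtering works by scoring each document against simple rules and dropping documents that fail any threshold. The rules and thresholds below (minimum word count, symbol ratio, repeated-line fraction) are common examples but are illustrative only, not NeMo Curator's defaults.

```python
def word_count(text):
    return len(text.split())

def symbol_ratio(text):
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    return sum(not (c.isalnum() or c.isspace()) for c in text) / len(text)

def repeated_line_fraction(text):
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if not lines:
        return 1.0
    return 1.0 - len(set(lines)) / len(lines)

def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3, max_repeat=0.5):
    """A document must pass every rule to be kept."""
    return (
        word_count(text) >= min_words
        and symbol_ratio(text) <= max_symbol_ratio
        and repeated_line_fraction(text) <= max_repeat
    )

docs = [
    "A well formed paragraph with enough words to keep.",
    "$$$ ### !!! @@@",          # too short and mostly symbols: rejected
    "spam\nspam\nspam\nspam",   # too short and heavily repeated: rejected
]
kept = [d for d in docs if passes_heuristics(d)]
```

Classifier-based filtering follows the same keep/drop pattern but replaces the hand-written rules with a trained model's quality score.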
Specialized Processing
Domain-specific processing for code and advanced curation tasks.
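For code datasets, one common domain-specific heuristic is the comment-to-line ratio: source files with almost no comments, or that are almost entirely comments, are often low-value training data. The thresholds and helper names below are illustrative assumptions, not NeMo Curator's specific filters.

```python
def comment_line_ratio(source, comment_prefix="#"):
    """Fraction of non-empty lines that are comments (by prefix)."""
    lines = [l.strip() for l in source.splitlines() if l.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for l in lines if l.startswith(comment_prefix))
    return comments / len(lines)

def keep_code_file(source, low=0.01, high=0.85):
    """Keep files whose comment density falls in a plausible range."""
    return low <= comment_line_ratio(source) <= high

snippet = "# add two numbers\ndef add(a, b):\n    return a + b\n"
ratio = comment_line_ratio(snippet)  # 1 comment line out of 3 lines
```

The comment prefix would be chosen per language (e.g. `//` for C-family languages), which is why code curation is treated as its own specialized category.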