Process Data for Text Curation#
Process text data you’ve loaded into a DocumentDataset.
NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.
How it Works#
NeMo Curator’s text processing capabilities are organized into five main categories:
Quality Assessment & Filtering: Score and remove low-quality content using heuristics and ML classifiers
Deduplication: Remove duplicate and near-duplicate documents efficiently
Content Processing & Cleaning: Clean, normalize, and transform text content
Language Management: Handle multilingual content and language-specific processing
Specialized Processing: Domain-specific processing for code, bitext, and synthetic data
Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.
Quality Assessment & Filtering#
Score and remove low-quality content using heuristics and ML classifiers.
Filter text using configurable rules and metrics
Filter text using trained quality classifiers
GPU-accelerated classification with pre-trained models
Implement and combine your own custom filters
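To make the idea concrete, here is a minimal stdlib sketch of heuristic quality filtering. It is illustrative only, not NeMo Curator's API; the rule names and thresholds (minimum word count, symbol ratio) are assumptions chosen for the example.

```python
# Stdlib sketch of heuristic quality filtering. Rule names and thresholds
# are illustrative assumptions, not NeMo Curator defaults.

def word_count_ok(text, min_words=5, max_words=100_000):
    """Keep documents whose word count falls in a sane range."""
    n = len(text.split())
    return min_words <= n <= max_words

def symbol_ratio_ok(text, max_ratio=0.3):
    """Reject documents dominated by non-alphanumeric symbols."""
    if not text:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_ratio

def passes_heuristics(text):
    """Chain individual rules, mirroring how filters compose in a pipeline."""
    return word_count_ok(text) and symbol_ratio_ok(text)

docs = [
    "A well-formed paragraph with enough words to look like real prose here.",
    "$$$ ###",      # mostly symbols: rejected
    "too short",    # under the word minimum: rejected
]
kept = [d for d in docs if passes_heuristics(d)]
print(len(kept))  # 1
```

In practice each rule would also emit a score, so you can inspect score distributions before committing to a cutoff.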
Deduplication#
Remove duplicate and near-duplicate documents efficiently from your text datasets.
Remove exact and fuzzy duplicates using GPU acceleration
Remove semantically similar documents using embeddings
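The two deduplication modes above can be sketched in plain Python: exact duplicates share a content hash, while near-duplicates are caught by comparing word-shingle overlap. The shingle size and Jaccard threshold below are illustrative assumptions; production fuzzy deduplication uses MinHash/LSH to avoid pairwise comparison.

```python
import hashlib

# Stdlib sketch of exact and near-duplicate removal. Shingle size and the
# similarity threshold are assumptions, not NeMo Curator defaults.

def exact_key(text):
    """Exact duplicates share an identical content hash."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, n=3):
    """Word n-grams used to compare documents for near-duplication."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Set overlap between two shingle sets (1.0 = identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedupe(docs, threshold=0.7):
    """Drop exact copies, then near-duplicates above the Jaccard threshold."""
    seen_hashes, kept = set(), []
    for doc in docs:
        h = exact_key(doc)
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        if any(jaccard(shingles(doc), shingles(k)) >= threshold for k in kept):
            continue
        kept.append(doc)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",   # exact duplicate
    "the quick brown fox jumps over the lazy cat",   # near duplicate
    "an entirely different sentence about curation",
]
print(len(dedupe(docs)))  # 2
```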
Content Processing & Cleaning#
Clean, normalize, and transform text content for high-quality training data.
Identify and remove personally identifiable information (PII)
Fix Unicode issues, standardize spacing, and remove URLs
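A minimal stdlib sketch of the cleaning steps listed above, assuming NFKC normalization and a simple URL regex (both choices are illustrative, not the library's exact behavior):

```python
import re
import unicodedata

# Stdlib sketch of text cleaning: normalize Unicode, strip URLs, and
# standardize spacing. The regex and normalization form are assumptions.

URL_RE = re.compile(r"https?://\S+")

def clean(text):
    # Normalize compatibility characters (e.g. ligatures, fullwidth forms).
    text = unicodedata.normalize("NFKC", text)
    # Remove URLs.
    text = URL_RE.sub("", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

raw = "Visit  https://example.com  for  the  ﬁne  print"
print(clean(raw))  # Visit for the fine print
```

Note how NFKC expands the "ﬁ" ligature into plain "fi", which keeps downstream tokenization consistent.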
Language Management#
Handle multilingual content and language-specific processing requirements.
Identify document languages and separate multilingual datasets
Manage high-frequency words to enhance text extraction and content detection
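The two ideas above connect: high-frequency ("stop") words are also a cheap signal for language identification. The toy sketch below scores each language by stopword hits; the word lists are tiny assumptions, and real pipelines use trained models such as fastText classifiers instead.

```python
# Toy language identification via high-frequency stopwords. The word lists
# are illustrative assumptions; production systems use trained models.

STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "es": {"el", "la", "y", "de", "que"},
    "de": {"der", "die", "und", "ist", "das"},
}

def guess_language(text):
    """Return the language whose stopwords appear most often in the text."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_language("the cat is on the roof"))     # en
print(guess_language("el gato está en el tejado"))  # es
```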
Specialized Processing#
Domain-specific processing for code, bitext, synthetic data, and advanced curation tasks.
Specialized filters for programming content and source code
Filter parallel text for translation quality and alignment
Identify AI-generated or synthetic content in datasets
Remove downstream task data from training datasets
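The last item, downstream-task decontamination, is commonly done by removing training documents that share long word n-grams with evaluation data. The sketch below shows the idea with stdlib Python; the n-gram length is an assumption (real pipelines typically use longer n-grams, around 8 to 13 words).

```python
# Sketch of downstream-task decontamination: drop training documents that
# contain an n-gram from the evaluation set. The n-gram size is an assumption.

def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_docs, eval_docs, n=8):
    """Keep training docs that share no n-gram with any evaluation doc."""
    contaminated = set()
    for doc in eval_docs:
        contaminated |= ngrams(doc, n)
    return [d for d in train_docs if not (ngrams(d, n) & contaminated)]

eval_set = ["what is the capital city of france in europe"]
train = [
    "trivia: what is the capital city of france in europe today",  # leaks eval text
    "a clean document about data curation pipelines",
]
print(len(decontaminate(train, eval_set, n=5)))  # 1
```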