Process Data for Text Curation#

Process text data you’ve loaded into a DocumentDataset.

NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.

How it Works#

NeMo Curator’s text processing capabilities are organized into five main categories:

  1. Quality Assessment & Filtering: Score and remove low-quality content using heuristics and ML classifiers

  2. Deduplication: Remove duplicate and near-duplicate documents efficiently

  3. Content Processing & Cleaning: Clean, normalize, and transform text content

  4. Language Management: Handle multilingual content and language-specific processing

  5. Specialized Processing: Domain-specific processing for code, bitext, and synthetic data

Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.
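The pipeline idea above can be sketched in plain Python: curation is a chain of steps, each taking a set of documents and returning a (smaller or cleaner) set. This is an illustrative sketch only, not the NeMo Curator API; the step and variable names are this example's own.

```python
# Illustrative curation pipeline: each step maps a list of documents
# to a cleaned/filtered list, and steps compose in sequence.

def clean(docs):
    # Normalize whitespace (stand-in for content cleaning).
    return [" ".join(d.split()) for d in docs]

def quality_filter(docs):
    # Keep documents with a minimum amount of content.
    return [d for d in docs if len(d.split()) >= 5]

def deduplicate(docs):
    # Exact-duplicate removal, preserving first-occurrence order.
    seen, out = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

def run_pipeline(docs, steps):
    for step in steps:
        docs = step(docs)
    return docs

docs = [
    "The   quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
    "too short",
]
curated = run_pipeline(docs, [clean, quality_filter, deduplicate])
```

Note the ordering matters: cleaning first makes the two near-identical documents exact duplicates, so the deduplication step can catch them.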


Quality Assessment & Filtering#

Score and remove low-quality content using heuristics and ML classifiers.

  * Heuristic Filtering: Filter text using configurable rules and metrics

  * Classifier-Based Filtering: Filter text using trained quality classifiers

  * Distributed Data Classification: GPU-accelerated classification with pre-trained models

  * Custom Filters: Implement and combine your own custom filters
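Heuristic filtering typically follows a score-then-keep pattern: each filter computes a numeric score for a document, and a threshold decides whether the document survives. The sketch below illustrates that pattern in plain Python; the class names, thresholds, and `apply_filters` helper are this example's own, not the library's API.

```python
class WordCountFilter:
    """Keep documents whose word count falls in a configured range."""
    def __init__(self, min_words=10, max_words=100000):
        self.min_words = min_words
        self.max_words = max_words

    def score_document(self, text):
        return len(text.split())

    def keep_document(self, score):
        return self.min_words <= score <= self.max_words

class SymbolRatioFilter:
    """Reject documents dominated by non-alphanumeric symbols."""
    def __init__(self, max_ratio=0.3):
        self.max_ratio = max_ratio

    def score_document(self, text):
        if not text:
            return 1.0
        symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
        return symbols / len(text)

    def keep_document(self, score):
        return score <= self.max_ratio

def apply_filters(docs, filters):
    # A document is kept only if every filter accepts its score.
    return [
        d for d in docs
        if all(f.keep_document(f.score_document(d)) for f in filters)
    ]
```

Separating scoring from the keep decision lets you log score distributions before committing to thresholds, which is useful when tuning heuristics on a new corpus.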

Deduplication#

Remove duplicate and near-duplicate documents efficiently from your text datasets.

  * Hash-Based Duplicate Removal: Remove exact and fuzzy duplicates using GPU acceleration

  * Semantic Deduplication: Remove semantically similar documents using embeddings
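The core ideas behind hash-based duplicate removal can be sketched as follows: a content hash catches exact duplicates, and Jaccard similarity over word n-grams ("shingles") catches near-duplicates. Production systems approximate this Jaccard comparison at scale with MinHash/LSH on GPUs; this plain-Python version, with its own hypothetical function names and small thresholds, is for intuition only.

```python
import hashlib

def doc_hash(text):
    # Stable content hash for exact-duplicate detection.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def shingles(text, n=3):
    # Word n-grams used to compare documents for near-duplication.
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedup(docs, threshold=0.8):
    kept, hashes = [], set()
    for doc in docs:
        h = doc_hash(doc)
        if h in hashes:
            continue  # exact duplicate of an earlier document
        if any(jaccard(shingles(doc), shingles(k)) >= threshold for k in kept):
            continue  # near-duplicate of a kept document
        hashes.add(h)
        kept.append(doc)
    return kept
```

The pairwise comparison here is O(n²); MinHash signatures and locality-sensitive hashing exist precisely to avoid that cost on web-scale corpora.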

Content Processing & Cleaning#

Clean, normalize, and transform text content for high-quality training data.

  * PII Identification and Removal: Identify and remove personally identifiable information

  * Text Cleaning: Fix Unicode issues, standardize spacing, and remove URLs
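The cleaning steps described above can be illustrated with a short sketch: Unicode normalization, a regex-based email redaction as a toy PII example, URL stripping, and whitespace collapsing. The regexes and function names here are this example's own; real PII removal relies on trained recognizers, not a single regex.

```python
import re
import unicodedata

# Illustrative patterns; production PII detection uses NER models.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"https?://\S+")

def clean_text(text):
    # Normalize Unicode (e.g. fold non-breaking spaces, compose accents).
    text = unicodedata.normalize("NFKC", text)
    # Redact email addresses (toy stand-in for PII removal).
    text = EMAIL_RE.sub("<EMAIL>", text)
    # Strip URLs.
    text = URL_RE.sub("", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

Running normalization first matters: it canonicalizes characters (such as full-width letters or non-breaking spaces) so the later regex passes see consistent input.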

Language Management#

Handle multilingual content and language-specific processing requirements.

  * Language Identification and Unicode Fixing: Identify document languages and separate multilingual datasets

  * Stop Words in Text Processing: Manage high-frequency words to enhance text extraction and content detection
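One way stop words support content detection is the stop-word density heuristic: natural prose contains a high fraction of high-frequency function words, while navigation menus and link lists do not. The tiny stop-word set and threshold below are illustrative; real lists are far larger and maintained per language.

```python
# A deliberately small English stop-word set for illustration.
ENGLISH_STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "was", "at", "by", "this",
}

def stop_word_fraction(text):
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in ENGLISH_STOP_WORDS for w in words) / len(words)

def looks_like_prose(text, threshold=0.2):
    # Blocks below the threshold are likely boilerplate (menus, link lists).
    return stop_word_fraction(text) >= threshold
```

Heuristics of this family are used during text extraction to separate body text from page chrome before any heavier classification runs.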

Specialized Processing#

Domain-specific processing for code, bitext, synthetic data, and advanced curation tasks.

  * Code Filtering: Specialized filters for programming content and source code

  * Bitext Filtering: Filter parallel text for translation quality and alignment

  * Synthetic Text Detection: Identify AI-generated or synthetic content in datasets

  * Downstream Task Decontamination: Remove downstream task data from training datasets
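Task decontamination is commonly implemented with n-gram overlap: build a set of n-grams from benchmark text, then drop training documents that contain any of them. The sketch below uses a small `n` so the example stays readable; real systems hash the n-grams and use longer spans. All names here are this example's own, not the library's API.

```python
def ngrams(text, n):
    # Lowercased word n-grams of a document.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_contamination_index(benchmark_docs, n=4):
    # Union of all benchmark n-grams to test training data against.
    index = set()
    for doc in benchmark_docs:
        index |= ngrams(doc, n)
    return index

def decontaminate(train_docs, index, n=4):
    # Drop any training document sharing an n-gram with the benchmarks.
    return [d for d in train_docs if not (ngrams(d, n) & index)]
```

Choosing `n` trades precision for recall: short n-grams over-match common phrases and remove clean data, while very long ones can miss lightly paraphrased benchmark text.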