Process Data for Text Curation

Process text data you’ve loaded through NeMo Curator’s pipeline architecture.

NeMo Curator provides tools for processing text data as part of the AI training pipeline. Use them to analyze, transform, and filter your text datasets so that language model training receives high-quality input.

How it Works

NeMo Curator’s text processing capabilities are organized into five main categories:

  1. Language Management: Handle multilingual content and language-specific processing
  2. Content Processing & Cleaning: Clean, normalize, and transform text content
  3. Deduplication: Remove duplicate and near-duplicate documents efficiently
  4. Quality Assessment & Filtering: Score and remove low-quality content using heuristics and ML classifiers
  5. Specialized Processing: Domain-specific processing for code and advanced curation tasks

Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.


Language Management

Handle multilingual content and language-specific processing requirements.

Content Processing & Cleaning

Clean, normalize, and transform text content for high-quality training data.
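
As a rough illustration of what cleaning and normalization can involve, the sketch below applies Unicode NFC normalization and collapses redundant whitespace using only the Python standard library. The `clean_text` function is a hypothetical helper written for this example; it is not part of the NeMo Curator API, and the actual cleaning stages may behave differently.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical helper: normalize Unicode and tidy whitespace."""
    # Compose characters into a canonical form (e.g. "e" + combining accent -> "é").
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of spaces/tabs inside each line and trim line edges.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "  Hello   world \n\n\n\nBye  "
cleaned = clean_text(raw)
```

Normalizing before deduplication or filtering tends to help, since visually identical strings otherwise compare as different.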

Deduplication

Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.
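
The split between identification and removal can be pictured with a minimal exact-match sketch in plain Python. It assumes each document is a dict with hypothetical `id` and `text` fields, and the function names are illustrative only; this is not NeMo Curator's implementation, and it does not cover near-duplicate detection.

```python
import hashlib
from collections import defaultdict


def identify_duplicates(docs):
    """Identification step: return ids of documents that repeat an earlier text."""
    groups = defaultdict(list)
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        groups[digest].append(doc["id"])
    # Keep the first document in each hash group; flag the rest as duplicates.
    return {doc_id for ids in groups.values() for doc_id in ids[1:]}


def remove_duplicates(docs, duplicate_ids):
    """Removal step: filter out the documents flagged during identification."""
    return [doc for doc in docs if doc["id"] not in duplicate_ids]


docs = [
    {"id": 0, "text": "hello world"},
    {"id": 1, "text": "a unique document"},
    {"id": 2, "text": "hello world"},
]
dup_ids = identify_duplicates(docs)
deduped = remove_duplicates(docs, dup_ids)
```

Keeping the two steps separate lets you inspect or store the list of duplicate ids before committing to deleting anything.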

Quality Assessment & Filtering

Score and remove low-quality content using heuristics and ML classifiers.
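
To make the score-then-filter idea concrete, here is a small sketch of two heuristic checks in plain Python: a minimum word count and a cap on the share of non-alphabetic characters. The function names and thresholds (`min_words=5`, `max_non_alpha=0.3`) are assumptions chosen for this example, not NeMo Curator defaults or API names.

```python
def word_count(text: str) -> int:
    """Score: number of whitespace-separated tokens."""
    return len(text.split())


def non_alpha_ratio(text: str) -> float:
    """Score: fraction of characters that are neither letters nor whitespace."""
    if not text:
        return 1.0
    non_alpha = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return non_alpha / len(text)


def keep(text: str, min_words: int = 5, max_non_alpha: float = 0.3) -> bool:
    """Filter: keep a document only if it passes both heuristic thresholds."""
    return word_count(text) >= min_words and non_alpha_ratio(text) <= max_non_alpha


samples = [
    "The quick brown fox jumps over the lazy dog.",
    "$$$ ### !!! 12345",
    "Too short",
]
kept = [s for s in samples if keep(s)]
```

Computing scores separately from the keep/drop decision makes it easier to tune thresholds on a sample before filtering a full dataset.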

Specialized Processing

Domain-specific processing for code and advanced curation tasks.