
Copyright © 2026, NVIDIA Corporation.

Process Data for Text Curation


Use NeMo Curator’s pipeline architecture to process the text data you have loaded.

NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.

How it Works

NeMo Curator’s text processing capabilities are organized into five main categories:

  1. Language Management: Handle multilingual content and language-specific processing
  2. Content Processing & Cleaning: Clean, normalize, and transform text content
  3. Deduplication: Remove duplicate and near-duplicate documents efficiently
  4. Quality Assessment & Filtering: Score and remove low-quality content using heuristics and ML classifiers
  5. Specialized Processing: Domain-specific processing for code and advanced curation tasks

Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.


Language Management

Handle multilingual content and language-specific processing requirements.

Language Identification

Identify document languages and separate multilingual datasets

Stop Words

Manage high-frequency words to enhance text extraction and content detection

Content Processing & Cleaning

Clean, normalize, and transform text content for high-quality training data.

Text Cleaning

Fix Unicode issues, standardize spacing, and remove URLs
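
As a rough illustration of what these three cleaning steps involve, the following sketch uses only the Python standard library. It is not NeMo Curator's API: the function name `clean_text`, the URL pattern, and the choice of NFKC normalization as a stand-in for Unicode repair are all assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical, deliberately simple URL pattern for illustration only.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs, and standardize spacing."""
    # NFKC folds compatibility characters (fullwidth letters,
    # non-breaking spaces) into their common equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Remove URLs before collapsing whitespace so no gaps are left behind.
    text = URL_PATTERN.sub("", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line, then trim the document as a whole.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


if __name__ == "__main__":
    raw = "Ｈｅｌｌｏ   world!  See https://example.com/page for more."
    print(clean_text(raw))  # Hello world! See for more.
```

The steps are ordered so that URL removal happens before whitespace is collapsed; reversing them would leave double spaces where a URL used to be.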

Deduplication

Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.

Exact Duplicate Removal

Identify and remove character-for-character duplicates using MD5 hashing
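
The idea behind hash-based exact deduplication can be sketched in a few lines of standard-library Python. This is a minimal single-machine illustration, not NeMo Curator's implementation; the function names and the `dict` of document IDs to text are hypothetical. It separates the two workflows mentioned above: identification, then removal.

```python
import hashlib


def find_exact_duplicates(docs: dict[str, str]) -> list[str]:
    """Return IDs of documents whose text repeats an earlier document."""
    first_seen: dict[str, str] = {}  # MD5 digest -> ID of first document
    duplicates: list[str] = []
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates.append(doc_id)
        else:
            first_seen[digest] = doc_id
    return duplicates


def remove_exact_duplicates(docs: dict[str, str]) -> dict[str, str]:
    """Keep the first occurrence of each distinct text."""
    to_drop = set(find_exact_duplicates(docs))
    return {doc_id: text for doc_id, text in docs.items() if doc_id not in to_drop}


if __name__ == "__main__":
    corpus = {"a": "hello", "b": "world", "c": "hello"}
    print(find_exact_duplicates(corpus))    # ['c']
    print(remove_exact_duplicates(corpus))  # {'a': 'hello', 'b': 'world'}
```

Comparing fixed-size digests rather than full strings keeps memory use proportional to the number of documents instead of their total length.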

Fuzzy Duplicate Removal

Identify and remove near-duplicates using MinHash and LSH similarity
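
To give a feel for how MinHash and LSH fit together, here is a small standard-library sketch. It is not NeMo Curator's implementation or API. The character 5-gram shingling, the 64 hash functions, the 16 bands, and every function name are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def candidate_pairs(docs: dict[str, str], num_hashes: int = 64,
                    bands: int = 16) -> set[tuple[str, str]]:
    """LSH: documents that share any full band of their signature are candidates."""
    rows = num_hashes // bands
    buckets: dict[tuple, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text, num_hashes)
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    corpus = {
        "a": "The quick brown fox jumps over the lazy dog.",
        "b": "The quick brown fox jumped over the lazy dog!",
        "c": "Lorem ipsum dolor sit amet.",
    }
    print(candidate_pairs(corpus))
```

Candidate pairs are only likely near-duplicates. A full workflow would typically confirm each pair, for example by comparing shingle sets directly, before removing anything.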

Semantic Deduplication

Identify and remove semantically similar documents using embeddings and clustering

Quality Assessment & Filtering

Score and remove low-quality content using heuristics and ML classifiers.

Heuristic Filtering

Filter text using configurable rules and metrics
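
One way to picture "configurable rules and metrics" is as a list of metric functions paired with allowed ranges. The sketch below is a hypothetical standard-library illustration, not NeMo Curator's filter classes; the `Rule` dataclass, the two metrics, and the thresholds are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A metric plus the inclusive range of values that counts as passing."""
    name: str
    metric: Callable[[str], float]
    min_value: float = float("-inf")
    max_value: float = float("inf")

    def passes(self, text: str) -> bool:
        return self.min_value <= self.metric(text) <= self.max_value


def word_count(text: str) -> float:
    return float(len(text.split()))


def symbol_ratio(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text)


# Hypothetical thresholds; real values depend on the dataset.
RULES = [
    Rule("word_count", word_count, min_value=5),
    Rule("symbol_ratio", symbol_ratio, max_value=0.3),
]


def keep(text: str, rules: list[Rule] = RULES) -> bool:
    """A document is kept only if it passes every rule."""
    return all(rule.passes(text) for rule in rules)


if __name__ == "__main__":
    print(keep("This is a reasonably clean sentence with enough words."))  # True
    print(keep("too short"))                                               # False
```

Keeping the metric separate from its thresholds means the same score can be computed once and reused, either to filter documents or simply to annotate them.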

Classifier Filtering

Filter text using trained quality classifiers

Distributed Classification

GPU-accelerated classification with pre-trained models

Specialized Processing

Domain-specific processing for code and advanced curation tasks.

Code Processing

Specialized filters for programming content and source code