> Process text data using comprehensive filtering, deduplication, content processing, and specialized tools for high-quality datasets

# Process Data for Text Curation

Process text data you've loaded through NeMo Curator's [pipeline architecture](/about/concepts/text/data/loading).

NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.

## How it Works

NeMo Curator's text processing capabilities are organized into six main categories:

1. **Language Management**: Handle multilingual content and language-specific processing
2. **Content Processing & Cleaning**: Clean, normalize, and transform text content
3. **Deduplication**: Remove duplicate and near-duplicate documents efficiently
4. **Quality Assessment & Filtering**: Score and remove low-quality content using heuristics and ML classifiers
5. **Specialized Processing**: Domain-specific processing for code and advanced curation tasks
6. **Interleaved Datasets**: Read, write, and filter MINT-1T-style image-text datasets

Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.

***

## Language Management

Handle multilingual content and language-specific processing requirements.

Identify document languages and separate multilingual datasets. Tags: `fasttext`, `176-languages`, `detection`.

Manage high-frequency words to enhance text extraction and content detection. Tags: `preprocessing`, `filtering`, `language-specific`.

## Content Processing & Cleaning

Clean, normalize, and transform text content for high-quality training data.

Fix Unicode issues, standardize spacing, and remove URLs. Tags: `unicode`, `normalization`, `preprocessing`.
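
The three cleaning steps named above can be sketched with the Python standard library alone. This is an illustration of the idea, not NeMo Curator's implementation: the `clean_text` function, the URL pattern, and the choice of NFKC normalization as a stand-in for Unicode repair are all assumptions made for this example.

```python
import re
import unicodedata

# Hypothetical patterns for this sketch; a production cleaner would be more careful.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs, and collapse whitespace."""
    # NFKC folds compatibility characters (full-width letters, non-breaking
    # spaces, ligatures) into their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Replace each URL with a space so neighboring words do not fuse together.
    text = URL_PATTERN.sub(" ", text)
    # Collapse any run of whitespace into a single space and trim the ends.
    return WHITESPACE_PATTERN.sub(" ", text).strip()


if __name__ == "__main__":
    raw = "Ｈｅｌｌｏ   world\u00a0see https://example.com/page now"
    print(clean_text(raw))
```

Running normalization before URL removal means that URLs written with compatibility characters are also caught by the pattern.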

## Deduplication

Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.

Identify and remove character-for-character duplicates using MD5 hashing. Tags: `hashing`, `fast`, `gpu-accelerated`.
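
To make the split between identification and removal concrete, here is a minimal single-process sketch of exact deduplication by MD5 hash. It does not use NeMo Curator's API or any GPU acceleration; the function names and the `id`/`text` document layout are assumptions made for illustration.

```python
import hashlib


def find_exact_duplicates(docs):
    """Identification step: return the ids of documents that repeat earlier text."""
    seen_hashes = set()
    duplicate_ids = set()
    for doc in docs:
        # Identical text always yields an identical MD5 digest.
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            duplicate_ids.add(doc["id"])
        else:
            seen_hashes.add(digest)
    return duplicate_ids


def remove_duplicates(docs, duplicate_ids):
    """Removal step: keep only documents whose id was not flagged."""
    return [doc for doc in docs if doc["id"] not in duplicate_ids]


if __name__ == "__main__":
    corpus = [
        {"id": 0, "text": "alpha"},
        {"id": 1, "text": "beta"},
        {"id": 2, "text": "alpha"},
    ]
    flagged = find_exact_duplicates(corpus)
    print(flagged, remove_duplicates(corpus, flagged))
```

Keeping the two steps separate lets you inspect or store the flagged ids before anything is dropped from the dataset.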

Identify and remove near-duplicates using MinHash and LSH similarity. Tags: `minhash`, `lsh`, `gpu-accelerated`.
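
The MinHash and LSH idea can be illustrated in pure Python. This sketch is a toy, CPU-only version under assumed parameters (character 5-grams, 32 hash functions, 8 bands); it is not how NeMo Curator implements fuzzy deduplication, and every function name here is hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Split text into a set of overlapping character k-grams."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(shingle_set, num_hashes=32):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def candidate_pairs(signatures, bands=8):
    """LSH banding: documents sharing any full band land in the same bucket."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog",
        "c": "0123456789 9876543210",
    }
    sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
    print(candidate_pairs(sigs))
```

Candidate pairs are only suspects: a real pipeline would typically verify each pair (for example by comparing signatures or shingle sets directly) before removing anything. More bands with fewer rows each make the scheme more permissive; fewer bands with more rows make it stricter.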

Identify and remove semantically similar documents using embeddings and clustering. Tags: `embeddings`, `meaning-based`, `gpu-accelerated`.
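
As a rough illustration of the similarity check behind semantic deduplication, the sketch below flags documents whose embedding is very close to one already kept. It makes several simplifying assumptions: the embeddings are precomputed toy vectors, the clustering step is skipped so all documents are treated as one cluster, and the function names and the 0.95 threshold are hypothetical rather than taken from NeMo Curator.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_semantic_duplicates(embeddings, threshold=0.95):
    """Greedy pass: flag a document if it is too similar to any kept document."""
    kept = []
    duplicates = set()
    for doc_id, vector in embeddings.items():
        if any(cosine_similarity(vector, embeddings[k]) >= threshold for k in kept):
            duplicates.add(doc_id)
        else:
            kept.append(doc_id)
    return duplicates


if __name__ == "__main__":
    toy_embeddings = {
        "a": [1.0, 0.0],
        "b": [0.99, 0.05],  # nearly parallel to "a"
        "c": [0.0, 1.0],    # orthogonal to "a"
    }
    print(find_semantic_duplicates(toy_embeddings))
```

Comparing every document against every kept document is quadratic, which is why clustering first and comparing only within clusters matters at scale.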

## Quality Assessment & Filtering

Score and remove low-quality content using heuristics and ML classifiers.

Filter text using configurable rules and metrics. Tags: `rules`, `metrics`, `fast`.
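
A rule-based filter usually pairs a metric with a threshold. The sketch below shows that pattern with two simple metrics, word count and symbol ratio. The metric choices, default thresholds, and function names are assumptions made for illustration and are not NeMo Curator's built-in filters.

```python
def word_count(text):
    """Metric: number of whitespace-separated tokens."""
    return len(text.split())


def symbol_ratio(text):
    """Metric: fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text)


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.25):
    """Rule: keep a document only if every metric is within its threshold."""
    return word_count(text) >= min_words and symbol_ratio(text) <= max_symbol_ratio


if __name__ == "__main__":
    samples = [
        "This is a perfectly ordinary sentence about data.",
        "$$$ ### !!! @@@ %%% ^^^",
        "too short",
    ]
    for sample in samples:
        print(passes_heuristics(sample), repr(sample))
```

Exposing the thresholds as parameters keeps the rules configurable per dataset, since a sensible minimum length for web pages may be wrong for chat logs or code.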

Filter text using trained quality classifiers. Tags: `ml-models`, `quality`, `scoring`.

Run GPU-accelerated classification with pre-trained models. Tags: `gpu`, `distributed`, `scalable`.

## Specialized Processing

Domain-specific processing for code and advanced curation tasks.

Apply specialized filters for programming content and source code. Tags: `programming`, `syntax`, `comments`.

## Interleaved Datasets

Read, write, and filter MINT-1T-style image-text interleaved datasets across WebDataset and Parquet formats.

Use round-trip readers and writers to convert between WebDataset tar shards and Parquet. Tags: `parquet`, `webdataset`, `schema-utilities`.

Apply sample-level filters for image quality, QR-code detection, CLIP alignment, and image-to-text ratio. Tags: `blur`, `clip`, `qr-detection`.