---
description: >-
  Process text data using comprehensive filtering, deduplication, content
  processing, and specialized tools for high-quality datasets
categories:
  - workflows
tags:
  - data-processing
  - filtering
  - deduplication
  - content-processing
  - quality-assessment
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: workflow
modality: text-only
---
# Process Data for Text Curation
Process text data you've loaded through NeMo Curator's [pipeline architecture](/about/concepts/text/data/loading).
NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.
## How it Works
NeMo Curator's text processing capabilities are organized into five main categories:
1. **Language Management**: Handle multilingual content and language-specific processing
2. **Content Processing & Cleaning**: Clean, normalize, and transform text content
3. **Deduplication**: Remove duplicate and near-duplicate documents efficiently
4. **Quality Assessment & Filtering**: Score and remove low-quality content using heuristics and ML classifiers
5. **Specialized Processing**: Domain-specific processing for code and advanced curation tasks
Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.
***
## Language Management
Handle multilingual content and language-specific processing requirements.
- **Language Identification**: Identify document languages and separate multilingual datasets (`fasttext`, `176-languages`, `detection`)
- **Stop Words**: Manage high-frequency words to enhance text extraction and content detection (`preprocessing`, `filtering`, `language-specific`)
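The split-by-language workflow can be sketched with a toy detector. NeMo Curator's language identification is fastText-based (per the tags above); this stdlib-only stand-in scores stop-word overlap instead, and `detect_language` / `split_by_language` are illustrative names, not library APIs.

```python
# Toy language identification by stop-word overlap. This is NOT the
# fastText model used in practice -- only an illustration of the
# "identify, then separate" multilingual workflow.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein"},
    "es": {"el", "la", "de", "que", "y", "en", "los"},
}

def detect_language(text: str) -> str:
    """Return the language whose stop words overlap the text most."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

def split_by_language(docs):
    """Bucket documents by detected language, as a dataset splitter would."""
    buckets = {}
    for doc in docs:
        buckets.setdefault(detect_language(doc), []).append(doc)
    return buckets
```

A real detector replaces the stop-word score with a trained classifier covering many more languages, but the surrounding bucketing logic is the same.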
## Content Processing & Cleaning
Clean, normalize, and transform text content for high-quality training data.
- **Text Cleaning**: Fix Unicode issues, standardize spacing, and remove URLs (`unicode`, `normalization`, `preprocessing`)
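The three cleaning operations above can be sketched with the standard library alone; NeMo Curator ships its own configurable cleaning modifiers, so `clean_text` here is an illustrative stand-in, not a library function.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")

def clean_text(text: str) -> str:
    """Normalize Unicode, strip URLs, and standardize whitespace."""
    text = unicodedata.normalize("NFKC", text)  # fold ligatures, width variants, etc.
    text = URL_RE.sub("", text)                 # drop http(s) URLs
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text
```

NFKC normalization maps compatibility characters (for example the "ﬁ" ligature) to their plain equivalents, which keeps tokenizers from treating visually identical text as distinct.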
## Deduplication
Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.
- **Exact Deduplication**: Identify and remove character-for-character duplicates using MD5 hashing (`hashing`, `fast`, `gpu-accelerated`)
- **Fuzzy Deduplication**: Identify and remove near-duplicates using MinHash and LSH similarity (`minhash`, `lsh`, `gpu-accelerated`)
- **Semantic Deduplication**: Identify and remove semantically similar documents using embeddings and clustering (`embeddings`, `meaning-based`, `gpu-accelerated`)
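The first two methods can be illustrated in miniature. NeMo Curator's implementations are GPU-accelerated and distributed; this CPU-only sketch shows the underlying ideas, and the helper names (`exact_dedup`, `minhash_signature`, and so on) are illustrative, not library APIs.

```python
import hashlib
import random

def exact_dedup(docs):
    """Keep the first copy of each document, comparing MD5 digests."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, k=3):
    """Overlapping k-word shingles; the unit MinHash compares."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_perm=128, seed=0):
    """One min-hash per salted hash function; similar sets share many mins."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_perm)]
    return [min(hash((salt, item)) for item in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a full fuzzy-dedup pipeline, LSH then bands these signatures so that near-duplicate pairs land in the same bucket without all-pairs comparison.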
## Quality Assessment & Filtering
Score and remove low-quality content using heuristics and ML classifiers.
- **Heuristic Filtering**: Filter text using configurable rules and metrics (`rules`, `metrics`, `fast`)
- **Classifier Filtering**: Filter text using trained quality classifiers (`ml-models`, `quality`, `scoring`)
- **Distributed Data Classification**: GPU-accelerated classification with pre-trained models (`gpu`, `distributed`, `scalable`)
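Heuristic filtering is the simplest of the three to sketch: score a document against configurable rules and drop it if any rule fails. The function and thresholds below are illustrative assumptions, not NeMo Curator's filters or defaults.

```python
def heuristic_quality_filter(doc,
                             min_words=50,
                             max_symbol_ratio=0.1,
                             max_repeated_line_ratio=0.3):
    """Return True if the document passes simple quality heuristics."""
    words = doc.split()
    if len(words) < min_words:                       # too short to be useful
        return False
    symbols = sum(1 for ch in doc if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(doc), 1) > max_symbol_ratio:  # likely markup/noise
        return False
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    if lines:
        repeated = len(lines) - len(set(lines))
        if repeated / len(lines) > max_repeated_line_ratio:  # boilerplate
            return False
    return True
```

Classifier filtering replaces these hand-tuned rules with a trained model's quality score, and the distributed variant runs that scoring across GPUs for large corpora.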
## Specialized Processing
Domain-specific processing for code and advanced curation tasks.
- **Code Filtering**: Specialized filters for programming content and source code (`programming`, `syntax`, `comments`)
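Code filters typically score signals like comment density and line length, since files with no comments or with extremely long lines are often generated or minified. The sketch below assumes Python-style `#` comments and illustrative thresholds; the function names are not NeMo Curator APIs.

```python
def code_quality_stats(source: str) -> dict:
    """Compute simple signals a code filter might use."""
    lines = source.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]
    comment_lines = [ln for ln in nonblank if ln.lstrip().startswith("#")]
    return {
        "comment_ratio": len(comment_lines) / max(len(nonblank), 1),
        "max_line_length": max((len(ln) for ln in lines), default=0),
    }

def keep_code_file(source, min_comment_ratio=0.01, max_line_length=1000):
    """Drop uncommented files and generated-looking files with huge lines."""
    stats = code_quality_stats(source)
    return (stats["comment_ratio"] >= min_comment_ratio
            and stats["max_line_length"] <= max_line_length)
```

A production code filter would add per-language comment syntax, alphanumeric-character ratios, and other signals, but the score-then-threshold shape stays the same.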