
> Process text data with language management, translation, filtering, deduplication, content processing, and specialized tools to produce high-quality datasets

# Process Data for Text Curation

Process text data you've loaded through NeMo Curator's [pipeline architecture](/about/concepts/text/data/loading).

NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.

## How it Works

NeMo Curator's text processing capabilities are organized into six main categories:

1. **Language Management**: Handle multilingual content, translation, and language-specific processing
2. **Content Processing & Cleaning**: Clean, normalize, and transform text content
3. **Deduplication**: Remove duplicate and near-duplicate documents efficiently
4. **Quality Assessment & Filtering**: Score and remove low-quality content using heuristics and ML classifiers
5. **Specialized Processing**: Domain-specific processing for code and advanced curation tasks
6. **Interleaved Datasets**: Read, write, and filter MINT-1T-style image-text datasets

Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.

***

## Language Management

Handle multilingual content, translation, and language-specific processing requirements.

<Cards>
  <Card title="Language Identification" href="/curate-text/process-data/language-management/language">
    Identify document languages and separate multilingual datasets
    fasttext
    176-languages
    detection
  </Card>

  <Card title="Stop Words" href="/curate-text/process-data/language-management/stopwords">
    Manage high-frequency words to enhance text extraction and content detection
    preprocessing
    filtering
    language-specific
  </Card>

  <Card title="Translation" href="/curate-text/process-data/language-management/translation">
    Translate flat or structured text fields with optional FAITH and round-trip evaluation
    translation
    experimental
    wildcard-fields
    faith
  </Card>
</Cards>
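As a toy illustration of how stop-word overlap relates to language detection (NeMo Curator's detector uses fastText's trained 176-language model, not lookups like this), here is a minimal sketch; the word lists and the regex are illustrative assumptions:

```python
import re

# Tiny hand-picked stop-word lists for illustration only; a real detector
# such as fastText's lid.176 model uses a trained classifier instead.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "es": {"el", "la", "y", "de", "que", "en"},
    "de": {"der", "die", "und", "ist", "von", "das"},
}

def detect_language(text: str) -> str:
    """Score each language by stop-word overlap and return the best match."""
    tokens = set(re.findall(r"[a-zäöüß]+", text.lower()))
    scores = {lang: len(tokens & words) for lang, words in STOP_WORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("the cat is in the house"))  # en
```

Stop words work as a signal precisely because they are high-frequency and language-specific, which is also why the Stop Words page treats them as useful for content detection.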

## Content Processing & Cleaning

Clean, normalize, and transform text content for high-quality training data.

<Cards>
  <Card title="Text Cleaning" href="/curate-text/process-data/content-processing/text-cleaning">
    Fix Unicode issues, standardize spacing, and remove URLs
    unicode
    normalization
    preprocessing
  </Card>
</Cards>
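The three operations the Text Cleaning card names — Unicode fixes, spacing standardization, and URL removal — can be sketched with the standard library alone; the exact normalization form and URL pattern here are illustrative choices, not NeMo Curator defaults:

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")

def clean_text(text: str) -> str:
    """Normalize Unicode, strip bare URLs, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. "ﬁ" ligature -> "fi"
    text = URL_RE.sub("", text)                 # drop http(s) URLs
    text = re.sub(r"\s+", " ", text).strip()    # standardize spacing
    return text

print(clean_text("ﬁx  this\u00a0text  see https://example.com now"))
# fix this text see now
```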

## Deduplication

Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.

<Cards>
  <Card title="Exact Duplicate Removal" href="/curate-text/process-data/deduplication/exact">
    Identify and remove character-for-character duplicates using MD5 hashing
    hashing
    fast
    gpu-accelerated
  </Card>

  <Card title="Fuzzy Duplicate Removal" href="/curate-text/process-data/deduplication/fuzzy">
    Identify and remove near-duplicates using MinHash and LSH similarity
    minhash
    lsh
    gpu-accelerated
  </Card>

  <Card title="Semantic Deduplication" href="/curate-text/process-data/deduplication/semdedup">
    Identify and remove semantically similar documents using embeddings and clustering
    embeddings
    meaning-based
    gpu-accelerated
  </Card>
</Cards>
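The first two cards can be sketched in plain Python: exact deduplication keys documents on an MD5 hash of their text, while MinHash approximates Jaccard similarity between shingle sets so near-duplicates can be found without pairwise comparison. The signature size and shingle length below are illustrative, and a real pipeline would add LSH banding and GPU acceleration on top:

```python
import hashlib

def exact_dedup(docs):
    """Keep the first document seen for each MD5 hash of the raw text."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def minhash_signature(text, num_perm=64, shingle=3):
    """Summarize a document's character shingles by the minimum hash per seed."""
    shingles = {text[i:i + shingle] for i in range(len(text) - shingle + 1)}
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = ["hello world", "hello world", "hello there world"]
print(len(exact_dedup(docs)))  # 2
```

Identical texts always produce identical signatures, so exact duplicates score 1.0; near-duplicates score high but below 1.0, which is what LSH bucketing exploits at scale.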

## Quality Assessment & Filtering

Score and remove low-quality content using heuristics and ML classifiers.

<Cards>
  <Card title="Heuristic Filtering" href="/curate-text/process-data/quality-assessment/heuristic">
    Filter text using configurable rules and metrics
    rules
    metrics
    fast
  </Card>

  <Card title="Classifier Filtering" href="/curate-text/process-data/quality-assessment/classifier">
    Filter text using trained quality classifiers
    ml-models
    quality
    scoring
  </Card>

  <Card title="Distributed Classification" href="/curate-text/process-data/quality-assessment/distributed-classifier">
    GPU-accelerated classification with pre-trained models
    gpu
    distributed
    scalable
  </Card>
</Cards>
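Heuristic filtering boils down to computing cheap rule-based metrics per document and thresholding them. A minimal sketch, where the metric names, symbol set, and thresholds are illustrative assumptions rather than NeMo Curator's defaults:

```python
def word_count(text: str) -> int:
    return len(text.split())

def symbol_to_word_ratio(text: str, symbols: str = "#{}<>|") -> float:
    """Count of symbol characters per word; high values suggest markup debris."""
    return sum(text.count(c) for c in symbols) / max(word_count(text), 1)

def passes_heuristics(text: str, min_words: int = 5,
                      max_symbol_ratio: float = 0.5) -> bool:
    """Return True if the document clears the rule-based thresholds."""
    return (word_count(text) >= min_words
            and symbol_to_word_ratio(text) <= max_symbol_ratio)

print(passes_heuristics("this is a perfectly ordinary sentence"))  # True
```

Classifier filtering follows the same score-then-threshold shape, but replaces the hand-written metrics with a trained model's quality score.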

## Specialized Processing

Domain-specific processing for code and advanced curation tasks.

<Cards>
  <Card title="Code Processing" href="/curate-text/process-data/specialized-processing/code">
    Specialized filters for programming content and source code
    programming
    syntax
    comments
  </Card>
</Cards>
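One common signal for code quality filters is the ratio of comment lines to code lines, since both heavily commented and comment-free files can indicate low-value content. A minimal sketch under that assumption (the marker set and the metric itself are illustrative, not NeMo Curator's actual filters):

```python
def comment_ratio(source: str, markers: tuple = ("#",)) -> float:
    """Fraction of non-empty lines that start with a comment marker."""
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.startswith(markers) for line in lines) / len(lines)

sample = "# add two numbers\ndef add(a, b):\n    return a + b\n"
print(comment_ratio(sample))  # one comment line out of three
```

A filter would then keep files whose ratio falls inside a configured band, analogous to the heuristic thresholds used for natural-language text.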

## Interleaved Datasets

Read, write, and filter MINT-1T-style image-text interleaved datasets across WebDataset and Parquet formats.

<Cards>
  <Card title="Interleaved IO" href="/curate-text/process-data/interleaved/io">
    Round-trip readers and writers between WebDataset tar shards and Parquet
    parquet
    webdataset
    schema-utilities
  </Card>

  <Card title="Interleaved Filters" href="/curate-text/process-data/interleaved/filters">
    Sample-level filters for image quality, QR-code detection, CLIP alignment, and image-to-text ratio
    blur
    clip
    qr-detection
  </Card>
</Cards>
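The image-to-text ratio filter mentioned above can be sketched at the sample level. The sample layout here — a `"texts"` list where image slots are `None` and text slots are strings — is a hypothetical simplification of the MINT-1T interleaved format, and the threshold is an illustrative assumption:

```python
def image_to_text_ratio(sample: dict) -> float:
    """Images per text token in a simplified interleaved sample.

    Assumes sample["texts"] interleaves strings (text) with None (image
    slots), a hypothetical layout used here for illustration only.
    """
    texts = sample["texts"]
    n_images = sum(t is None for t in texts)
    n_tokens = sum(len(t.split()) for t in texts if t is not None)
    return n_images / max(n_tokens, 1)

def keep_sample(sample: dict, max_ratio: float = 0.1) -> bool:
    """Drop samples that carry too many images for their amount of text."""
    return image_to_text_ratio(sample) <= max_ratio

sample = {"texts": [None, "a photo of a cat sitting on a windowsill", None]}
print(keep_sample(sample))  # False: two images against eight tokens
```

The other interleaved filters (blur, QR-code detection, CLIP alignment) follow the same pattern: compute a per-sample score, then threshold it.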