---
description: >-
  Process text data using comprehensive filtering, deduplication, content
  processing, and specialized tools for high-quality datasets
categories:
  - workflows
tags:
  - data-processing
  - filtering
  - deduplication
  - content-processing
  - quality-assessment
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: workflow
modality: text-only
---

# Process Data for Text Curation

Process text data you've loaded through NeMo Curator's [pipeline architecture](/about/concepts/text/data/loading).

NeMo Curator provides a comprehensive suite of tools for processing text data as part of the AI training pipeline. These tools help you analyze, transform, and filter your text datasets to ensure high-quality input for language model training.

## How it Works

NeMo Curator's text processing capabilities are organized into five main categories:

1. **Language Management**: Handle multilingual content and language-specific processing
2. **Content Processing & Cleaning**: Clean, normalize, and transform text content
3. **Deduplication**: Remove duplicate and near-duplicate documents efficiently
4. **Quality Assessment & Filtering**: Score and remove low-quality content using heuristics and ML classifiers
5. **Specialized Processing**: Domain-specific processing for code and advanced curation tasks

Each category provides specific implementations optimized for different curation needs. The result is a cleaned and filtered dataset ready for model training.
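Conceptually, these categories chain together as dataset-in, dataset-out stages. The sketch below illustrates that composition in plain Python; it is not NeMo Curator's actual API, and the stage names (`clean`, `dedupe`, `drop_short`) are hypothetical stand-ins for the modules described on the pages linked from this section.

```python
# Illustrative sketch of chaining curation stages; NOT NeMo Curator's API.
# Each stage maps a list of documents to a (possibly smaller) list.
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]

def sequential(*stages: Stage) -> Stage:
    """Compose stages so each consumes the previous stage's output."""
    def run(docs: List[str]) -> List[str]:
        for stage in stages:
            docs = stage(docs)
        return docs
    return run

# Hypothetical stages standing in for cleaning, dedup, and filtering.
clean = lambda docs: [d.strip() for d in docs]
dedupe = lambda docs: list(dict.fromkeys(docs))            # keep first occurrence
drop_short = lambda docs: [d for d in docs if len(d.split()) >= 3]

pipeline = sequential(clean, dedupe, drop_short)
print(pipeline(["  the cat sat  ", "the cat sat", "hi"]))  # → ['the cat sat']
```

The order matters in practice: cleaning before deduplication lets documents that differ only in whitespace or encoding hash to the same value.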

***

## Language Management

Handle multilingual content and language-specific processing requirements.

<Cards>
  <Card title="Language Identification" href="/curate-text/process-data/language-management/language">
    Identify document languages and separate multilingual datasets
    fasttext
    176-languages
    detection
  </Card>

  <Card title="Stop Words" href="/curate-text/process-data/language-management/stopwords">
    Manage high-frequency words to enhance text extraction and content detection
    preprocessing
    filtering
    language-specific
  </Card>
</Cards>
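One way stop words support content detection is as a density signal: natural-language prose in a given language contains a high fraction of that language's stop words, while boilerplate or text in another language does not. The following is a minimal illustration of that idea, not NeMo Curator's stop-word handling; the word list and function name are examples.

```python
# Illustrative stop-word density check; NOT NeMo Curator's API.
# A high fraction of known stop words suggests natural-language prose
# in that language; a low fraction suggests boilerplate or another language.
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def stopword_fraction(text: str, stopwords: set) -> float:
    """Return the fraction of whitespace-separated tokens that are stop words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)

print(stopword_fraction("the cat sat in the hat", ENGLISH_STOPWORDS))  # → 0.5
```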

## Content Processing & Cleaning

Clean, normalize, and transform text content for high-quality training data.

<Cards>
  <Card title="Text Cleaning" href="/curate-text/process-data/content-processing/text-cleaning">
    Fix Unicode issues, standardize spacing, and remove URLs
    unicode
    normalization
    preprocessing
  </Card>
</Cards>
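The three cleaning operations named above (Unicode fixes, spacing standardization, URL removal) can be sketched with the standard library alone. This is an illustrative pass, not NeMo Curator's text-cleaning module; the regexes and function name are assumptions.

```python
# Illustrative text-cleaning pass: Unicode normalization, URL removal,
# and whitespace standardization. NOT NeMo Curator's API.
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")

def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # fold fullwidth/compatibility chars
    text = URL_RE.sub("", text)                 # strip URLs
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text

print(clean_text("Ｆｕｌｌｗｉｄｔｈ  text  see https://example.com now"))
# → Fullwidth text see now
```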

## Deduplication

Remove duplicate and near-duplicate documents efficiently from your text datasets. All deduplication methods support both identification (finding duplicates) and removal (filtering them out) workflows.

<Cards>
  <Card title="Exact Duplicate Removal" href="/curate-text/process-data/deduplication/exact">
    Identify and remove character-for-character duplicates using MD5 hashing
    hashing
    fast
    gpu-accelerated
  </Card>

  <Card title="Fuzzy Duplicate Removal" href="/curate-text/process-data/deduplication/fuzzy">
    Identify and remove near-duplicates using MinHash and LSH similarity
    minhash
    lsh
    gpu-accelerated
  </Card>

  <Card title="Semantic Deduplication" href="/curate-text/process-data/deduplication/semdedup">
    Identify and remove semantically similar documents using embeddings and clustering
    embeddings
    meaning-based
    gpu-accelerated
  </Card>
</Cards>
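The exact-duplicate approach described above reduces to hashing each document's content and keeping the first document seen per hash. Here is a minimal pure-Python sketch of that idea using MD5 (the hash the exact-duplicate stage uses); it is not NeMo Curator's API, which runs distributed and GPU-accelerated.

```python
# Illustrative exact deduplication via MD5 content hashes; NOT NeMo Curator's API.
import hashlib

def remove_exact_duplicates(docs):
    """Keep the first occurrence of each character-for-character duplicate."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["hello world", "goodbye", "hello world"]
print(remove_exact_duplicates(docs))  # → ['hello world', 'goodbye']
```

Fuzzy and semantic deduplication replace the hash with MinHash/LSH signatures and embedding clusters respectively, so that near-matches land in the same bucket.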

## Quality Assessment & Filtering

Score and remove low-quality content using heuristics and ML classifiers.

<Cards>
  <Card title="Heuristic Filtering" href="/curate-text/process-data/quality-assessment/heuristic">
    Filter text using configurable rules and metrics
    rules
    metrics
    fast
  </Card>

  <Card title="Classifier Filtering" href="/curate-text/process-data/quality-assessment/classifier">
    Filter text using trained quality classifiers
    ml-models
    quality
    scoring
  </Card>

  <Card title="Distributed Classification" href="/curate-text/process-data/quality-assessment/distributed-classifier">
    GPU-accelerated classification with pre-trained models
    gpu
    distributed
    scalable
  </Card>
</Cards>
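Heuristic filtering boils down to computing cheap per-document metrics and dropping documents outside configured bounds. The sketch below shows two common rules of this kind (word count and non-alphabetic character ratio); the thresholds are arbitrary examples, and none of this is NeMo Curator's filter API.

```python
# Illustrative heuristic quality filters; thresholds are example values,
# NOT NeMo Curator defaults.
def word_count_ok(text: str, min_words: int = 5, max_words: int = 100_000) -> bool:
    n = len(text.split())
    return min_words <= n <= max_words

def symbol_ratio_ok(text: str, max_ratio: float = 0.3) -> bool:
    """Reject text dominated by punctuation, digits, or other symbols."""
    if not text:
        return False
    non_alpha = sum(1 for c in text if not (c.isalpha() or c.isspace()))
    return non_alpha / len(text) <= max_ratio

def passes_heuristics(text: str) -> bool:
    return word_count_ok(text) and symbol_ratio_ok(text)

print(passes_heuristics("A normal, readable sentence with several words."))  # → True
print(passes_heuristics("$$$ ### !!! ***"))                                  # → False
```

Classifier-based filtering follows the same score-then-threshold pattern, but replaces hand-written metrics with a trained model's quality score.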

## Specialized Processing

Domain-specific processing for code and advanced curation tasks.

<Cards>
  <Card title="Code Processing" href="/curate-text/process-data/specialized-processing/code">
    Specialized filters for programming content and source code
    programming
    syntax
    comments
  </Card>
</Cards>
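Code filters typically score structural properties of source files, such as how much of a file is comments versus executable lines. The sketch below illustrates one such metric for Python-style `#` comments; the 0.8 cutoff and function names are hypothetical examples, not NeMo Curator's code filters.

```python
# Illustrative code-quality heuristic: comment-to-code line ratio for
# Python-style source. The 0.8 cutoff is an example, NOT a NeMo Curator default.
def comment_ratio(source: str) -> float:
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for ln in lines if ln.startswith("#"))
    return comments / len(lines)

def looks_like_real_code(source: str, max_comment_ratio: float = 0.8) -> bool:
    # Files that are almost entirely comments are often license stubs or
    # generated boilerplate rather than useful training examples.
    return comment_ratio(source) <= max_comment_ratio

sample = "# add two numbers\ndef add(a, b):\n    return a + b\n"
print(comment_ratio(sample))  # → 0.3333...
```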
