---
description: >-
  Comprehensive overview of deduplication techniques across text, image, and
  video modalities including exact, fuzzy, and semantic approaches
categories:
  - concepts-architecture
tags:
  - deduplication
  - exact-dedup
  - fuzzy-dedup
  - semantic-dedup
  - multimodal
  - gpu-accelerated
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: concept
modality: multimodal
---

# Deduplication Concepts

This guide covers deduplication techniques available across all modalities in NeMo Curator, from exact hash-based matching to semantic similarity detection using embeddings.

## Overview

Deduplication is a critical step in data curation that removes duplicate and near-duplicate content to improve model training efficiency. NeMo Curator provides sophisticated deduplication capabilities that work across the text, image, and video modalities.

Removing duplicates offers several benefits:

* **Improved Training Efficiency**: Prevents overrepresentation of repeated content
* **Reduced Dataset Size**: Significantly reduces storage and processing requirements
* **Better Model Performance**: Eliminates redundant examples that can bias training

## Deduplication Approaches

NeMo Curator implements three main deduplication strategies, each with different strengths and use cases:

### Exact Deduplication

* **Method**: Hash-based matching (MD5)
* **Best For**: Identical copies and character-for-character matches
* **Speed**: Very fast
* **Scale**: Unlimited size
* **GPU Required**: Yes (for distributed processing)

Exact deduplication identifies documents or media files that are completely identical by computing cryptographic hashes of their content.

**Modalities Supported**: Text
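
The core mechanism can be sketched in a few lines of plain Python. This is an illustration only, not Curator's implementation, which distributes the hashing and duplicate removal across a cluster:

```python
import hashlib

def exact_dedup(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document, matched by MD5 hash."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in documents:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:  # first time this exact byte content appears
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["hello world", "hello world", "Hello World"]
print(exact_dedup(docs))  # ['hello world', 'Hello World'] -- case differences survive
```

Because matching is byte-exact, even a single changed character produces a different hash, which is why near-duplicates require the fuzzy or semantic approaches below.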

### Fuzzy Deduplication

* **Method**: MinHash and Locality-Sensitive Hashing (LSH)
* **Best For**: Near-duplicates with minor changes (reformatting, small edits)
* **Speed**: Fast
* **Scale**: Up to petabyte scale
* **GPU Required**: Yes

Fuzzy deduplication uses statistical fingerprinting to identify content that is nearly identical but may have small variations like formatting changes or minor edits.

**Modalities Supported**: Text
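
A toy version of the MinHash/LSH idea, built on character shingles and stdlib hashing, shows why near-duplicates land in the same bucket. This sketch is illustrative only; Curator's GPU implementation, shingling scheme, and parameters differ:

```python
import hashlib
from itertools import combinations

def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams act as the document's feature set."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(shingle_set: set[str], num_hashes: int = 64) -> list[int]:
    """One minimum per seeded hash function; matching mins approximate Jaccard similarity."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def lsh_candidate_pairs(signatures: dict[str, list[int]], bands: int = 32) -> set[tuple[str, str]]:
    """Documents whose signatures agree on any band become candidate duplicate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets: dict[tuple, list[str]] = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog!",  # near-duplicate of "a"
    "c": "completely different text about deduplication",
}
sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
print(lsh_candidate_pairs(sigs))  # "a" and "b" collide in some band; "c" does not
```

The banding trade-off is the key design lever: more rows per band makes a collision stricter (fewer false positives), while more bands gives near-duplicates more chances to collide (fewer false negatives).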

### Semantic Deduplication

* **Method**: Embedding-based similarity using neural networks
* **Best For**: Content with similar meaning but different expression
* **Speed**: Moderate (due to embedding generation step)
* **Scale**: Up to terabyte scale
* **GPU Required**: Yes

Semantic deduplication leverages deep learning embeddings to identify content that conveys similar meaning despite using different words, visual elements, or presentation.

**Modalities Supported**: Text, Image, Video
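
In its simplest form, the idea reduces to thresholding cosine similarity between embedding vectors. The brute-force pairwise loop and 3-d toy vectors below are for illustration only; in practice Curator first clusters embeddings so that pairwise similarity is computed only within clusters, and useful thresholds depend on the embedding model:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_duplicates(embeddings: dict[str, list[float]], threshold: float = 0.9):
    """Return id pairs whose embedding cosine similarity meets the threshold."""
    ids = sorted(embeddings)
    return [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]

# Toy 3-d "embeddings": doc1 and doc2 point in nearly the same direction.
emb = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.85, 0.15, 0.05],
    "doc3": [0.0, 0.1, 0.9],
}
print(semantic_duplicates(emb, threshold=0.9))  # [('doc1', 'doc2')]
```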

## Multimodal Applications

### Text Deduplication

Text deduplication is the most mature implementation, offering all three approaches:

* **Exact**: Remove identical documents using MD5 hashing
* **Fuzzy**: Remove near-duplicates using MinHash and LSH similarity
* **Semantic**: Remove semantically similar content using embeddings

Text deduplication can handle web-scale datasets and is commonly used for:

* Web crawl data (Common Crawl)
* Academic papers (ArXiv)
* Code repositories
* General text corpora

### Video Deduplication

Video deduplication uses the semantic deduplication workflow with video embeddings:

* **Semantic Clustering**: Uses the general K-means clustering workflow on video embeddings
* **Pairwise Similarity**: Computes within-cluster similarity using the semantic deduplication pipeline
* **Representative Selection**: Leverages the semantic workflow to identify and remove redundant content
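
The cluster-then-compare pattern above can be sketched with a tiny Lloyd's k-means over invented 2-d "clip embeddings". This is a conceptual toy, not Curator's GPU implementation; real video embeddings have hundreds of dimensions:

```python
import math
import random

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def kmeans(points: list[list[float]], k: int, iters: int = 10, seed: int = 0) -> list[int]:
    """Tiny Lloyd's k-means (Euclidean) returning a cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def within_cluster_duplicates(ids, embeddings, k=2, threshold=0.95):
    """Pairwise similarity is computed only inside each cluster, never globally."""
    assign = kmeans(embeddings, k)
    dupes = []
    for c in set(assign):
        idx = [i for i in range(len(ids)) if assign[i] == c]
        for x in range(len(idx)):
            for y in range(x + 1, len(idx)):
                if cosine(embeddings[idx[x]], embeddings[idx[y]]) >= threshold:
                    dupes.append((ids[idx[x]], ids[idx[y]]))
    return dupes

# Invented 2-d "clip embeddings": the two news clips point the same way.
clip_ids = ["news_a", "news_b", "cartoon"]
clip_emb = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(within_cluster_duplicates(clip_ids, clip_emb))  # [('news_a', 'news_b')]
```

Clustering first is what makes the pairwise step tractable: similarity is computed over small clusters instead of all pairs in the dataset.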

Video deduplication is particularly effective for:

* Educational content with similar presentations
* News clips covering the same events
* Entertainment content with repeated segments

### Image Deduplication

Semantic duplicates are images that convey nearly the same information content even though they are perceptually different (for example, re-encoded, resized, or lightly edited versions of the same scene).

Curator computes image deduplication in four steps:

* **Generate Embeddings**: Generate CLIP embeddings for the images
* **Convert to Text**: Convert the `ImageBatch` embeddings to `DocumentBatch` objects
* **Identify Semantic Duplicates**: Run the text-based semantic deduplication workflow and save the results
* **Remove Duplicates**: Read the data back and remove the identified duplicates
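
A minimal sketch of the conversion step: one JSONL record per image, in a layout a text-oriented semantic deduplication workflow could read. The `id` and `embedding` field names here are assumptions for illustration, not Curator's actual schema:

```python
import json
import tempfile
from pathlib import Path

def embeddings_to_jsonl(image_embeddings: dict[str, list[float]], out_path: str) -> int:
    """Write one record per image in a text-workflow-friendly JSONL layout."""
    with open(out_path, "w", encoding="utf-8") as f:
        for image_id, embedding in image_embeddings.items():
            f.write(json.dumps({"id": image_id, "embedding": embedding}) + "\n")
    return len(image_embeddings)

out = Path(tempfile.mkdtemp()) / "image_embeddings.jsonl"
n = embeddings_to_jsonl({"img_001": [0.1, 0.9], "img_002": [0.12, 0.88]}, str(out))
print(n)  # 2 records written
```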

## Architecture and Performance

### Distributed Processing

All deduplication workflows leverage distributed computing frameworks:

* **Ray Backend**: Provides scalable distributed processing
* **GPU Acceleration**: Essential for embedding generation and similarity computation
* **Memory Optimization**: Streaming processing for large datasets

### Scalability Characteristics

| Method   | Dataset Size   | Memory Requirements   | Processing Time            |
| -------- | -------------- | --------------------- | -------------------------- |
| Exact    | Unlimited      | Low (hash storage)    | Linear with data size      |
| Fuzzy    | Petabyte-scale | Moderate (LSH tables) | Sub-linear with LSH        |
| Semantic | Terabyte-scale | High (embeddings)     | Depends on model inference |

## Implementation Patterns

### Workflow-Based Processing

NeMo Curator provides high-level workflows that encapsulate the complete deduplication process:

```python
# Text-based workflow for identifying exact duplicates
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Text-based workflow for identifying fuzzy duplicates
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Text-based workflow for identifying (and optionally removing) semantic duplicates
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

# Text-based workflow for removing identified duplicates
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
```

### Stage-Based Processing

For fine-grained control, individual stages can be composed into custom pipelines:

```python
# Semantic deduplication stages
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage
```

## Integration with Pipeline Architecture

Deduplication workflows should be run separately from traditional pipelines in Curator. While other Curator modules are purely map-style operations that process chunks of data independently, deduplication workflows must identify duplicates across the entire dataset. Because this cross-dataset logic does not fit Curator's map-style functions, the deduplication modules are implemented as standalone workflows.

Each deduplication workflow expects input and output file path parameters, making them more self-contained than other Curator modules. The input and output of a deduplication workflow can be JSONL or Parquet files, meaning that they are compatible with Curator's traditional pipeline-based read and write stages. Thus, the user may write out intermediate results of a pipeline to be used as input to a deduplication workflow, then read the deduplicated output and resume curation.

As a high-level example:

```python
# Define first pipeline
pipeline_1 = Pipeline(name="text_curation_1", description="...")
# Read input data
pipeline_1.add_stage(JsonlReader(...))
# Add more stages like heuristic filters, etc.
pipeline_1.add_stage(...)
# Save intermediate results to JSONL
pipeline_1.add_stage(JsonlWriter(...))
pipeline_1.run()

# Create and run semantic deduplication workflow
workflow = SemanticDeduplicationWorkflow(...)
workflow.run()

# Define second pipeline
pipeline_2 = Pipeline(name="text_curation_2", description="...")
# Read deduplicated data
pipeline_2.add_stage(JsonlReader(...))
# Add more stages like classifiers, etc.
pipeline_2.add_stage(...)
# Save final results to JSONL
pipeline_2.add_stage(JsonlWriter(...))
pipeline_2.run()
```
