Deduplication Concepts#

This guide covers deduplication techniques available across all modalities in NeMo Curator, from exact hash-based matching to semantic similarity detection using embeddings.

Overview#

Deduplication is a critical step in data curation that removes duplicate and near-duplicate content to improve model training efficiency. NeMo Curator provides sophisticated deduplication capabilities that work across text, image, and video modalities.

Removing duplicates offers several benefits:

  • Improved Training Efficiency: Prevents overrepresentation of repeated content

  • Reduced Dataset Size: Significantly reduces storage and processing requirements

  • Better Model Performance: Eliminates redundant examples that can bias training

Deduplication Approaches#

NeMo Curator implements three main deduplication strategies, each with different strengths and use cases:

Exact Deduplication#

  • Method: Hash-based matching (MD5)

  • Best For: Identical copies and character-for-character matches

  • Speed: Very fast

  • GPU Required: Yes (for distributed processing)

Exact deduplication identifies documents or media files that are completely identical by computing cryptographic hashes of their content.

Modalities Supported: Text, Image, Video

Fuzzy Deduplication#

  • Method: MinHash and Locality-Sensitive Hashing (LSH)

  • Best For: Near-duplicates with minor changes (reformatting, small edits)

  • Speed: Fast

  • GPU Required: Yes

Fuzzy deduplication uses statistical fingerprinting to identify content that is nearly identical but may have small variations like formatting changes or minor edits.

Modalities Supported: Text

Semantic Deduplication#

  • Method: Embedding-based similarity using neural networks

  • Best For: Content with similar meaning but different expression

  • Speed: Moderate (requires embedding generation)

  • GPU Required: Yes

Semantic deduplication leverages deep learning embeddings to identify content that conveys similar meaning despite using different words, visual elements, or presentation.

Modalities Supported: Text, Image, Video

Multimodal Applications#

Text Deduplication#

Text deduplication is the most mature implementation, offering all three approaches:

  • Exact: Remove identical documents using MD5 hashing

  • Fuzzy: Remove near-duplicates using MinHash and LSH similarity

  • Semantic: Remove semantically similar content using embeddings

Text deduplication can handle web-scale datasets and is commonly used for:

  • Web crawl data (Common Crawl)

  • Academic papers (ArXiv)

  • Code repositories

  • General text corpora

Video Deduplication#

Video deduplication uses the semantic deduplication workflow with video embeddings:

  • Semantic Clustering: Uses the general K-means clustering workflow on video embeddings

  • Pairwise Similarity: Computes within-cluster similarity using the semantic deduplication pipeline

  • Representative Selection: Leverages the semantic workflow to identify and remove redundant content

Video deduplication is particularly effective for:

  • Educational content with similar presentations

  • News clips covering the same events

  • Entertainment content with repeated segments

Image Deduplication#

Image deduplication capabilities focus on removing duplicate images from datasets:

  • Duplicate Removal: Filters out images identified as duplicates from previous deduplication stages

  • Integration Support: Works with image processing pipelines through ImageBatch tasks

Architecture and Performance#

Distributed Processing#

All deduplication workflows leverage distributed computing frameworks:

  • Ray Backend: Provides scalable distributed processing

  • GPU Acceleration: Essential for embedding generation and similarity computation

  • Memory Optimization: Streaming processing for large datasets

Scalability Characteristics#

Table 1 Deduplication Scalability#

Method

Dataset Size

Memory Requirements

Processing Time

Exact

Unlimited

Low (hash storage)

Linear with data size

Fuzzy

Petabyte-scale

Moderate (LSH tables)

Sub-linear with LSH

Semantic

Terabyte-scale

High (embeddings)

Depends on model inference

Implementation Patterns#

Workflow-Based Processing#

NeMo Curator provides high-level workflows that encapsulate the complete deduplication process:

# Text exact deduplication
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Text fuzzy deduplication  
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Text semantic deduplication
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

Stage-Based Processing#

For fine-grained control, individual stages can be composed into custom pipelines:

# Video semantic deduplication stages
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

Integration with Pipeline Architecture#

Deduplication integrates seamlessly with NeMo Curator’s pipeline-based architecture:

  1. Input Compatibility: Works with DocumentBatch tasks from any data loading stage

  2. Output Integration: Produces standardized outputs for downstream processing

  3. Chaining Support: Can be combined with filtering and cleaning stages

  4. Executor Support: Compatible with all distributed execution backends