# Deduplication Concepts
This guide covers the deduplication techniques available in NeMo Curator across the text, image, and video modalities, from exact hash-based matching to semantic similarity detection using embeddings.
## Overview
Deduplication is a critical step in data curation that removes duplicate and near-duplicate content to improve model training efficiency. NeMo Curator provides deduplication capabilities that work across text, image, and video modalities.
Removing duplicates offers several benefits:
- **Improved Training Efficiency**: Prevents overrepresentation of repeated content
- **Reduced Dataset Size**: Significantly reduces storage and processing requirements
- **Better Model Performance**: Eliminates redundant examples that can bias training
## Deduplication Approaches
NeMo Curator implements three main deduplication strategies, each with different strengths and use cases:
### Exact Deduplication
- **Method**: Hash-based matching (MD5)
- **Best For**: Identical copies and character-for-character matches
- **Speed**: Very fast
- **Scale**: Unlimited size
- **GPU Required**: Yes (for distributed processing)
Exact deduplication identifies documents or media files that are completely identical by computing cryptographic hashes of their content.
**Modalities Supported**: Text
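As a minimal, framework-free sketch of the idea (not Curator's distributed implementation), the following drops exact duplicates from a list of documents by comparing MD5 digests of their text:

```python
import hashlib

def drop_exact_duplicates(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct document, matched by MD5 digest."""
    seen: set[str] = set()
    unique_docs: list[str] = []
    for doc in documents:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

docs = ["the cat sat", "the dog ran", "the cat sat"]
print(drop_exact_duplicates(docs))  # ['the cat sat', 'the dog ran']
```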
### Fuzzy Deduplication
- **Method**: MinHash and Locality-Sensitive Hashing (LSH)
- **Best For**: Near-duplicates with minor changes (reformatting, small edits)
- **Speed**: Fast
- **Scale**: Up to petabyte scale
- **GPU Required**: Yes
Fuzzy deduplication uses statistical fingerprinting to identify content that is nearly identical but may have small variations like formatting changes or minor edits.
**Modalities Supported**: Text
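To make the mechanism concrete, here is a self-contained toy sketch of MinHash signatures with LSH banding. It illustrates the statistics only, not Curator's GPU implementation; the shingle size, signature length, and banding parameters are arbitrary choices for illustration:

```python
import hashlib

NUM_PERM = 128       # hash functions per MinHash signature
BANDS, ROWS = 32, 4  # LSH banding; BANDS * ROWS == NUM_PERM

def shingles(text: str, n: int = 3) -> set[str]:
    """Character n-grams used as the document's feature set."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}

def minhash_signature(features: set[str]) -> list[int]:
    """One minimum per seeded hash function; matching entries estimate Jaccard similarity."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big") for f in features)
        for seed in range(NUM_PERM)
    ]

def lsh_buckets(sig: list[int]) -> set[tuple]:
    """Split the signature into bands; documents sharing any band become candidate duplicates."""
    return {(b, tuple(sig[b * ROWS : (b + 1) * ROWS])) for b in range(BANDS)}

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
print(bool(lsh_buckets(a) & lsh_buckets(b)))  # True when any band matches
```

Banding trades precision for recall: more rows per band makes a bucket match stricter, while more bands make near-duplicates more likely to collide in at least one bucket.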
### Semantic Deduplication
- **Method**: Embedding-based similarity using neural networks
- **Best For**: Content with similar meaning but different expression
- **Speed**: Moderate (due to the embedding generation step)
- **Scale**: Up to terabyte scale
- **GPU Required**: Yes
Semantic deduplication leverages deep learning embeddings to identify content that conveys similar meaning despite using different words, visual elements, or presentation.
**Modalities Supported**: Text, Image, Video
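The core similarity test reduces to a cosine comparison between embedding vectors. Here is a brute-force NumPy sketch of that test; Curator avoids the quadratic comparison by clustering first, as described in the sections below:

```python
import numpy as np

def semantic_duplicate_pairs(embeddings: np.ndarray, threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs whose cosine similarity meets or exceeds the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # rows are unit vectors, so this is cosine similarity
    return [
        (i, j)
        for i in range(len(sims))
        for j in range(i + 1, len(sims))
        if sims[i, j] >= threshold
    ]

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb[3] = emb[0] + 0.01 * rng.normal(size=8)  # make row 3 a near-duplicate of row 0
print(semantic_duplicate_pairs(emb))  # (0, 3) should appear among the reported pairs
```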
## Multimodal Applications
### Text Deduplication
Text deduplication is the most mature implementation, offering all three approaches:
- **Exact**: Remove identical documents using MD5 hashing
- **Fuzzy**: Remove near-duplicates using MinHash and LSH similarity
- **Semantic**: Remove semantically similar content using embeddings
Text deduplication can handle web-scale datasets and is commonly used for:
- Web crawl data (Common Crawl)
- Academic papers (ArXiv)
- Code repositories
- General text corpora
### Video Deduplication
Video deduplication uses the semantic deduplication workflow with video embeddings:
1. **Semantic Clustering**: Uses the general K-means clustering workflow on video embeddings
2. **Pairwise Similarity**: Computes within-cluster similarity using the semantic deduplication pipeline
3. **Representative Selection**: Leverages the semantic workflow to identify and remove redundant content
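A toy sketch of these three steps, using scikit-learn's `KMeans` in place of Curator's distributed clustering stage; the cluster count and similarity threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def find_redundant(embeddings: np.ndarray, n_clusters: int, threshold: float = 0.95) -> set[int]:
    """Cluster embeddings, then flag within-cluster near-duplicates by cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(normed)
    redundant: set[int] = set()
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = normed[idx] @ normed[idx].T  # pairwise similarity within one cluster only
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                if sims[a, b] >= threshold:
                    redundant.add(int(idx[b]))  # keep the earlier item as the representative
    return redundant
```

Clustering first keeps the pairwise comparison tractable: similarity is computed only within each cluster rather than across the full dataset.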
Video deduplication is particularly effective for:
- Educational content with similar presentations
- News clips covering the same events
- Entertainment content with repeated segments
### Image Deduplication
Semantic duplicates are images that contain almost the same information content, but are perceptually different.
Image deduplication is computed in Curator by:
1. **Generate Embeddings**: Generate CLIP embeddings for the images
2. **Convert to Text**: Convert the `ImageBatch` embeddings to `DocumentBatch` objects
3. **Identify Semantic Duplicates**: Run the text-based semantic deduplication workflow and save the results
4. **Remove Duplicates**: Read back the data and remove the identified duplicates
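A hedged sketch of steps 1 and 2: the embedding values below are stand-ins for Curator's CLIP embedding stage, and the `id`/`embedding` column names are illustrative assumptions rather than the exact `DocumentBatch` schema:

```python
import numpy as np
import pandas as pd

# Stand-in for step 1: in practice these vectors come from Curator's
# CLIP image embedding stage.
image_ids = ["img_000.jpg", "img_001.jpg"]
embeddings = np.random.default_rng(0).normal(size=(2, 512)).tolist()

# Step 2: reshape the image embeddings into a tabular, text-style record
# format; the column names here are assumptions for illustration.
df = pd.DataFrame({"id": image_ids, "embedding": embeddings})
df.to_parquet("image_embeddings.parquet")

# Steps 3 and 4 run the text-based semantic deduplication workflow on this
# Parquet file, then filter the original images by the returned duplicate IDs.
```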
## Architecture and Performance
### Distributed Processing
All deduplication workflows leverage distributed computing frameworks:
- **Ray Backend**: Provides scalable distributed processing
- **GPU Acceleration**: Essential for embedding generation and similarity computation
- **Memory Optimization**: Streaming processing for large datasets
### Scalability Characteristics
| Method | Dataset Size | Memory Requirements | Processing Time |
|---|---|---|---|
| Exact | Unlimited | Low (hash storage) | Linear with data size |
| Fuzzy | Petabyte-scale | Moderate (LSH tables) | Sub-linear with LSH |
| Semantic | Terabyte-scale | High (embeddings) | Depends on model inference |
## Implementation Patterns
### Workflow-Based Processing
NeMo Curator provides high-level workflows that encapsulate the complete deduplication process:
```python
# Text-based workflow for identifying exact duplicates
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Text-based workflow for identifying fuzzy duplicates
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Text-based workflow for identifying (and optionally removing) semantic duplicates
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

# Text-based workflow for removing identified duplicates
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
```
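As a usage sketch, a workflow is constructed with its path parameters and run directly. The parameter names below are illustrative assumptions, not the verified signature; consult the API reference for the actual constructor arguments:

```python
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# "input_path" and "output_path" are placeholder parameter names; each
# workflow expects input and output file paths (see below).
workflow = ExactDeduplicationWorkflow(
    input_path="input_data/",
    output_path="dedup_output/",
)
workflow.run()
```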
### Stage-Based Processing
For fine-grained control, individual stages can be composed into custom pipelines:
```python
# Semantic deduplication stages
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage
```
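Continuing from these imports, a hedged sketch of composing the stages into a custom pipeline, following the `Pipeline`/`add_stage` pattern used in the example at the end of this guide; constructor arguments are elided because they depend on your data:

```python
# Stage arguments are elided (as in the pipeline example below); the
# Pipeline import is likewise omitted to match this guide's other examples.
pipeline = Pipeline(name="custom_semantic_dedup", description="...")
pipeline.add_stage(KMeansStage(...))
pipeline.add_stage(PairwiseStage(...))
pipeline.add_stage(IdentifyDuplicatesStage(...))
pipeline.run()
```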
## Integration with Pipeline Architecture
Deduplication workflows should be run separately from Curator's traditional pipelines. Other Curator modules are purely map-style operations that process chunks of data independently, whereas deduplication must compare records across the entire dataset. This global view requires logic outside Curator's map-style functions, so the deduplication modules are implemented as standalone workflows.
Each deduplication workflow expects input and output file path parameters, which makes it more self-contained than other Curator modules. A workflow's input and output can be JSONL or Parquet files, so they are compatible with Curator's traditional pipeline-based read and write stages. The user can therefore write out the intermediate results of a pipeline to serve as input to a deduplication workflow, then read the deduplicated output back in and resume curation.
As a high-level example:
```python
# Define the first pipeline
pipeline_1 = Pipeline(name="text_curation_1", description="...")

# Read input data
pipeline_1.add_stage(JsonlReader(...))

# Add more stages like heuristic filters, etc.
pipeline_1.add_stage(...)

# Save intermediate results to JSONL
pipeline_1.add_stage(JsonlWriter(...))
pipeline_1.run()

# Create and run the semantic deduplication workflow
workflow = SemanticDeduplicationWorkflow(...)
workflow.run()

# Define the second pipeline
pipeline_2 = Pipeline(name="text_curation_2", description="...")

# Read the deduplicated data
pipeline_2.add_stage(JsonlReader(...))

# Add more stages like classifiers, etc.
pipeline_2.add_stage(...)

# Save final results to JSONL
pipeline_2.add_stage(JsonlWriter(...))
pipeline_2.run()
```