Deduplication Concepts

This guide covers deduplication techniques available across all modalities in NeMo Curator, from exact hash-based matching to semantic similarity detection using embeddings.

Overview

Deduplication is a critical step in data curation that removes duplicate and near-duplicate content to improve model training efficiency. NeMo Curator provides sophisticated deduplication capabilities that work across text and image modalities.

Removing duplicates offers several benefits:

  • Improved Training Efficiency: Prevents overrepresentation of repeated content
  • Reduced Dataset Size: Significantly reduces storage and processing requirements
  • Better Model Performance: Eliminates redundant examples that can bias training

Deduplication Approaches

NeMo Curator implements three main deduplication strategies, each with different strengths and use cases:

Exact Deduplication

  • Method: Hash-based matching (MD5)
  • Best For: Identical copies and character-for-character matches
  • Speed: Very fast
  • Scale: Unlimited size
  • GPU Required: Yes (for distributed processing)

Exact deduplication identifies documents that are character-for-character identical by computing a cryptographic hash (MD5) of each document's content.

Modalities Supported: Text
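
As a sketch of the principle (not Curator's distributed implementation), exact deduplication reduces to hashing each document and keeping only the first occurrence of each digest:

import hashlib

def md5_digest(text: str) -> str:
    # Identical content always produces an identical digest.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

docs = ["the cat sat", "the cat sat", "a different document"]

seen = set()
unique_docs = []
for doc in docs:
    digest = md5_digest(doc)
    if digest not in seen:  # keep only the first copy of each document
        seen.add(digest)
        unique_docs.append(doc)

print(unique_docs)  # ['the cat sat', 'a different document']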

Fuzzy Deduplication

  • Method: MinHash and Locality-Sensitive Hashing (LSH)
  • Best For: Near-duplicates with minor changes (reformatting, small edits)
  • Speed: Fast
  • Scale: Up to petabyte scale
  • GPU Required: Yes

Fuzzy deduplication uses statistical fingerprinting to identify content that is nearly identical but may have small variations like formatting changes or minor edits.

Modalities Supported: Text
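
The core idea can be sketched in plain Python: hash each document's shingle set many times with different salts, keeping the minimum each time, so that the fraction of matching signature positions estimates Jaccard similarity. This toy version omits the LSH step, which bands the signatures so candidate pairs are found without comparing every pair, and it differs from Curator's GPU implementation:

import hashlib

def shingles(text: str, n: int = 3) -> set[str]:
    # Character n-grams serve as the document's feature set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(features: set[str], num_perm: int = 128) -> list[int]:
    # One salted hash per "permutation"; keep the minimum value each time.
    return [
        min(int(hashlib.md5(f"{seed}:{feat}".encode()).hexdigest(), 16)
            for feat in features)
        for seed in range(num_perm)
    ]

sig_a = minhash(shingles("The quick brown fox jumps over the lazy dog"))
sig_b = minhash(shingles("The quick brown fox jumped over the lazy dog"))

# The fraction of matching signature positions estimates Jaccard similarity.
estimate = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
print(f"Estimated Jaccard similarity: {estimate:.2f}")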

Semantic Deduplication

  • Method: Embedding-based similarity using neural networks
  • Best For: Content with similar meaning but different expression
  • Speed: Moderate (due to embedding generation step)
  • Scale: Up to terabyte scale
  • GPU Required: Yes

Semantic deduplication leverages deep learning embeddings to identify content that conveys similar meaning despite using different words, visual elements, or presentation.

Modalities Supported: Text, Image
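
At its core, the method compares embedding vectors by cosine similarity, and pairs above a threshold become duplicate candidates. The sketch below uses random vectors as stand-ins for real model embeddings, and the threshold is illustrative; at scale, an all-pairs comparison like this is infeasible, which is why the workflow first clusters embeddings and only compares within clusters:

import numpy as np

# Random vectors stand in for neural embeddings; in practice each row
# would come from a text or image encoder. Rows are L2-normalized so
# cosine similarity reduces to a dot product.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 384)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

sim = emb @ emb.T  # all-pairs cosine similarity (feasible only at toy sizes)

# Pairs above the threshold are duplicate candidates; 0.95 is illustrative.
eps = 0.95
pairs = np.argwhere(np.triu(sim > eps, k=1))
print(f"{len(pairs)} candidate duplicate pairs")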

Multimodal Applications

Text Deduplication

Text deduplication is the most mature implementation, offering all three approaches:

  • Exact: Remove identical documents using MD5 hashing
  • Fuzzy: Remove near-duplicates using MinHash and LSH similarity
  • Semantic: Remove semantically similar content using embeddings

Text deduplication can handle web-scale datasets and is commonly used for:

  • Web crawl data (Common Crawl)
  • Academic papers (ArXiv)
  • Code repositories
  • General text corpora

Video Deduplication

Video deduplication uses the semantic deduplication workflow with video embeddings:

  • Semantic Clustering: Uses the general K-means clustering workflow on video embeddings
  • Pairwise Similarity: Computes within-cluster similarity using the semantic deduplication pipeline
  • Representative Selection: Leverages the semantic workflow to identify and remove redundant content

Video deduplication is particularly effective for:

  • Educational content with similar presentations
  • News clips covering the same events
  • Entertainment content with repeated segments
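
A minimal sketch of the cluster-then-compare pattern described above, using random vectors as stand-ins for video embeddings and scikit-learn's KMeans in place of Curator's distributed clustering stage (the cluster count and threshold are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Random vectors stand in for L2-normalized video clip embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((500, 256)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

n_clusters = 20  # illustrative cluster count
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)

eps = 0.97  # illustrative similarity threshold for "redundant"
keep = []
for c in range(n_clusters):
    idx = np.where(labels == c)[0]
    sim = emb[idx] @ emb[idx].T  # pairwise similarity within one cluster only
    kept_local = []
    for j in range(len(idx)):
        # Greedy representative selection: keep clip j unless it is a
        # near-duplicate of a clip already kept in this cluster.
        if all(sim[j, k] <= eps for k in kept_local):
            kept_local.append(j)
    keep.extend(int(idx[j]) for j in kept_local)

print(f"kept {len(keep)} of {len(emb)} clips")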

Image Deduplication

Image deduplication in Curator targets semantic duplicates: images that contain almost the same information content but are perceptually different.

Curator performs image deduplication in four steps:

  • Generate Embeddings: Generate CLIP embeddings for the images
  • Convert to Text: Convert the ImageBatch embeddings to DocumentBatch objects
  • Identify Semantic Duplicates: Run the text-based semantic deduplication workflow and save the results
  • Remove Duplicates: Read the data back and remove the identified duplicates
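
For a sense of the first step, CLIP image embeddings might be generated with the Hugging Face transformers library as below. The checkpoint and input file are assumptions for illustration; Curator's own embedding stages handle this step, along with the ImageBatch-to-DocumentBatch conversion, inside the workflow:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; Curator's image embedding stage manages
# model loading and batching itself.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input file
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    features = model.get_image_features(**inputs)

# L2-normalize so downstream cosine similarity is a dot product.
embedding = features / features.norm(dim=-1, keepdim=True)
print(embedding.shape)  # torch.Size([1, 512]) for this checkpoint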

Architecture and Performance

Distributed Processing

All deduplication workflows leverage distributed computing frameworks:

  • Ray Backend: Provides scalable distributed processing
  • GPU Acceleration: Essential for embedding generation and similarity computation
  • Memory Optimization: Streaming processing for large datasets

Scalability Characteristics

| Method   | Dataset Size   | Memory Requirements   | Processing Time            |
| -------- | -------------- | --------------------- | -------------------------- |
| Exact    | Unlimited      | Low (hash storage)    | Linear with data size      |
| Fuzzy    | Petabyte-scale | Moderate (LSH tables) | Sub-linear with LSH        |
| Semantic | Terabyte-scale | High (embeddings)     | Depends on model inference |

Implementation Patterns

Workflow-Based Processing

NeMo Curator provides high-level workflows that encapsulate the complete deduplication process:

# Text-based workflow for identifying exact duplicates
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Text-based workflow for identifying fuzzy duplicates
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Text-based workflow for identifying (and optionally removing) semantic duplicates
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

# Text-based workflow for removing identified duplicates
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

Stage-Based Processing

For fine-grained control, individual stages can be composed into custom pipelines:

# Semantic deduplication stages
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

Integration with Pipeline Architecture

Deduplication workflows run separately from traditional pipelines in Curator. Other Curator modules are purely map-style operations that process one chunk of data at a time, whereas deduplication must compare records across the entire dataset. Because this global view cannot be expressed with Curator's map-style functions, the deduplication modules are implemented as standalone workflows.

Each deduplication workflow takes explicit input and output file path parameters, which makes it more self-contained than other Curator modules. A workflow's input and output can be JSONL or Parquet files, so they are compatible with Curator's traditional pipeline-based read and write stages: a user can write out the intermediate results of a pipeline as input to a deduplication workflow, then read the deduplicated output back and resume curation.

As a high-level example:

# Define first pipeline
pipeline_1 = Pipeline(name="text_curation_1", description="...")
# Read input data
pipeline_1.add_stage(JsonlReader(...))
# Add more stages like heuristic filters, etc.
pipeline_1.add_stage(...)
# Save intermediate results to JSONL
pipeline_1.add_stage(JsonlWriter(...))
pipeline_1.run()

# Create and run semantic deduplication workflow
workflow = SemanticDeduplicationWorkflow(...)
workflow.run()

# Define second pipeline
pipeline_2 = Pipeline(name="text_curation_2", description="...")
# Read deduplicated data
pipeline_2.add_stage(JsonlReader(...))
# Add more stages like classifiers, etc.
pipeline_2.add_stage(...)
# Save final results to JSONL
pipeline_2.add_stage(JsonlWriter(...))
pipeline_2.run()