Deduplication Concepts

This guide covers deduplication techniques available across all modalities in NeMo Curator, from exact hash-based matching to semantic similarity detection using embeddings.

Overview

Deduplication is a critical step in data curation that removes duplicate and near-duplicate content to improve model training efficiency. NeMo Curator provides sophisticated deduplication capabilities that work across text and image modalities.

Removing duplicates offers several benefits:

  • Improved Training Efficiency: Prevents overrepresentation of repeated content
  • Reduced Dataset Size: Significantly reduces storage and processing requirements
  • Better Model Performance: Eliminates redundant examples that can bias training

Deduplication Approaches

NeMo Curator implements three main deduplication strategies, each with different strengths and use cases:

Exact Deduplication

  • Method: Hash-based matching (MD5)
  • Best For: Identical copies and character-for-character matches
  • Speed: Very fast
  • Scale: Unlimited size
  • GPU Required: Yes (for distributed processing)

Exact deduplication identifies documents that are character-for-character identical by computing a cryptographic hash (MD5) of each document's content.

Modalities Supported: Text
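
As a sketch of the principle (not Curator's distributed implementation), exact deduplication reduces to hashing each document and keeping only the first occurrence of each digest:

import hashlib

def md5_digest(text: str) -> str:
    # Identical content always produces an identical digest.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

docs = ["the cat sat", "the cat sat", "a different document"]

seen = set()
unique_docs = []
for doc in docs:
    digest = md5_digest(doc)
    if digest not in seen:  # keep only the first copy of each document
        seen.add(digest)
        unique_docs.append(doc)

print(unique_docs)  # ['the cat sat', 'a different document']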

Fuzzy Deduplication

  • Method: MinHash and Locality-Sensitive Hashing (LSH)
  • Best For: Near-duplicates with minor changes (reformatting, small edits)
  • Speed: Fast
  • Scale: Up to petabyte scale
  • GPU Required: Yes

Fuzzy deduplication uses statistical fingerprinting to identify content that is nearly identical but may have small variations like formatting changes or minor edits.

Modalities Supported: Text
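
The core idea can be sketched in plain Python: hash each document's shingle set many times with different salts, keeping the minimum each time, so that the fraction of matching signature positions estimates Jaccard similarity. This toy version omits the LSH step, which bands the signatures so candidate pairs are found without comparing every pair, and it differs from Curator's GPU implementation:

import hashlib

def shingles(text: str, n: int = 3) -> set[str]:
    # Character n-grams serve as the document's feature set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash(features: set[str], num_perm: int = 128) -> list[int]:
    # One salted hash per "permutation"; keep the minimum value each time.
    return [
        min(int(hashlib.md5(f"{seed}:{feat}".encode()).hexdigest(), 16)
            for feat in features)
        for seed in range(num_perm)
    ]

sig_a = minhash(shingles("The quick brown fox jumps over the lazy dog"))
sig_b = minhash(shingles("The quick brown fox jumped over the lazy dog"))

# The fraction of matching signature positions estimates Jaccard similarity.
estimate = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
print(f"Estimated Jaccard similarity: {estimate:.2f}")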

Semantic Deduplication

  • Method: Embedding-based similarity using neural networks
  • Best For: Content with similar meaning but different expression
  • Speed: Moderate (due to embedding generation step)
  • Scale: Up to terabyte scale
  • GPU Required: Yes

Semantic deduplication leverages deep learning embeddings to identify content that conveys similar meaning despite using different words, visual elements, or presentation.

Modalities Supported: Text, Image
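
At its core, the method compares embedding vectors by cosine similarity, and pairs above a threshold become duplicate candidates. The sketch below uses random vectors as stand-ins for real model embeddings, and the threshold is illustrative; at scale, an all-pairs comparison like this is infeasible, which is why the workflow first clusters embeddings and only compares within clusters:

import numpy as np

# Random vectors stand in for neural embeddings; in practice each row
# would come from a text or image encoder. Rows are L2-normalized so
# cosine similarity reduces to a dot product.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 384)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

sim = emb @ emb.T  # all-pairs cosine similarity (feasible only at toy sizes)

# Pairs above the threshold are duplicate candidates; 0.95 is illustrative.
eps = 0.95
pairs = np.argwhere(np.triu(sim > eps, k=1))
print(f"{len(pairs)} candidate duplicate pairs")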

Multimodal Applications

Text Deduplication

Text deduplication is the most mature implementation, offering all three approaches:

  • Exact: Remove identical documents using MD5 hashing
  • Fuzzy: Remove near-duplicates using MinHash and LSH similarity
  • Semantic: Remove semantically similar content using embeddings

Text deduplication can handle web-scale datasets and is commonly used for:

  • Web crawl data (Common Crawl)
  • Academic papers (ArXiv)
  • Code repositories
  • General text corpora

Video Deduplication

Video deduplication uses the semantic deduplication workflow with video embeddings:

  • Semantic Clustering: Uses the general K-means clustering workflow on video embeddings
  • Pairwise Similarity: Computes within-cluster similarity using the semantic deduplication pipeline
  • Representative Selection: Leverages the semantic workflow to identify and remove redundant content

Video deduplication is particularly effective for:

  • Educational content with similar presentations
  • News clips covering the same events
  • Entertainment content with repeated segments
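
A minimal sketch of the cluster-then-compare pattern described above, using random vectors as stand-ins for video embeddings and scikit-learn's KMeans in place of Curator's distributed clustering stage (the cluster count and threshold are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Random vectors stand in for L2-normalized video clip embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((500, 256)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

n_clusters = 20  # illustrative cluster count
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)

eps = 0.97  # illustrative similarity threshold for "redundant"
keep = []
for c in range(n_clusters):
    idx = np.where(labels == c)[0]
    sim = emb[idx] @ emb[idx].T  # pairwise similarity within one cluster only
    kept_local = []
    for j in range(len(idx)):
        # Greedy representative selection: keep clip j unless it is a
        # near-duplicate of a clip already kept in this cluster.
        if all(sim[j, k] <= eps for k in kept_local):
            kept_local.append(j)
    keep.extend(int(idx[j]) for j in kept_local)

print(f"kept {len(keep)} of {len(emb)} clips")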

Image Deduplication

Image deduplication in Curator targets semantic duplicates: images that contain almost the same information content but are perceptually different.

Curator performs image deduplication in four steps:

  • Generate Embeddings: Generate CLIP embeddings for the images
  • Convert to Text: Convert the ImageBatch embeddings to DocumentBatch objects
  • Identify Semantic Duplicates: Run the text-based semantic deduplication workflow and save the results
  • Remove Duplicates: Read the data back and remove the identified duplicates
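
For a sense of the first step, CLIP image embeddings might be generated with the Hugging Face transformers library as below. The checkpoint and input file are assumptions for illustration; Curator's own embedding stages handle this step, along with the ImageBatch-to-DocumentBatch conversion, inside the workflow:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; Curator's image embedding stage manages
# model loading and batching itself.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input file
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    features = model.get_image_features(**inputs)

# L2-normalize so downstream cosine similarity is a dot product.
embedding = features / features.norm(dim=-1, keepdim=True)
print(embedding.shape)  # torch.Size([1, 512]) for this checkpoint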

Architecture and Performance

Distributed Processing

All deduplication workflows leverage distributed computing frameworks:

  • Ray Backend: Provides scalable distributed processing
  • GPU Acceleration: Essential for embedding generation and similarity computation
  • Memory Optimization: Streaming processing for large datasets

Scalability Characteristics

| Method   | Dataset Size   | Memory Requirements   | Processing Time            |
| -------- | -------------- | --------------------- | -------------------------- |
| Exact    | Unlimited      | Low (hash storage)    | Linear with data size      |
| Fuzzy    | Petabyte-scale | Moderate (LSH tables) | Sub-linear with LSH        |
| Semantic | Terabyte-scale | High (embeddings)     | Depends on model inference |

Implementation Patterns

Workflow-Based Processing

NeMo Curator provides high-level workflows that encapsulate the complete deduplication process:

# Text-based workflow for identifying exact duplicates
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Text-based workflow for identifying fuzzy duplicates
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Text-based workflow for identifying (and optionally removing) semantic duplicates
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

# Text-based workflow for removing identified duplicates
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

Stage-Based Processing

For fine-grained control, individual stages can be composed into custom pipelines:

# Semantic deduplication stages
from nemo_curator.stages.deduplication.semantic.kmeans import KMeansStage
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage
from nemo_curator.stages.deduplication.semantic.identify_duplicates import IdentifyDuplicatesStage

Integration with Pipeline Architecture

Deduplication workflows run separately from traditional pipelines in Curator. Other Curator modules are purely map-style operations that process one chunk of data at a time, whereas deduplication must compare records across the entire dataset. Because this global view cannot be expressed with Curator's map-style functions, the deduplication modules are implemented as standalone workflows.

Each deduplication workflow takes explicit input and output file path parameters, which makes it more self-contained than other Curator modules. A workflow's input and output can be JSONL or Parquet files, so they are compatible with Curator's traditional pipeline-based read and write stages: a user can write out the intermediate results of a pipeline as input to a deduplication workflow, then read the deduplicated output back and resume curation.

As a high-level example:

# Define first pipeline
pipeline_1 = Pipeline(name="text_curation_1", description="...")
# Read input data
pipeline_1.add_stage(JsonlReader(...))
# Add more stages like heuristic filters, etc.
pipeline_1.add_stage(...)
# Save intermediate results to JSONL
pipeline_1.add_stage(JsonlWriter(...))
pipeline_1.run()

# Create and run semantic deduplication workflow
workflow = SemanticDeduplicationWorkflow(...)
workflow.run()

# Define second pipeline
pipeline_2 = Pipeline(name="text_curation_2", description="...")
# Read deduplicated data
pipeline_2.add_stage(JsonlReader(...))
# Add more stages like classifiers, etc.
pipeline_2.add_stage(...)
# Save final results to JSONL
pipeline_2.add_stage(JsonlWriter(...))
pipeline_2.run()