This guide covers deduplication techniques available across all modalities in NeMo Curator, from exact hash-based matching to semantic similarity detection using embeddings.
Deduplication is a critical step in data curation that removes duplicate and near-duplicate content to improve model training efficiency. NeMo Curator provides deduplication workflows for text, image, and video data.
Removing duplicates offers several benefits:
NeMo Curator implements three main deduplication strategies, each with different strengths and use cases:
Exact deduplication identifies documents or media files that are completely identical by computing cryptographic hashes of their content.
Modalities Supported: Text
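To make the idea concrete, here is a minimal sketch of hash-based exact matching in plain Python. It does not use Curator's API: the function name `exact_duplicate_ids`, the `id`/`text` record layout, and the choice of MD5 are assumptions made for illustration.

```python
import hashlib


def exact_duplicate_ids(docs):
    """Return IDs of documents whose text is byte-identical to an earlier one."""
    seen = {}  # content hash -> ID of the first document with that hash
    duplicates = []
    for doc in docs:
        # Assumption: MD5 over the UTF-8 bytes; any stable content hash works here.
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append(doc["id"])
        else:
            seen[digest] = doc["id"]
    return duplicates


docs = [
    {"id": "a", "text": "The quick brown fox."},
    {"id": "b", "text": "The quick brown fox."},  # identical to "a"
    {"id": "c", "text": "The quick brown fox!"},  # differs by one character
]
dup_ids = exact_duplicate_ids(docs)
```

Only document `b` is flagged. Document `c` differs by a single character, so its hash is unrelated, which is exactly the gap that fuzzy deduplication addresses.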
Fuzzy deduplication uses statistical fingerprinting to identify content that is nearly identical but may have small variations like formatting changes or minor edits.
Modalities Supported: Text
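One common form of statistical fingerprinting is MinHash over character shingles. The sketch below illustrates that general technique with the standard library only; it is not Curator's implementation, and the helper names, the shingle width of 5, and the 128 hash functions are illustrative assumptions.

```python
import hashlib


def shingles(text, n=5):
    """Character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(text, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


original = "NeMo Curator removes duplicate documents from large text datasets."
edited = "NeMo Curator removes duplicate documents from large text data sets."
unrelated = "Bananas are rich in potassium and grow in tropical climates."

sig = minhash_signature(original)
near = estimated_jaccard(sig, minhash_signature(edited))
far = estimated_jaccard(sig, minhash_signature(unrelated))
```

Here `near` comes out far higher than `far`, so a similarity threshold separates the lightly edited copy from the unrelated sentence. Comparing every pair of signatures does not scale to large datasets; a common remedy is to bucket signatures with locality-sensitive hashing first.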
Semantic deduplication leverages deep learning embeddings to identify content that conveys similar meaning despite using different words, visual elements, or presentation.
Modalities Supported: Text, Image
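The core comparison can be sketched as a cosine-similarity threshold over embedding vectors. The example below is a brute-force illustration in plain Python, not Curator's workflow: the three-dimensional vectors are hand-written stand-ins for real model embeddings, and the function names and the 0.95 threshold are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_duplicate_ids(embeddings, threshold=0.95):
    """Keep the first item seen; flag later items that are too close to a kept one."""
    kept = []
    duplicates = []
    for item_id, vector in embeddings.items():
        if any(cosine_similarity(vector, embeddings[k]) >= threshold for k in kept):
            duplicates.append(item_id)
        else:
            kept.append(item_id)
    return duplicates


# Toy vectors standing in for embeddings produced by a text or image model.
embeddings = {
    "cat_photo": [0.90, 0.10, 0.00],
    "cat_photo_cropped": [0.88, 0.12, 0.01],
    "dog_photo": [0.10, 0.90, 0.20],
}
semantic_dups = semantic_duplicate_ids(embeddings)
```

The cropped photo is flagged because its vector points in nearly the same direction as the original, while the dog photo is kept. This all-pairs loop is quadratic in the number of items, so at scale the comparison is usually restricted to items that fall in the same cluster of the embedding space.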
Text deduplication is the most mature implementation, offering all three approaches:
Text deduplication can handle web-scale datasets and is commonly used for:
Video deduplication uses the semantic deduplication workflow with video embeddings:
Video deduplication is particularly effective for:
Semantic duplicates are images that contain almost the same information content, but are perceptually different.
Image deduplication is computed in Curator by:
ImageBatch embeddings to DocumentBatch objects
All deduplication workflows leverage distributed computing frameworks:
NeMo Curator provides high-level workflows that encapsulate the complete deduplication process:
For fine-grained control, individual stages can be composed into custom pipelines:
Deduplication workflows should be run separately from traditional pipelines in Curator. While other Curator modules are purely map-style operations, meaning that they run on chunks of the data at a time, deduplication workflows identify duplicates across the entire dataset. Thus, special logic is needed outside of Curator’s map-style functions, and the deduplication modules are implemented as separate workflows.
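The difference between chunk-wise and dataset-wide processing can be seen in a small plain-Python sketch. The helper `dedup_in_order` and the toy chunks are hypothetical and are not part of Curator.

```python
def dedup_in_order(texts):
    """Drop repeated strings, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


# Two chunks of one dataset, as a map-style stage would see them.
chunks = [["alpha", "beta", "alpha"], ["beta", "gamma"]]

# Map-style: each chunk is processed on its own, with no shared state.
map_style = [text for chunk in chunks for text in dedup_in_order(chunk)]

# Dataset-wide: every record is compared against every other record.
dataset_wide = dedup_in_order([text for chunk in chunks for text in chunk])
```

The map-style pass still contains `"beta"` twice because the two copies sit in different chunks; only the dataset-wide pass removes it.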
Each deduplication workflow expects input and output file path parameters, making them more self-contained than other Curator modules. The input and output of a deduplication workflow can be JSONL or Parquet files, meaning that they are compatible with Curator’s traditional pipeline-based read and write stages. Thus, the user may write out intermediate results of a pipeline to be used as input to a deduplication workflow, then read the deduplicated output and resume curation.
As a high-level example:
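The following is a minimal sketch of this hand-off pattern using only the Python standard library rather than Curator's API. The function `run_dedup_workflow` is a hypothetical stand-in for a deduplication workflow that takes input and output paths; the file names and record layout are likewise assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_jsonl(path, records):
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def run_dedup_workflow(input_path, output_path):
    """Hypothetical stand-in: read JSONL, drop exact duplicates, write JSONL."""
    seen = set()
    unique = []
    for record in read_jsonl(input_path):
        digest = hashlib.md5(record["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    write_jsonl(output_path, unique)


with tempfile.TemporaryDirectory() as tmp:
    intermediate = Path(tmp) / "intermediate.jsonl"
    deduplicated = Path(tmp) / "deduplicated.jsonl"

    # 1. A map-style step (here: whitespace cleanup) writes intermediate results.
    raw_texts = [" hello ", "world", "hello"]
    cleaned = [{"id": i, "text": t.strip()} for i, t in enumerate(raw_texts)]
    write_jsonl(intermediate, cleaned)

    # 2. The deduplication workflow runs separately, driven by file paths.
    run_dedup_workflow(intermediate, deduplicated)

    # 3. The deduplicated output is read back so that curation can resume.
    resumed = read_jsonl(deduplicated)
```

After the cleanup step, the first and third records have identical text, so `resumed` holds two records. The only contract between the pipeline and the workflow is the pair of file paths, which is what lets the two run independently.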