Remove duplicate and near-duplicate documents from text datasets using NeMo Curator’s GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.
NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the data processing pipeline.
Each approach is optimized for a different duplicate type:
Exact Deduplication:
Method: MD5 hashing
Detects: Character-for-character identical documents
Speed: Fastest
Computes MD5 hashes for each document’s text content and groups documents with identical hashes. Best for removing exact copies.
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Exact Duplicate Removal for details.
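The hash-and-group idea described above can be illustrated with a minimal CPU-only sketch. It uses only the Python standard library and is not NeMo Curator's implementation. The function name `find_exact_duplicates` and the sample documents are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs):
    """Return the IDs of documents whose text matches an earlier document.

    Hypothetical helper: `docs` maps a document ID to its text.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        # Identical text always yields an identical MD5 digest.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)

    duplicate_ids = []
    for ids in groups.values():
        # Keep the lowest ID in each group and flag the rest for removal.
        duplicate_ids.extend(sorted(ids)[1:])
    return sorted(duplicate_ids)


docs = {
    0: "the quick brown fox",
    1: "the quick brown fox",
    2: "The quick brown fox",  # differs by one capital letter, so it is kept
    3: "the quick brown fox",
}
exact_duplicate_ids = find_exact_duplicates(docs)
print(exact_duplicate_ids)
```

Document 2 survives because exact matching treats any character difference as a distinct document, which is the gap that fuzzy deduplication is meant to cover.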
Fuzzy Deduplication:
Method: MinHash + Locality Sensitive Hashing (LSH)
Detects: Near-duplicates with minor edits (~80% similarity)
Speed: Fast
Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Fuzzy Duplicate Removal for details.
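To make the MinHash + LSH idea concrete, here is a small pure-Python sketch. It is an illustration under stated assumptions, not NeMo Curator's GPU implementation. The shingle size, `NUM_HASHES`, `NUM_BANDS`, and all function names are hypothetical choices. With b bands of r rows each, the similarity at which a pair becomes likely to collide is roughly (1/b)^(1/r), which is about 0.77 for the values used here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64  # signature length (illustrative value)
NUM_BANDS = 8    # LSH bands, so each band covers 8 signature rows


def shingles(text, n=5):
    """Character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def seeded_hash(seed, shingle):
    """Deterministic 64-bit hash; the seed selects one hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text):
    """For each hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    return [min(seeded_hash(seed, g) for g in grams) for seed in range(NUM_HASHES)]


def lsh_candidate_pairs(signatures):
    """Pair up documents that share an identical band in any position."""
    rows = NUM_HASHES // NUM_BANDS
    candidates = set()
    for band in range(NUM_BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


docs = {
    0: "NeMo Curator removes duplicate documents from large text datasets before training.",
    1: "nemo curator   removes duplicate documents from large text datasets before training.",
    2: "NeMo Curator removes duplicate documents from large text datasets before trainign.",
    3: "Quantum physics explores the behavior of subatomic particles.",
}
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
candidate_pairs = lsh_candidate_pairs(signatures)
```

Documents 0 and 1 differ only in case and spacing, so they normalize to the same shingle set and always land in the same buckets. Document 2 contains a typo and scores high but below 1.0, while document 3 shares almost no shingles with the others.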
Semantic Deduplication:
Method: Embeddings + clustering + pairwise similarity
Detects: Semantically similar content (paraphrases, translations)
Speed: Moderate
Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
Note: Two workflows are available:
TextSemanticDeduplicationWorkflow: For raw text with automatic embedding generation
SemanticDeduplicationWorkflow: For pre-computed embeddings
See Semantic Deduplication for details.
For fine-grained control, break semantic deduplication into separate stages:
This approach enables analysis of intermediate results and fine-grained control.
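As a rough illustration of how the stages fit together, the sketch below starts from pre-computed embeddings, assigns each one to its nearest centroid, and then compares documents only within each cluster. It is a toy standard-library example, not the NeMo Curator stage API. The function names, the fixed centroids, the 0.95 threshold, and the sample vectors are all assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_clusters(embeddings, centroids):
    """Stage 1 (clustering): map each document to its most similar centroid."""
    clusters = defaultdict(list)
    for doc_id, vec in embeddings.items():
        best = max(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
        clusters[best].append(doc_id)
    return clusters


def semantic_duplicates(embeddings, clusters, threshold=0.95):
    """Stage 2 (pairwise similarity): flag later documents that are too close."""
    to_remove = set()
    for ids in clusters.values():
        for a, b in combinations(sorted(ids), 2):
            if a in to_remove or b in to_remove:
                continue
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                to_remove.add(b)  # keep the lower ID, drop the higher one
    return sorted(to_remove)


# Toy pre-computed embeddings; a real pipeline would use a transformer model.
embeddings = {
    0: [1.0, 0.0, 0.0],
    1: [0.99, 0.05, 0.0],  # nearly parallel to document 0
    2: [0.0, 1.0, 0.0],
    3: [0.6, 0.0, 0.8],
}
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # assumed to come from k-means

clusters = assign_clusters(embeddings, centroids)
semantic_duplicate_ids = semantic_duplicates(embeddings, clusters)
```

Keeping `clusters` as a separate intermediate result makes it possible to inspect cluster sizes or try several thresholds without recomputing embeddings.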
Choose a deduplication method based on your needs:
Exact Duplicate Removal: Identify and remove character-for-character duplicates using MD5 hashing.
Fuzzy Duplicate Removal: Identify and remove near-duplicates using MinHash and LSH similarity.
Semantic Deduplication: Remove semantically similar documents using embeddings.
Duplicate removal workflows require stable document identifiers. Choose one approach:
Use AddId to add IDs at the start of your pipeline
Use reader-based ID generation (_generate_ids, _assign_ids) backed by the ID Generator actor for stable integer IDs
Some workflows write an ID generator state file (*_id_generator.json) for later removal when IDs are auto-assigned.
Use TextDuplicatesRemovalWorkflow to apply duplicate IDs to your original dataset. Works with IDs from exact, fuzzy, or semantic deduplication.
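Conceptually, the removal step is a filter of the original dataset against the list of duplicate IDs. The sketch below shows that idea with plain Python records. The helper `remove_duplicates` and the sample records are hypothetical and do not reflect the TextDuplicatesRemovalWorkflow interface.

```python
def remove_duplicates(records, ids_to_remove, id_field="id"):
    """Return the records whose ID is not in the duplicate list.

    Hypothetical helper: `id_field` names the column holding the document ID.
    """
    drop = set(ids_to_remove)  # set membership keeps the filter linear
    return [record for record in records if record[id_field] not in drop]


records = [
    {"id": 0, "text": "the quick brown fox"},
    {"id": 1, "text": "the quick brown fox"},
    {"id": 2, "text": "a different document"},
]
ids_to_remove = [1]  # produced by exact, fuzzy, or semantic identification

deduplicated = remove_duplicates(records, ids_to_remove)
kept_ids = [record["id"] for record in deduplicated]
```

The filter only works if the IDs in the duplicate list and the IDs in the dataset come from the same ID space, which is why stable identifiers matter.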
When assign_id=True (IDs auto-assigned):
Duplicate IDs are stored in the _curator_dedup_id column
Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
id_generator_path is required
When assign_id=False (using existing IDs):
Duplicate IDs are stored in the column named by id_field (e.g., "id")
Set ids_to_remove_duplicate_id_field to match your id_field value
id_generator_path is not required
All deduplication workflows return a WorkflowRunResult object with timing and duplicate count metadata:
Available metadata varies by workflow. Common keys include total_time and num_duplicates.
Each deduplication method produces specific output files and directories:
Column names:
_curator_dedup_id when assign_id=True or IDs are auto-assigned
The column named by the id_field parameter when assign_id=False
Compare deduplication methods to select the best approach for your dataset:
Use this guide to quickly select the right method:
Exact Deduplication:
Fuzzy Deduplication:
Semantic Deduplication:
You can combine deduplication methods for comprehensive duplicate removal:
Run each method independently, then combine duplicate IDs before removal.
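Combining the results amounts to taking the union of the duplicate ID lists, as in this minimal sketch. The helper name and the sample IDs are assumptions, and the sketch presumes that all three methods identified documents using the same ID column.

```python
def combine_duplicate_ids(*id_lists):
    """Merge duplicate ID lists from several methods into one sorted list."""
    combined = set()
    for ids in id_lists:
        combined.update(ids)  # an ID flagged by several methods is kept once
    return sorted(combined)


exact_ids = [1, 3]
fuzzy_ids = [3, 7]
semantic_ids = [9]

ids_to_remove = combine_duplicate_ids(exact_ids, fuzzy_ids, semantic_ids)
print(ids_to_remove)
```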
For detailed implementation guides, see:
All deduplication workflows require GPU acceleration:
GPU acceleration provides significant speedup for large datasets through parallel processing.
For optimal performance with large datasets, configure the Ray backend:
For TB-scale datasets, consider distributed GPU clusters with Ray.
For large-scale duplicate removal, persist the ID Generator for consistent document tracking:
The ID Generator ensures consistent IDs across workflow stages.
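The value of persisting ID state can be shown with a toy generator that remembers which ID range it gave to each input batch. This is a hypothetical stand-in written for illustration. The class `SimpleIdGenerator`, its JSON layout, and the file name are assumptions, and they do not describe the real ID Generator actor or the contents of its state file.

```python
import json
import os
import tempfile


class SimpleIdGenerator:
    """Toy generator that assigns contiguous integer ID ranges per batch."""

    def __init__(self, next_id=0, batches=None):
        self.next_id = next_id
        self.batches = dict(batches or {})  # batch key -> [start, stop)

    def register_batch(self, key, num_docs):
        """Return the ID range for a batch, reusing it if already assigned."""
        if key not in self.batches:
            self.batches[key] = [self.next_id, self.next_id + num_docs]
            self.next_id += num_docs
        start, stop = self.batches[key]
        return range(start, stop)

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"next_id": self.next_id, "batches": self.batches}, f)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return cls(state["next_id"], state["batches"])


# Identification run: assign IDs and persist the state.
generator = SimpleIdGenerator()
first_ids = list(generator.register_batch("part-0.jsonl", 3))
second_ids = list(generator.register_batch("part-1.jsonl", 2))

with tempfile.TemporaryDirectory() as tmp:
    state_path = os.path.join(tmp, "example_id_generator.json")
    generator.save(state_path)
    # Removal run: reload the state so the same batches get the same IDs.
    restored = SimpleIdGenerator.load(state_path)

replayed_ids = list(restored.register_batch("part-1.jsonl", 2))
new_ids = list(restored.register_batch("part-2.jsonl", 1))
```

Because the restored generator replays the same range for `part-1.jsonl`, duplicate IDs computed during identification still point at the same documents during removal.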
Ready to use deduplication?
For hands-on guidance: See Text Curation Tutorials for step-by-step examples.