Deduplication
Remove duplicate and near-duplicate documents from text datasets using NeMo Curator’s GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.
NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the data processing pipeline.
How It Works
NeMo Curator provides three deduplication approaches, each optimized for different duplicate types:
Exact
- Method: MD5 hashing
- Detects: Character-for-character identical documents
- Speed: Fastest
Computes MD5 hashes for each document’s text content and groups documents with identical hashes. Best for removing exact copies.
Code Example
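The core idea can be sketched in plain Python with the standard library's hashlib (a toy illustration of the hashing-and-grouping step, not the GPU-accelerated workflow API):

```python
import hashlib

def exact_duplicate_ids(docs):
    """Group documents by the MD5 hash of their text and return the IDs of
    every document after the first in each group (the duplicates to remove)."""
    first_seen = {}   # hex digest -> ID of the first document with that hash
    duplicates = []
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates.append(doc_id)
        else:
            first_seen[digest] = doc_id
    return duplicates

docs = [(0, "the cat sat"), (1, "a dog ran"), (2, "the cat sat")]
print(exact_duplicate_ids(docs))  # [2]
```

Because hashing is exact, a single changed character produces a different digest, which is why near-duplicates need the fuzzy or semantic methods below.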
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Exact Duplicate Removal for details.
Fuzzy
- Method: MinHash + Locality Sensitive Hashing (LSH)
- Detects: Near-duplicates with minor edits (~80% similarity)
- Speed: Fast
Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
Code Example
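A stdlib-only sketch of MinHash signatures with LSH banding (a toy illustration; the shingle size, num_perm, and bands values are illustrative, and the real workflow runs these steps on GPU at scale):

```python
import hashlib

def shingles(text, n=3):
    """Character n-gram shingle set for one document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_perm=32):
    """One minimum per salted hash function over the shingle set."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def lsh_candidate_pairs(texts, num_perm=32, bands=8):
    """Band each signature; documents sharing any full band become candidates."""
    rows = num_perm // bands
    buckets = {}
    for doc_id, text in enumerate(texts):
        sig = minhash_signature(text, num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    return {(a, c) for ids in buckets.values() for a in ids for c in ids if a < c}

texts = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog!",
    "an entirely different sentence about ships",
]
print(lsh_candidate_pairs(texts))
```

The first two documents differ by a single character, so most of their minhashes agree and at least one band almost certainly collides, making them a candidate pair; the unrelated third document shares few shingles and lands in different buckets.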
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Fuzzy Duplicate Removal for details.
Semantic
- Method: Embeddings + clustering + pairwise similarity
- Detects: Semantically similar content (paraphrases, translations)
- Speed: Moderate
Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
Code Example
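The pairwise-similarity step can be illustrated in plain Python. The vectors below are toy stand-ins for transformer embeddings, all documents are treated as one cluster, and the 0.9 threshold is illustrative (the real workflow first clusters embeddings with K-means so that pairwise comparison only happens within clusters):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_duplicates(embeddings, threshold=0.9):
    """Pairwise similarity within one cluster; flag the later document of
    any pair above the threshold as a duplicate."""
    duplicates = set()
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                duplicates.add(j)
    return sorted(duplicates)

# Toy stand-ins for transformer embeddings: documents 0 and 2 are "paraphrases".
vecs = [
    [0.90, 0.10, 0.00],
    [0.00, 0.20, 0.90],
    [0.88, 0.15, 0.05],
]
print(semantic_duplicates(vecs))  # [2]
```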
Note: Two workflows are available:
- TextSemanticDeduplicationWorkflow: for raw text with automatic embedding generation
- SemanticDeduplicationWorkflow: for pre-computed embeddings
See Semantic Deduplication for details.
Advanced: Step-by-Step Semantic Deduplication
For fine-grained control, break semantic deduplication into separate stages:
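As a toy illustration of the staged flow, here is a pipeline whose intermediate artifacts (embeddings, cluster assignments) are plain Python objects you can inspect between stages. The 2-d "embeddings", fixed centroids, and thresholds are all stand-ins; the actual stage classes and parameters should be taken from the Semantic Deduplication guide:

```python
import math

# Stage 1: embedding generation. Toy 2-d vectors (length, word count) stand in
# for transformer-model output.
def embedding_stage(docs):
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    return {i: normalize([float(len(t)), float(t.count(" ") + 1)])
            for i, t in enumerate(docs)}

# Stage 2: assign each embedding to the nearest of k fixed centroids
# (a stand-in for K-means clustering).
def clustering_stage(embeddings, centroids):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    clusters = {}
    for doc_id, vec in embeddings.items():
        c = min(range(len(centroids)), key=lambda k: dist2(vec, centroids[k]))
        clusters.setdefault(c, []).append(doc_id)
    return clusters

# Stage 3: pairwise cosine similarity within each cluster only.
def pairwise_stage(embeddings, clusters, threshold=0.99):
    dups = set()
    for ids in clusters.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                i, j = ids[a], ids[b]
                if sum(x * y for x, y in zip(embeddings[i], embeddings[j])) >= threshold:
                    dups.add(j)
    return sorted(dups)

docs = ["the cat sat on the mat",
        "the cat sat on the mat!",
        "completely different"]
embeddings = embedding_stage(docs)  # inspectable intermediate artifact
clusters = clustering_stage(embeddings, centroids=[[1.0, 0.25], [0.0, 1.0]])
dups = pairwise_stage(embeddings, clusters)
print(dups)  # [1]
```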
This approach lets you inspect intermediate results and tune each stage independently.
Deduplication Methods
Choose a deduplication method based on your needs:
- Exact: Identify and remove character-for-character duplicates using MD5 hashing
- Fuzzy: Identify and remove near-duplicates using MinHash and LSH
- Semantic: Remove semantically similar documents using embeddings
Common Operations
Document IDs
Duplicate removal workflows require stable document identifiers. Choose one approach:
- Use AddId to add IDs at the start of your pipeline
- Use reader-based ID generation (_generate_ids, _assign_ids) backed by the ID Generator actor for stable integer IDs
- Use existing IDs if your documents already have unique identifiers
When IDs are auto-assigned, some workflows write an ID generator state file (*_id_generator.json) that later removal runs require.
Removing Duplicates
Use TextDuplicatesRemovalWorkflow to apply duplicate IDs to your original dataset. Works with IDs from exact, fuzzy, or semantic deduplication.
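Conceptually, removal is an anti-join between the original dataset and the flagged IDs. A minimal sketch with plain Python rows (the column names are illustrative):

```python
# Rows of the original dataset, each carrying a stable document ID.
dataset = [
    {"id": 0, "text": "the cat sat"},
    {"id": 1, "text": "a dog ran"},
    {"id": 2, "text": "the cat sat"},
]

# IDs flagged by an exact, fuzzy, or semantic identification run.
duplicate_ids = {2}

# Anti-join: keep only rows whose ID was not flagged.
deduplicated = [row for row in dataset if row["id"] not in duplicate_ids]
print([row["id"] for row in deduplicated])  # [0, 1]
```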
ID Field Configuration
When assign_id=True (IDs auto-assigned):
- Duplicate IDs file contains the _curator_dedup_id column
- Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
- id_generator_path is required

When assign_id=False (using existing IDs):
- Duplicate IDs file contains the column specified by id_field (e.g., "id")
- Set ids_to_remove_duplicate_id_field to match your id_field value
- id_generator_path is not required
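As a sketch, the two cases differ only in the keyword arguments below; the state-file path is hypothetical, and the full TextDuplicatesRemovalWorkflow signature should be checked against the API reference:

```python
# Case 1: IDs were auto-assigned (assign_id=True during identification).
auto_id_config = dict(
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="exact_id_generator.json",  # hypothetical state-file path
)

# Case 2: documents already had unique IDs (assign_id=False, id_field="id").
existing_id_config = dict(
    ids_to_remove_duplicate_id_field="id",  # must match your id_field value
    # id_generator_path is not required in this case
)
```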
Outputs and Artifacts
Each deduplication method produces specific output files and directories:
Column names:
- _curator_dedup_id when assign_id=True or IDs are auto-assigned
- Matches the id_field parameter when assign_id=False
Choosing a Deduplication Method
Compare deduplication methods to select the best approach for your dataset:
Quick Decision Guide
Use this guide to quickly select the right method:
- Start with Exact if you have numerous identical documents or need the fastest speed
- Use Fuzzy if you need to catch near-duplicates with minor formatting differences
- Use Semantic for meaning-based deduplication on large, diverse datasets
When to Use Each Method
Exact Deduplication:
- Removing identical copies of documents
- Fast initial deduplication pass
- Datasets with numerous exact duplicates
- When speed is more important than detecting near-duplicates
Fuzzy Deduplication:
- Removing near-duplicate documents with minor formatting differences
- Detecting documents with small edits or typos
- Fast deduplication when exact matching misses numerous duplicates
- When speed is important but some near-duplicate detection is needed
Semantic Deduplication:
- Removing semantically similar content (paraphrases, translations)
- Large, diverse web-scale datasets
- When meaning-based deduplication is more important than speed
- Advanced use cases requiring embedding-based similarity detection
Combining Methods
You can combine deduplication methods for comprehensive duplicate removal:
- Exact → Fuzzy → Semantic: Start with fastest methods, then apply more sophisticated methods
- Exact → Semantic: Use exact for quick wins, then semantic for meaning-based duplicates
- Fuzzy → Semantic: Use fuzzy for near-duplicates, then semantic for paraphrases
Run each method independently, then combine duplicate IDs before removal.
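Combining the independent runs amounts to a set union of the duplicate IDs (the ID values below are illustrative); the union also collapses documents flagged by more than one method:

```python
# Duplicate IDs produced by independent runs of each method (illustrative).
exact_dups = {101, 205}
fuzzy_dups = {205, 310}
semantic_dups = {310, 422}

# Union before a single removal pass; 205 and 310 are each counted once.
all_dups = exact_dups | fuzzy_dups | semantic_dups
print(sorted(all_dups))  # [101, 205, 310, 422]
```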
For detailed implementation guides, see Exact Duplicate Removal, Fuzzy Duplicate Removal, and Semantic Deduplication.
Performance Considerations
GPU Acceleration
All deduplication workflows require GPU acceleration:
- Exact: Ray backend with GPU support for MD5 hashing operations
- Fuzzy: Ray backend with GPU support for MinHash computation and LSH operations
- Semantic: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation
GPU acceleration provides significant speedup for large datasets through parallel processing.
Hardware Requirements
- GPU: Required for all workflows (Ray with GPU support for exact/fuzzy, GPU for semantic)
- Memory: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions
- Executors: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support
Backend Setup
For optimal performance with large datasets, configure Ray backend:
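One way to bring up a GPU-enabled Ray cluster is with the Ray CLI; the GPU count and address below are illustrative, and a single-node setup needs only the head command:

```shell
# On the head node (GPU count is illustrative)
ray start --head --num-gpus=8

# On each worker node, pointing at the head node's address
ray start --address=<head-node-ip>:6379 --num-gpus=8
```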
For TB-scale datasets, consider distributed GPU clusters with Ray.
ID Generator for Large-Scale Operations
For large-scale duplicate removal, persist the ID Generator for consistent document tracking:
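The idea behind the persisted state file can be illustrated with a toy stand-in for the ID Generator actor: it hands out monotonically increasing integer IDs and saves its counter to JSON, so a later removal run resumes from the same state and assigns consistent IDs (the class and file name here are illustrative, not NeMo Curator's API):

```python
import json
import os
import tempfile

class MinimalIdGenerator:
    """Toy stand-in for the ID Generator actor: monotonically increasing
    integer IDs with state persisted to a JSON file."""

    def __init__(self, next_id=0):
        self.next_id = next_id

    def assign(self, num_docs):
        """Hand out the next num_docs consecutive IDs."""
        start = self.next_id
        self.next_id += num_docs
        return list(range(start, start + num_docs))

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"next_id": self.next_id}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f)["next_id"])

gen = MinimalIdGenerator()
batch1 = gen.assign(3)  # [0, 1, 2]
state = os.path.join(tempfile.mkdtemp(), "exact_id_generator.json")
gen.save(state)

# A later run resumes from the persisted state instead of restarting at 0.
resumed = MinimalIdGenerator.load(state)
print(resumed.assign(2))  # [3, 4]
```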
The ID Generator ensures consistent IDs across workflow stages.
Next Steps
Ready to use deduplication?
- New to deduplication: Start with Exact Duplicate Removal for the fastest approach
- Need near-duplicate detection: See Fuzzy Duplicate Removal for MinHash-based matching
- Require semantic matching: Explore Semantic Deduplication for meaning-based deduplication
For hands-on guidance: See Text Curation Tutorials for step-by-step examples.