Remove duplicate and near-duplicate documents from text datasets using NeMo Curator’s GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.
NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the data processing pipeline.
Each approach is optimized for a different type of duplicate:
Method: MD5 hashing
Detects: Character-for-character identical documents
Speed: Fastest
Computes MD5 hashes for each document’s text content and groups documents with identical hashes. Best for removing exact copies.
For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Exact Duplicate Removal for details.
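The hash-and-group idea described above can be sketched in plain Python. This is only an illustration of the technique using the standard library, not NeMo Curator's GPU implementation or its API; the function name `find_exact_duplicates` and the choice to keep the lowest ID in each group are assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs: dict[int, str]) -> list[int]:
    """Return IDs of documents whose text is identical to another document's.

    Illustrative only: one document per hash group is kept (the lowest ID),
    and every other ID in the group is reported as a duplicate.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)

    duplicate_ids = []
    for ids in groups.values():
        # Keep the lowest ID in each group; mark the rest for removal.
        duplicate_ids.extend(sorted(ids)[1:])
    return sorted(duplicate_ids)


docs = {
    0: "The quick brown fox.",
    1: "A different document.",
    2: "The quick brown fox.",
    3: "The quick brown fox!",  # differs by one character, so not an exact duplicate
}
print(find_exact_duplicates(docs))  # [2]
```

Document 3 survives because a single changed character produces a different hash, which is why near-duplicates need the fuzzy or semantic methods.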
Choose a deduplication method based on your needs:
Identify and remove character-for-character duplicates using MD5 hashing
Identify and remove near-duplicates using MinHash and LSH similarity
Remove semantically similar documents using embeddings
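To make the fuzzy option more concrete, here is a minimal CPU-only sketch of MinHash signatures with LSH banding. It illustrates the general technique, not NeMo Curator's implementation; the character shingle size, the 32 hash functions, the 8 bands, and all function names are assumptions chosen for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def _hash64(value: str, seed: int) -> int:
    """Seeded 64-bit hash; each seed stands in for one independent hash function."""
    salt = seed.to_bytes(4, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_hashes: int = 32) -> list[int]:
    """For each hash function, record the minimum hash over all shingles."""
    return [min(_hash64(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures: dict[int, list[int]], bands: int = 8) -> set[tuple[int, int]]:
    """Documents sharing any identical band of their signature become candidates."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


texts = {
    0: "NeMo Curator removes duplicate documents from text datasets.",
    1: "NeMo Curator removes duplicate documents from text datasets!",
    2: "A completely unrelated sentence about cooking pasta at home.",
}
signatures = {doc_id: minhash_signature(shingles(t)) for doc_id, t in texts.items()}
for a, b in sorted(lsh_candidate_pairs(signatures)):
    print(a, b, estimated_jaccard(signatures[a], signatures[b]))
```

More bands with fewer rows per band make it easier for a pair to become a candidate, which lowers the effective similarity threshold, so the banding parameters trade recall against the number of candidate pairs to verify.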
Duplicate removal workflows require stable document identifiers. Choose one approach:
AddId to add IDs at the start of your pipeline
ID generation (_generate_ids, _assign_ids) backed by the ID Generator actor for stable integer IDs
Some workflows write an ID generator state file (*_id_generator.json) for later removal when IDs are auto-assigned.
Use TextDuplicatesRemovalWorkflow to apply duplicate IDs to your original dataset. Works with IDs from exact, fuzzy, or semantic deduplication.
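Conceptually, removal is an anti-join between the dataset and the list of duplicate IDs. The sketch below shows that operation on in-memory records; it is not the TextDuplicatesRemovalWorkflow API, and the helper name `remove_by_id` and the sample records are hypothetical. Only the column names `id` and `_curator_dedup_id` come from this page.

```python
def remove_by_id(records: list[dict], ids_to_remove: list, id_field: str = "id") -> list[dict]:
    """Keep only the records whose ID does not appear in the removal list."""
    remove = set(ids_to_remove)  # set lookup keeps the filter linear in dataset size
    return [record for record in records if record[id_field] not in remove]


# Existing IDs: the dataset already has an "id" column.
records = [
    {"id": "doc-a", "text": "first"},
    {"id": "doc-b", "text": "first"},
    {"id": "doc-c", "text": "second"},
]
print(remove_by_id(records, ["doc-b"], id_field="id"))

# Auto-assigned IDs: the same filter, keyed on the generated integer column.
records_auto = [
    {"_curator_dedup_id": 0, "text": "first"},
    {"_curator_dedup_id": 1, "text": "first"},
]
print(remove_by_id(records_auto, [1], id_field="_curator_dedup_id"))
```

The filter only works if the removal list and the dataset use the same ID column, which is why the ID field settings have to agree.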
When assign_id=True (IDs auto-assigned):
Duplicate ID files use the _curator_dedup_id column
Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
id_generator_path is required
When assign_id=False (using existing IDs):
Duplicate ID files use the column named by id_field (e.g., "id")
Set ids_to_remove_duplicate_id_field to match your id_field value
id_generator_path is not required
Each deduplication method produces specific output files and directories:
Output Locations
Column names:
_curator_dedup_id when assign_id=True or IDs are auto-assigned
id_field parameter when assign_id=False
Compare deduplication methods to select the best approach for your dataset:
Method Comparison
Use this guide to quickly select the right method:
Exact Deduplication:
Fuzzy Deduplication:
Semantic Deduplication:
You can combine deduplication methods for comprehensive duplicate removal:
Run each method independently, then combine duplicate IDs before removal.
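Combining the results amounts to taking the union of the duplicate ID lists. The sketch below assumes every method was run against the same ID space (for example, the same id_field or the same auto-assigned IDs); the helper name and the sample ID values are hypothetical.

```python
def combine_duplicate_ids(*id_lists: list[int]) -> list[int]:
    """Union the duplicate IDs reported by several methods, without repeats."""
    combined: set[int] = set()
    for ids in id_lists:
        combined.update(ids)
    return sorted(combined)


exact_ids = [2, 7]      # hypothetical output of exact deduplication
fuzzy_ids = [7, 11]     # hypothetical output of fuzzy deduplication
semantic_ids = [3]      # hypothetical output of semantic deduplication

ids_to_remove = combine_duplicate_ids(exact_ids, fuzzy_ids, semantic_ids)
print(ids_to_remove)  # [2, 3, 7, 11]
```

If the methods were run with different ID columns, the lists have to be mapped to a common identifier before taking the union.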
For detailed implementation guides, see:
All deduplication workflows require GPU acceleration:
GPU acceleration provides significant speedup for large datasets through parallel processing.
For optimal performance with large datasets, configure the Ray backend:
For TB-scale datasets, consider distributed GPU clusters with Ray.
For large-scale duplicate removal, persist the ID Generator for consistent document tracking:
The ID Generator ensures consistent IDs across workflow stages.
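Why persisted state keeps IDs consistent can be shown with a small stand-in. The class below is hypothetical: it is not NeMo Curator's ID Generator actor, and its JSON layout is an assumption rather than the format of the *_id_generator.json file. It only demonstrates that saving the batch-to-ID-range mapping lets a later stage recover the same IDs.

```python
import json
import os
import tempfile


class SimpleIdGenerator:
    """Hypothetical stand-in: assigns contiguous integer ID ranges per batch key."""

    def __init__(self) -> None:
        self.next_id = 0
        self.ranges: dict[str, list[int]] = {}  # batch key -> [first_id, last_id]

    def assign(self, batch_key: str, num_docs: int) -> range:
        """Return the ID range for a batch, allocating it on first sight."""
        if batch_key not in self.ranges:
            self.ranges[batch_key] = [self.next_id, self.next_id + num_docs - 1]
            self.next_id += num_docs
        first, last = self.ranges[batch_key]
        return range(first, last + 1)

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"next_id": self.next_id, "ranges": self.ranges}, f)

    @classmethod
    def load(cls, path: str) -> "SimpleIdGenerator":
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        generator = cls()
        generator.next_id = state["next_id"]
        generator.ranges = state["ranges"]
        return generator


generator = SimpleIdGenerator()
generator.assign("part-0.parquet", 3)  # IDs 0, 1, 2
generator.assign("part-1.parquet", 2)  # IDs 3, 4

with tempfile.TemporaryDirectory() as tmp:
    state_path = os.path.join(tmp, "example_id_generator.json")  # hypothetical file name
    generator.save(state_path)
    restored = SimpleIdGenerator.load(state_path)

# A later stage that reloads the state sees the same IDs for the same batch.
print(list(restored.assign("part-1.parquet", 2)))  # [3, 4]
```

Without the saved state, a fresh generator could hand out different IDs for the same files, and the duplicate IDs found during identification would no longer point at the right documents during removal.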
Ready to use deduplication?
For hands-on guidance: See Text Curation Tutorials for step-by-step examples.