Exact Duplicate Removal
Remove character-for-character duplicate documents using NeMo Curator’s exact duplicate removal workflow. This method computes MD5 hashes for each document’s text and identifies documents with identical hashes as duplicates.
For an overview of all duplicate removal options, refer to Deduplication.
How It Works
Exact deduplication uses MD5 hashing to identify identical documents:
- Computes MD5 hash for each document’s text content
- Groups documents by identical hash values
- Identifies duplicates and saves IDs for removal
This method targets character-for-character duplicates and is recommended for removing exact copies of documents.
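The three steps above can be sketched in plain Python. This is a conceptual illustration only, not NeMo Curator's implementation: the function name, the in-memory list of dicts, the "id" and "text" field names, and the choice to keep the first document in each group are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs, id_field="id", text_field="text"):
    """Return the IDs of documents whose text exactly matches an earlier document."""
    groups = defaultdict(list)
    for doc in docs:
        # Hash the UTF-8 bytes of the text; identical text always yields an identical digest.
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        groups[digest].append(doc[id_field])

    ids_to_remove = []
    for ids in groups.values():
        # Assumption for this sketch: keep the first document, mark the rest for removal.
        ids_to_remove.extend(ids[1:])
    return ids_to_remove


docs = [
    {"id": "a", "text": "hello world"},
    {"id": "b", "text": "hello world"},
    {"id": "c", "text": "Hello world"},  # differs by one character, so not a duplicate
]
print(find_exact_duplicates(docs))  # ['b']
```

Document "c" survives because exact matching is case-sensitive and whitespace-sensitive: any single-character difference produces a different hash.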
Before You Start
Prerequisites:
- Ray cluster with GPU support (required for distributed processing)
- Stable document identifiers for removal (either existing IDs or IDs assigned by the workflow)
Quick Start
Get started with exact deduplication by first identifying duplicates, then removing them:
Configuration
Configure exact deduplication using these key parameters:
Removing Duplicates
After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:
ID Field Configuration
When assign_id=True:
- Duplicate IDs file contains the _curator_dedup_id column
- Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
- id_generator_path is required
When assign_id=False:
- Duplicate IDs file contains the column specified by id_field (e.g., "id")
- Set ids_to_remove_duplicate_id_field to match your id_field value
- id_generator_path is not required
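The two cases above can be captured in a small helper that returns the matching settings as a plain dict. The helper name removal_id_settings is hypothetical and is not part of NeMo Curator; only the parameter names and the "_curator_dedup_id" value come from the rules above, and how the resulting values are passed to the removal workflow is left to the workflow's own documentation.

```python
def removal_id_settings(assign_id, id_field="id", id_generator_path=None):
    """Build the ID-related removal settings described above as a plain dict.

    Hypothetical helper for illustration; it does not call NeMo Curator.
    """
    if assign_id:
        # Workflow-assigned IDs need the saved ID generator state to map back to documents.
        if id_generator_path is None:
            raise ValueError("id_generator_path is required when assign_id=True")
        return {
            "ids_to_remove_duplicate_id_field": "_curator_dedup_id",
            "id_generator_path": id_generator_path,
        }
    # With existing IDs, the duplicate IDs file uses the same column name as id_field.
    return {"ids_to_remove_duplicate_id_field": id_field}


print(removal_id_settings(False, id_field="doc_id"))
```

Raising early when id_generator_path is missing is a design choice for the sketch: it surfaces the misconfiguration before any removal job starts.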
Output Format
The exact deduplication process produces the following directory structure:
File Formats
The workflow produces these output files:
- Duplicate IDs (ExactDuplicateIds/*.parquet):
  - Contains document IDs to remove
  - Format: Parquet files with a single ID column
  - Column name depends on assign_id:
    - When assign_id=True: Column is "_curator_dedup_id"
    - When assign_id=False: Column matches the id_field parameter (e.g., "id")
  - Important: Contains only the IDs of documents to remove, not the full document content
- ID Generator (exact_id_generator.json):
  - JSON file containing ID generator state
  - Required for removal workflow when assign_id=True
  - Ensures consistent ID mapping across workflow stages
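Because the duplicate IDs file holds only IDs, removal amounts to filtering the original dataset against that ID list. The sketch below shows the idea on in-memory records; it is not the TextDuplicatesRemovalWorkflow, and the function name, record layout, and "id" field are assumptions for illustration.

```python
def drop_duplicates(docs, ids_to_remove, id_field="id"):
    """Keep only the documents whose ID is not listed for removal."""
    remove = set(ids_to_remove)  # set membership keeps the filter linear in len(docs)
    return [doc for doc in docs if doc[id_field] not in remove]


docs = [
    {"id": "a", "text": "hello world"},
    {"id": "b", "text": "hello world"},
    {"id": "c", "text": "Hello world"},
]
kept = drop_duplicates(docs, ["b"])
print([doc["id"] for doc in kept])  # ['a', 'c']
```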
Performance Considerations
Performance characteristics:
- Uses MD5 hashing over the configured text field to derive duplicate groups
- Runs as a Ray-based workflow and writes duplicate IDs to the ExactDuplicateIds/ directory
- Stores only document IDs to remove in the output files, not full document content
Best practices:
- Use input_blocksize="2GiB" for optimal performance
- Clear output directory between runs
- Use assign_id=True for consistent ID tracking
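One way to clear the output directory between runs is a small helper like the following. It is a sketch that assumes the output directory lives on a local or mounted filesystem; remote object storage would need a different tool. The function name is hypothetical.

```python
import shutil
from pathlib import Path


def reset_output_dir(path):
    """Delete the output directory if it exists, then recreate it empty."""
    out = Path(path)
    if out.exists():
        # Removes stale results from a previous run, including any subdirectories.
        shutil.rmtree(out)
    out.mkdir(parents=True)
    return out
```

Call it with your own output path before starting a new identification run, and double-check the path first, since the deletion is recursive and not reversible.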
For comparison with other deduplication methods and guidance on when to use exact deduplication, refer to the Deduplication overview.