Remove character-for-character duplicate documents using NeMo Curator’s exact duplicate removal workflow. This method computes MD5 hashes for each document’s text and identifies documents with identical hashes as duplicates.
For an overview of all duplicate removal options, refer to Deduplication.
Exact deduplication uses MD5 hashing to identify identical documents:
This method targets character-for-character duplicates and is recommended for removing exact copies of documents.
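The hashing idea can be sketched in a few lines of plain Python. This is a minimal illustration, not NeMo Curator's implementation: the function name find_exact_duplicates, the "text" and "id" field names, and the choice to keep the first document in each group are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs, text_field="text", id_field="id"):
    """Return IDs of documents whose text hashes match an earlier document.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    groups = defaultdict(list)
    for doc in docs:
        # Identical text always yields an identical MD5 digest.
        digest = hashlib.md5(doc[text_field].encode("utf-8")).hexdigest()
        groups[digest].append(doc[id_field])

    duplicate_ids = []
    for ids in groups.values():
        # Assumption: keep the first document seen, flag the rest.
        duplicate_ids.extend(ids[1:])
    return duplicate_ids


docs = [
    {"id": 0, "text": "hello world"},
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "Hello world"},  # differs by one character, so it is kept
    {"id": 3, "text": "hello world"},
]
duplicate_ids = find_exact_duplicates(docs)
print(duplicate_ids)
```

Because the comparison is on hashes of the raw text, a single differing character (such as the capital "H" above) is enough for two documents to be treated as distinct.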
Prerequisites:
Get started with exact deduplication using the following example, which first identifies duplicates and then removes them:
Configure exact deduplication using these key parameters:
After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:
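Conceptually, the removal step is an anti-join between the dataset and the list of duplicate IDs. The sketch below shows that idea in plain Python, under stated assumptions: remove_duplicates is a hypothetical helper, not the TextDuplicatesRemovalWorkflow API, and the documents are simple in-memory dictionaries.

```python
def remove_duplicates(docs, duplicate_ids, id_field="id"):
    """Drop every document whose ID appears in duplicate_ids (an anti-join).

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    ids_to_remove = set(duplicate_ids)  # set gives fast membership checks
    return [doc for doc in docs if doc[id_field] not in ids_to_remove]


docs = [
    {"id": 0, "text": "hello world"},
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "Hello world"},
    {"id": 3, "text": "hello world"},
]
duplicate_ids = [1, 3]  # as produced by an identification step

kept = remove_duplicates(docs, duplicate_ids)
print([doc["id"] for doc in kept])
```

The id_field argument here plays the same role as matching the ID column between the dataset and the duplicate IDs file: removal only works if both sides refer to the same identifier.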
When assign_id=True:
- The duplicate IDs file contains a _curator_dedup_id column
- Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
- id_generator_path is required
When assign_id=False:
- The duplicate IDs file uses your existing id_field (e.g., "id")
- Set ids_to_remove_duplicate_id_field to match your id_field value
- id_generator_path is not required
The exact deduplication process produces the following directory structure:
The workflow produces these output files:
Duplicate IDs (ExactDuplicateIds/*.parquet):
- The ID column name depends on assign_id:
  - assign_id=True: Column is "_curator_dedup_id"
  - assign_id=False: Column matches the id_field parameter (e.g., "id")
ID Generator (exact_id_generator.json):
- Applies when assign_id=True
Performance characteristics:
- Duplicate IDs are written to the ExactDuplicateIds/ directory
Best practices:
- Use smaller input_blocksize values (256MiB to 512MiB) with a larger identification_batchsize to target 2-6 GB per batch overall, as memory allows. For example, input_blocksize="256MiB" with identification_batchsize=8 processes ~2 GB per insertion call. This improves both shuffle throughput and removal performance compared to the 2GiB default.
- Use assign_id=True for consistent ID tracking
- Set total_nparts to smaller values (256 or 512); on very large runs this often gives better shuffle performance than the defaults
- Set rmm_pool_size and spill_memory_limit explicitly to match your hardware and dataset size
For comparison with other deduplication methods and guidance on when to use exact deduplication, refer to the Deduplication overview.
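The batch-size guidance under Best practices comes down to simple arithmetic. The sketch below assumes that the data handled per insertion call is roughly input_blocksize multiplied by identification_batchsize, which matches the 256MiB × 8 ≈ 2 GB example; the helper names batch_bytes and batchsize_range are hypothetical and not part of NeMo Curator.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def batch_bytes(input_blocksize_mib, identification_batchsize):
    """Approximate bytes per insertion call (assumed: blocksize x batchsize)."""
    return input_blocksize_mib * MIB * identification_batchsize


def batchsize_range(input_blocksize_mib, low_gib=2, high_gib=6):
    """Return the (min, max) batch sizes that land in the target size window."""
    block = input_blocksize_mib * MIB
    smallest = math.ceil(low_gib * GIB / block)
    largest = math.floor(high_gib * GIB / block)
    return smallest, largest


print(batch_bytes(256, 8) / GIB)  # size in GiB for the documented example
print(batchsize_range(256))       # candidate batch sizes for 256MiB blocks
print(batchsize_range(512))       # candidate batch sizes for 512MiB blocks
```

Treat the resulting range as a starting point for tuning; the right value still depends on available memory and the shape of your data.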