Find and remove near-duplicate documents that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold at scale on GPU.
For other approaches, refer to Deduplication.
Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:
Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
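As a conceptual illustration of the MinHash step, the following pure-Python sketch estimates the Jaccard similarity of character n-gram sets from fixed-length signatures. It is not the library's GPU implementation: the function names, the BLAKE2b-based hash family, and the small n-gram size of 5 (chosen to suit the short sample strings) are assumptions made for this example.

```python
import hashlib


def character_ngrams(text: str, ngram_size: int = 5) -> set[str]:
    """Return the set of overlapping character n-grams in text."""
    if len(text) < ngram_size:
        return {text}
    return {text[i:i + ngram_size] for i in range(len(text) - ngram_size + 1)}


def hash_ngram(ngram: str, seed: int) -> int:
    """Hash one n-gram with a seeded 64-bit BLAKE2b digest (illustrative hash family)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 128, ngram_size: int = 5) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all n-grams."""
    grams = character_ngrams(text, ngram_size)
    return [min(hash_ngram(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "The quick brown fox jumps over the lazy dog near the river bank."
doc_b = "The quick brown fox jumps over the lazy dog near the river bank!"
doc_c = "Completely unrelated text about GPU memory and parquet files."

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
sig_c = minhash_signature(doc_c)

print("near-duplicate estimate:", estimated_jaccard(sig_a, sig_b))
print("unrelated estimate:", estimated_jaccard(sig_a, sig_c))
```

Documents that differ by a single character share almost all of their n-grams, so their signatures agree in most positions, while unrelated documents agree in almost none.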
Prerequisites:
Get started with fuzzy deduplication using the following example, which identifies duplicates and then removes them:
Configure fuzzy deduplication using these key parameters:
Key Configuration Parameters
Control matching strictness with num_bands and minhashes_per_band:
- Less strict matching (more candidate pairs): increase num_bands or decrease minhashes_per_band
- More strict matching (fewer candidate pairs): decrease num_bands or increase minhashes_per_band

The default (num_bands=20, minhashes_per_band=13) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.
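The effect of these two parameters can be illustrated with the standard LSH banding analysis: a pair with Jaccard similarity s becomes a candidate if at least one band matches in full, which happens with probability 1 - (1 - s^r)^b for b bands of r hashes each. The sketch below assumes this textbook model applies; the function names are illustrative and are not part of the library.

```python
def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that a pair with the given Jaccard similarity shares at least one band."""
    return 1.0 - (1.0 - similarity ** minhashes_per_band) ** num_bands


def approximate_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Rule-of-thumb similarity where the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


# Default configuration from the text: num_bands=20, minhashes_per_band=13.
print("approximate threshold:", round(approximate_threshold(20, 13), 3))
for s in (0.6, 0.7, 0.8, 0.9):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s, 20, 13):.3f}")
```

Under this model, adding bands raises the candidate probability at any given similarity (looser matching), while adding hashes per band lowers it (stricter matching).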
After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:
When IDs were auto-assigned:
- id_generator_path is required

The fuzzy deduplication process produces the following directory structure:
The workflow produces these output files:
Duplicate IDs (FuzzyDuplicateIds/*.parquet):
- ["_curator_dedup_id"]

ID Generator (fuzzy_id_generator.json):
Cache Files (cache_path/):
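To show how a list of duplicate IDs relates to the removal step, here is a minimal in-memory sketch. It is not how TextDuplicatesRemovalWorkflow is implemented; it assumes that each document carries its ID in a "_curator_dedup_id" field and that the duplicate ID list contains only the documents to drop. The helper name and the sample records are hypothetical.

```python
def remove_duplicates(documents: list[dict], duplicate_ids: list[int],
                      id_field: str = "_curator_dedup_id") -> list[dict]:
    """Keep only the documents whose ID is not in the duplicate ID list."""
    ids_to_drop = set(duplicate_ids)  # set lookup keeps the filter linear in the input size
    return [doc for doc in documents if doc[id_field] not in ids_to_drop]


# Hypothetical sample data: documents 1 and 2 are near-duplicates of document 0.
documents = [
    {"_curator_dedup_id": 0, "text": "Hello world."},
    {"_curator_dedup_id": 1, "text": "Hello world!"},
    {"_curator_dedup_id": 2, "text": "Hello  world."},
    {"_curator_dedup_id": 3, "text": "Something else entirely."},
]
duplicate_ids = [1, 2]

kept = remove_duplicates(documents, duplicate_ids)
print([doc["_curator_dedup_id"] for doc in kept])
```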
Performance characteristics:
- bands_per_iteration controls memory usage

GPU requirements:
Performance tuning:
- bands_per_iteration (lower = less memory, more iterations)
- char_ngrams >= 20 to reduce false positives
- input_blocksize="1GiB"

Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.
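The memory trade-off behind bands_per_iteration can be sketched with a toy LSH bucketing loop that processes only a few bands at a time. This is a conceptual, single-process illustration and not the library's GPU code path; the function name and the toy signatures are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict[str, list[int]], num_bands: int,
                        minhashes_per_band: int, bands_per_iteration: int) -> set[tuple[str, str]]:
    """Bucket documents band by band, holding only one batch of bands in memory at a time."""
    candidates = set()
    for start in range(0, num_bands, bands_per_iteration):
        buckets = defaultdict(list)  # rebuilt per batch, so fewer bands means smaller buckets
        for band in range(start, min(start + bands_per_iteration, num_bands)):
            lo = band * minhashes_per_band
            hi = lo + minhashes_per_band
            for doc_id, signature in signatures.items():
                buckets[(band, tuple(signature[lo:hi]))].append(doc_id)
        # Documents that share a bucket in any band become candidate pairs.
        for doc_ids in buckets.values():
            candidates.update(combinations(sorted(doc_ids), 2))
    return candidates


# Toy signatures: 3 bands of 2 hashes each. "a" and "b" agree on bands 0 and 2.
signatures = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 9, 9, 5, 6],
    "c": [7, 7, 7, 7, 7, 7],
}

one_band_at_a_time = lsh_candidate_pairs(signatures, 3, 2, bands_per_iteration=1)
all_bands_at_once = lsh_candidate_pairs(signatures, 3, 2, bands_per_iteration=3)
print(one_band_at_a_time, all_bands_at_once)
```

In this sketch the candidate pairs are the same for any batch size; only the number of passes and the peak size of the bucket table change.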
For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview.