Find and remove near-duplicate documents, such as those that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold at scale on GPUs.
For other approaches, refer to Deduplication.
Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:
Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
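The MinHash and LSH idea can be sketched in plain Python. This is a CPU-only illustration of the general technique, not NeMo Curator's GPU implementation: the helper names (char_ngrams, seeded_hash, minhash_signature, lsh_candidate_pairs), the sample documents, and the short n-gram size of 5 (chosen only because the demo strings are short) are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Split text into its set of overlapping character n-grams."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(token: str, seed: int) -> int:
    """64-bit hash of a token; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles: set[str], num_hashes: int) -> list[int]:
    """For every hash function, keep the minimum hash over all shingles."""
    return [min(seeded_hash(s, seed) for s in shingles) for seed in range(num_hashes)]


def lsh_candidate_pairs(
    signatures: dict[str, list[int]], num_bands: int, minhashes_per_band: int
) -> set[tuple[str, str]]:
    """Documents whose signatures agree on an entire band become candidate pairs."""
    buckets = defaultdict(set)
    for doc_id, signature in signatures.items():
        for band in range(num_bands):
            start = band * minhashes_per_band
            key = (band, tuple(signature[start:start + minhashes_per_band]))
            buckets[key].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


# Hypothetical sample documents: doc_b is doc_a with a one-character edit.
docs = {
    "doc_a": "The quick brown fox jumps over the lazy dog near the river bank.",
    "doc_b": "The quick brown fox jumps over the lazy dog near the river bank!",
    "doc_c": "Completely unrelated text about GPU memory pools and spilling.",
}

NUM_BANDS, MINHASHES_PER_BAND = 20, 13
signatures = {
    doc_id: minhash_signature(char_ngrams(text), NUM_BANDS * MINHASHES_PER_BAND)
    for doc_id, text in docs.items()
}
candidates = lsh_candidate_pairs(signatures, NUM_BANDS, MINHASHES_PER_BAND)
print(candidates)
```

Because doc_a and doc_b share almost all of their n-grams, they are expected to collide in at least one band and surface as a candidate pair, while doc_c should not pair with either.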
Prerequisites:
Running in Docker: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that Ray workers can access the GPU. Without GPU access, you may see CUDARuntimeError or AttributeError: 'CUDARuntimeError' object has no attribute 'msg'. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.
Get started with fuzzy deduplication by first identifying duplicates, then removing them:
Configure fuzzy deduplication using these key parameters:
Control matching strictness with num_bands and minhashes_per_band:
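Increase num_bands or decrease minhashes_per_band to loosen matching (more candidate pairs, higher recall).
Decrease num_bands or increase minhashes_per_band to tighten matching (fewer candidate pairs, higher precision).
The default (num_bands=20, minhashes_per_band=13) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.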
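The effect of these two parameters can be estimated with the standard LSH banding formulas. For a pair with Jaccard similarity s, b bands, and r MinHashes per band, the probability of sharing at least one band is 1 - (1 - s^r)^b, and the similarity at which that curve rises steeply is roughly (1/b)^(1/r). The sketch below applies these textbook formulas; the function names are illustrative, and treating the result as the workflow's effective threshold is an approximation, not a documented guarantee.

```python
def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that two documents agree on at least one full band."""
    return 1.0 - (1.0 - similarity ** minhashes_per_band) ** num_bands


def approximate_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Rough similarity at which the candidate probability rises steeply."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


# Defaults named in this page: num_bands=20, minhashes_per_band=13.
print(f"approximate threshold: {approximate_threshold(20, 13):.3f}")
for s in (0.5, 0.7, 0.8, 0.9):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s, 20, 13):.4f}")

# More bands lower the threshold; more MinHashes per band raise it.
print(f"num_bands=40:          {approximate_threshold(40, 13):.3f}")
print(f"minhashes_per_band=26: {approximate_threshold(20, 26):.3f}")
```

With the defaults, the approximate threshold comes out near 0.79, which matches the intuition that the default settings target documents sharing most of their content.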
After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:
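Conceptually, removal is an anti-join between the dataset and the list of duplicate IDs. The sketch below shows that idea in plain Python; it is not the TextDuplicatesRemovalWorkflow API. The helper remove_duplicates and the sample records are hypothetical, and it assumes the duplicate-ID list contains exactly the documents to drop, with one representative of each duplicate group left out of the list.

```python
DEDUP_ID_FIELD = "_curator_dedup_id"  # ID column name used in the duplicate-ID output


def remove_duplicates(records: list[dict], duplicate_ids: list[int]) -> list[dict]:
    """Keep only records whose ID is not listed as a duplicate (an anti-join)."""
    ids_to_drop = set(duplicate_ids)
    return [record for record in records if record[DEDUP_ID_FIELD] not in ids_to_drop]


# Hypothetical input: records 1 and 2 are near-duplicates of record 0.
records = [
    {DEDUP_ID_FIELD: 0, "text": "NeMo Curator removes near-duplicates."},
    {DEDUP_ID_FIELD: 1, "text": "NeMo Curator removes near duplicates."},
    {DEDUP_ID_FIELD: 2, "text": "NeMo Curator removes near-duplicates!"},
    {DEDUP_ID_FIELD: 3, "text": "An unrelated document."},
]
duplicate_ids = [1, 2]

deduplicated = remove_duplicates(records, duplicate_ids)
print([record[DEDUP_ID_FIELD] for record in deduplicated])
```

The same ID column must be present on the dataset being filtered, which is why the ID assignment used during identification has to be reproducible at removal time.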
When IDs were auto-assigned:
id_generator_path is required.
The fuzzy deduplication process produces the following directory structure:
The workflow produces these output files:
Duplicate IDs (FuzzyDuplicateIds/*.parquet):
Columns: ["_curator_dedup_id"]
ID Generator (fuzzy_id_generator.json):
Cache Files (cache_path/):
Performance characteristics:
bands_per_iteration controls memory usage.
GPU requirements:
Performance tuning:
Adjust bands_per_iteration (lower = less memory, more iterations).
Set lsh_rmm_pool_size to control GPU memory allocation and lsh_spill_memory_limit to tune host-spilling behavior during the LSH stage. Reducing the pool size or lowering the spill threshold can prevent out-of-memory errors on smaller GPUs.
Set lsh_num_output_partitions to control the number of output partitions during the LSH shuffle. More partitions reduce per-partition memory but increase I/O overhead.
Use char_ngrams >= 20 to reduce false positives.
Set input_blocksize="1GiB".
Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.
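To make the bands_per_iteration trade-off concrete, the sketch below splits the bands into groups that would be processed one pass at a time. How NeMo Curator schedules bands internally is not described here, so the contiguous grouping and the helper name band_batches are assumptions made purely for illustration.

```python
def band_batches(num_bands: int, bands_per_iteration: int) -> list[range]:
    """Split band indices into contiguous groups, one group per pass."""
    if bands_per_iteration < 1:
        raise ValueError("bands_per_iteration must be at least 1")
    return [
        range(start, min(start + bands_per_iteration, num_bands))
        for start in range(0, num_bands, bands_per_iteration)
    ]


# Fewer bands per pass means less data held at once, but more passes overall.
for bands_per_iteration in (20, 5, 2):
    batches = band_batches(20, bands_per_iteration)
    print(f"bands_per_iteration={bands_per_iteration}: {len(batches)} passes")
```

Under this model, the peak working set scales with the number of bands in a single pass, while total runtime grows with the number of passes.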
For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview.