Fuzzy Duplicate Removal
Find and remove near-duplicate documents with small edits or reformatting using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold at scale on GPUs.
For other approaches, refer to Deduplication.
How It Works
Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:
- Computes MinHash signatures over character n-grams
- Uses Locality Sensitive Hashing (LSH) to find candidate matches
- Builds a graph of duplicate relationships
- Identifies groups of near-duplicate documents
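The pipeline above can be illustrated with a small pure-Python sketch. This is conceptual only: the actual workflow runs these operations GPU-accelerated on a Ray cluster, and the hash scheme and helper names here are illustrative, not the library's implementation.

```python
from hashlib import md5

def char_ngrams(text: str, n: int = 5) -> set:
    """Character n-grams -- the shingles MinHash signatures are computed over."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text: str, num_perm: int = 260, n: int = 5) -> list:
    """One min-hash per seeded hash function (260 = 20 bands x 13 rows,
    matching the defaults discussed under Configuration)."""
    grams = char_ngrams(text, n)
    return [min(int(md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
            for seed in range(num_perm)]

def lsh_candidate_pairs(docs: dict, num_bands: int = 20, rows_per_band: int = 13) -> set:
    """Documents whose signatures agree on all rows of any band land in the
    same bucket and become candidate duplicate pairs."""
    sigs = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
    buckets, pairs = {}, set()
    for doc_id, sig in sigs.items():
        for band in range(num_bands):
            key = (band, tuple(sig[band * rows_per_band:(band + 1) * rows_per_band]))
            for other in buckets.setdefault(key, []):
                pairs.add(tuple(sorted((other, doc_id))))
            buckets[key].append(doc_id)
    return pairs

docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog!",   # one-character edit
    "c": "Completely different text about GPUs and Ray clusters.",
}
pairs = lsh_candidate_pairs(docs)
```

With these inputs, the near-identical pair ("a", "b") collides in at least one band, while the unrelated document "c" does not pair with either. Grouping the resulting pairs into connected components yields the duplicate groups.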
Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
Before You Start
Prerequisites:
- Ray cluster with GPU support (required for distributed processing)
- Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)
Running in Docker: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that Ray workers can access the GPU. Without GPU access, you may see CUDARuntimeError or AttributeError: 'CUDARuntimeError' object has no attribute 'msg'. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.
Quick Start
Get started with fuzzy deduplication by first identifying duplicate documents and then removing them.
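As a sketch of the identification step, assuming a `FuzzyDeduplicationWorkflow` class (the class name and import path are assumptions, not a confirmed API; `num_bands`, `minhashes_per_band`, and `cache_path` are the parameters discussed on this page). The removal step is covered under Removing Duplicates.

```python
# Hypothetical sketch: class name and import path are assumptions,
# not a confirmed NeMo Curator API; parameter names follow this page.
from nemo_curator.stages.deduplication.fuzzy import FuzzyDeduplicationWorkflow  # assumed path

workflow = FuzzyDeduplicationWorkflow(
    input_path="/data/input",          # JSONL/Parquet documents
    output_path="/data/fuzzy_output",  # receives FuzzyDuplicateIds/ and the ID generator
    cache_path="/data/fuzzy_cache",    # intermediate MinHash/LSH results
    text_field="text",
    num_bands=20,                      # defaults discussed under Configuration
    minhashes_per_band=13,
)
workflow.run()  # executes on the Ray cluster with GPU workers
```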
Configuration
Configure fuzzy deduplication using these key parameters:
Similarity Threshold
Control matching strictness with num_bands and minhashes_per_band:
- More strict matching (higher effective similarity threshold): Decrease `num_bands` or increase `minhashes_per_band`
- Less strict matching (lower effective similarity threshold): Increase `num_bands` or decrease `minhashes_per_band`
Default (num_bands=20, minhashes_per_band=13) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.
Custom Similarity Threshold
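For a banded scheme with `b` bands of `r` minhashes each, standard LSH analysis puts the midpoint of the detection S-curve at roughly `(1/b)^(1/r)`. A small sketch for estimating the effective threshold when choosing custom values (the formula is generic LSH analysis, not specific to this workflow):

```python
def approx_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Approximate Jaccard similarity at which a pair becomes more likely
    than not to surface as a candidate: t ~ (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

# Defaults from this page: 20 bands x 13 minhashes.
print(round(approx_threshold(20, 13), 2))  # -> 0.79

# Stricter: fewer bands and more minhashes per band raise the threshold.
print(round(approx_threshold(10, 20), 2))  # -> 0.89
```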
Removing Duplicates
After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:
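A sketch of the removal step (only `TextDuplicatesRemovalWorkflow` and `id_generator_path` are named on this page; the import path and the other parameter names are assumptions, not a confirmed API):

```python
# Hypothetical sketch: import path and parameter names other than
# id_generator_path are assumptions, not a confirmed NeMo Curator API.
from nemo_curator.stages.deduplication import TextDuplicatesRemovalWorkflow  # assumed path

removal = TextDuplicatesRemovalWorkflow(
    input_path="/data/input",
    ids_to_remove_path="/data/fuzzy_output/FuzzyDuplicateIds",  # parquet of _curator_dedup_id values
    output_path="/data/deduped",
    id_generator_path="/data/fuzzy_output/fuzzy_id_generator.json",  # required if IDs were auto-assigned
)
removal.run()
```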
ID Field Configuration
When IDs were auto-assigned:
- `id_generator_path` is required
- Ensures consistent ID mapping between identification and removal stages
Output Format
The fuzzy deduplication process produces the following directory structure:
File Formats
The workflow produces these output files:
- Duplicate IDs (`FuzzyDuplicateIds/*.parquet`):
  - Contains document IDs to remove
  - Format: Parquet files with column: `["_curator_dedup_id"]`
  - Important: Contains only the IDs of documents to remove, not the full document content
- ID Generator (`fuzzy_id_generator.json`):
  - JSON file containing ID generator state
  - Required for the removal workflow when IDs were auto-assigned
  - Ensures consistent ID mapping across workflow stages
- Cache Files (`cache_path/`):
  - Intermediate results for debugging and analysis
  - Can be reused if re-running with different parameters
  - Clear cache between runs if parameters change significantly
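The duplicate-ID output can also be inspected or applied outside the removal workflow by reading the parquet files directly. A sketch using pandas (only the `_curator_dedup_id` column name comes from this page; the helper function and paths are illustrative):

```python
import pandas as pd

def drop_listed_duplicates(docs: pd.DataFrame, dup_ids,
                           id_field: str = "_curator_dedup_id") -> pd.DataFrame:
    """Keep only documents whose ID is not in the duplicate-ID list."""
    return docs[~docs[id_field].isin(set(dup_ids))]

# In practice the IDs come from the workflow output, e.g.:
#   dup_ids = pd.read_parquet("FuzzyDuplicateIds/")["_curator_dedup_id"]
docs = pd.DataFrame({"_curator_dedup_id": [0, 1, 2, 3],
                     "text": ["a", "a!", "b", "c"]})
dup_ids = [1]                                       # IDs flagged for removal
print(len(drop_listed_duplicates(docs, dup_ids)))   # -> 3
```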
Performance Considerations
Performance characteristics:
- GPU-accelerated MinHash and LSH operations
- Scales across multiple GPUs and nodes using Ray
- `bands_per_iteration` controls memory usage
- Intermediate results are cached for efficiency
GPU requirements:
- NVIDIA GPU with CUDA support
- Ray cluster with GPU workers
Performance tuning:
- Memory: Adjust `bands_per_iteration` (lower = less memory, more iterations)
- Accuracy: Use `char_ngrams >= 20` to reduce false positives
- Best practices: Clear cache between runs, use `input_blocksize="1GiB"`
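The memory/iteration trade-off behind `bands_per_iteration` can be pictured as splitting the band index range into chunks that are processed one pass at a time (a conceptual sketch, not the library's implementation):

```python
def band_iterations(num_bands: int, bands_per_iteration: int):
    """Yield the band ranges handled per pass: fewer bands per pass means
    lower peak memory but more passes over the signatures."""
    for start in range(0, num_bands, bands_per_iteration):
        yield range(start, min(start + bands_per_iteration, num_bands))

# num_bands=20: 5 bands per pass -> 4 passes; all 20 at once -> 1 pass.
print(len(list(band_iterations(20, 5))))   # -> 4
print(len(list(band_iterations(20, 20))))  # -> 1
```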
Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.
For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview.