Fuzzy Duplicate Removal#
Find and remove near-duplicate documents that differ only by small edits or reformatting using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold at scale on GPUs.
For other approaches, refer to Deduplication.
How It Works#
Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:
Computes MinHash signatures over character n-grams
Uses Locality Sensitive Hashing (LSH) to find candidate matches
Builds a graph of duplicate relationships
Identifies groups of near-duplicate documents
Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
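The following is a minimal, CPU-only sketch of the banding idea, intended only to illustrate how MinHash signatures and LSH bands yield candidate pairs. The helper names (char_ngrams, minhash_signature, lsh_candidates) are illustrative and not part of the NeMo Curator API; the actual workflow performs these steps as GPU-accelerated pipeline stages.
import hashlib
from itertools import combinations

def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Break a document into overlapping character n-grams (shingles)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles: set[str], num_hashes: int = 12) -> list[int]:
    """Take the minimum of each seeded hash to approximate random permutations."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        ))
    return signature

def lsh_candidates(docs: dict[str, str], num_bands: int = 4, rows: int = 3) -> set[tuple[str, str]]:
    """Documents whose signatures agree on every row of at least one band become candidates."""
    buckets: dict[tuple, list[str]] = {}
    for doc_id, text in docs.items():
        sig = minhash_signature(char_ngrams(text), num_hashes=num_bands * rows)
        for band in range(num_bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates

docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog!",  # near duplicate of "a"
    "c": "A completely different sentence about GPUs.",
}
print(lsh_candidates(docs))  # typically {("a", "b")}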
Before You Start#
Prerequisites:
Ray cluster with GPU support (required for distributed processing)
Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)
Note
Running in Docker: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that Ray workers can access the GPU. Without GPU access, you may see CUDARuntimeError or AttributeError: 'CUDARuntimeError' object has no attribute 'msg'. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.
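Before launching a workflow, you can verify that Ray workers can actually see a GPU by querying the cluster resources with Ray's public API. This sketch assumes ray is installed in the active environment and uses plain ray rather than the RayClient helper:
import ray

# Connect to an existing cluster if one is running; otherwise start a local one.
ray.init(ignore_reinit_error=True)

resources = ray.cluster_resources()
print(resources)  # for example: {'CPU': 16.0, 'GPU': 1.0, ...}

if resources.get("GPU", 0) < 1:
    raise RuntimeError("No GPUs visible to Ray; fuzzy deduplication requires GPU workers.")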
Quick Start#
Get started with fuzzy deduplication using the following example, which first identifies duplicates and then removes them:
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
fuzzy_workflow.run()
# Duplicate IDs saved to ./results/FuzzyDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json",
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/
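Between the two steps you can inspect how many documents were flagged. The sketch below assumes pandas (with the pyarrow engine) is available; the FuzzyDuplicateIds directory contains ordinary Parquet files with a single _curator_dedup_id column, as described under File Formats:
import pandas as pd

# Load all Parquet files in the duplicate-IDs directory.
duplicate_ids = pd.read_parquet("./results/FuzzyDuplicateIds")
print(f"Documents flagged for removal: {len(duplicate_ids)}")
print(duplicate_ids.head())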
Configuration#
Configure fuzzy deduplication using these key parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_path` | str \| list[str] | None | Path(s) to input files or directories |
| `cache_path` | str | Required | Directory to cache intermediate results |
| `output_path` | str | Required | Directory to write duplicate IDs and ID generator |
| `text_field` | str | "text" | Name of the text field in input data |
| `char_ngrams` | int | 24 | Character n-gram size for MinHash (recommended: >= 20) |
| `num_bands` | int | 20 | Number of LSH bands (affects similarity threshold) |
| `minhashes_per_band` | int | 13 | Number of hashes per LSH band |
| `bands_per_iteration` | int | 5 | Bands processed per iteration (memory tuning) |
| `use_64_bit_hash` | bool | False | Use 64-bit hash (more memory, fewer collisions) |
| `seed` | int | 42 | Random seed for MinHash permutations |
| `input_filetype` | str | "parquet" | Input file format ("parquet" or "jsonl") |
| `input_blocksize` | str \| int | "1GiB" | Size of input blocks for processing |
| `perform_removal` | bool | False | Reserved; must remain False (removal is handled by the separate removal workflow) |
Similarity Threshold#
Control matching strictness with num_bands and minhashes_per_band:
More strict matching (higher required similarity): Decrease num_bands or increase minhashes_per_band
Less strict matching (lower required similarity): Increase num_bands or decrease minhashes_per_band
Default (num_bands=20, minhashes_per_band=13) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.
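Under standard MinHash LSH banding theory, the similarity at which a pair becomes likely to be detected is approximately (1/num_bands)^(1/minhashes_per_band). The sketch below computes this textbook estimate so you can sanity-check parameter choices; it is an approximation, not a value reported by the workflow:
def approximate_lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Approximate Jaccard similarity at which pairs become candidates: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

print(approximate_lsh_threshold(20, 13))  # defaults: roughly 0.79
print(approximate_lsh_threshold(15, 15))  # fewer bands, more hashes per band: roughly 0.83 (stricter)
print(approximate_lsh_threshold(25, 10))  # more bands, fewer hashes per band: roughly 0.72 (less strict)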
Custom Similarity Threshold
# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... input_path, cache_path, output_path as in the Quick Start ...
    num_bands=15,           # Fewer bands = stricter matching
    minhashes_per_band=15,  # More hashes per band = stricter matching
)

# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... input_path, cache_path, output_path as in the Quick Start ...
    num_bands=25,           # More bands = less strict matching
    minhashes_per_band=10,  # Fewer hashes per band = less strict matching
)
Removing Duplicates#
After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",  # Required if IDs were auto-assigned
)
removal_workflow.run()
ID Field Configuration
When IDs were auto-assigned:
id_generator_path is required
Ensures consistent ID mapping between identification and removal stages
Output Format#
The fuzzy deduplication process produces the following directory structure:
cache_path/
├── MinHashStage/ # MinHash signatures
│ └── *.parquet
├── LSHStage/ # LSH buckets
│ └── *.parquet
├── BucketsToEdges/ # Graph edges
│ └── *.parquet
└── ConnectedComponents/ # Connected components
└── *.parquet
output_path/
├── FuzzyDuplicateIds/ # Duplicate identification results
│ └── *.parquet # Parquet files with document IDs to remove
└── fuzzy_id_generator.json # ID generator mapping (if IDs were auto-assigned)
File Formats#
The workflow produces these output files:
Duplicate IDs (FuzzyDuplicateIds/*.parquet):
Contains document IDs to remove
Format: Parquet files with column ["_curator_dedup_id"]
Important: Contains only the IDs of documents to remove, not the full document content
ID Generator (fuzzy_id_generator.json):
JSON file containing ID generator state
Required for removal workflow when IDs were auto-assigned
Ensures consistent ID mapping across workflow stages
Cache Files (cache_path/):
Intermediate results for debugging and analysis
Can be reused if re-running with different parameters
Clear cache between runs if parameters change significantly
Performance Considerations
Performance characteristics:
GPU-accelerated MinHash and LSH operations
Scales across multiple GPUs and nodes using Ray
bands_per_iteration controls memory usage
Intermediate results are cached for efficiency
GPU requirements:
NVIDIA GPU with CUDA support
Ray cluster with GPU workers
Performance tuning:
Memory: Adjust bands_per_iteration (lower = less memory, more iterations)
Accuracy: Use char_ngrams >= 20 to reduce false positives
Best practices: Clear cache between runs, use input_blocksize="1GiB"
Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.
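As a concrete illustration of these tuning knobs, a memory-constrained run might lower bands_per_iteration and use smaller input blocks. The values below are hypothetical starting points rather than recommendations, and the paths mirror the Quick Start:
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Hypothetical memory-tuned configuration (illustrative values only)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    input_filetype="parquet",
    char_ngrams=24,            # keep >= 20 to reduce false positives
    bands_per_iteration=2,     # fewer bands per pass = lower peak memory, more iterations
    input_blocksize="512MiB",  # smaller blocks for constrained GPU memory
)
fuzzy_workflow.run()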
For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview.