Fuzzy Duplicate Removal#
Find and remove near-duplicate documents that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above an approximate similarity threshold at scale on GPUs.
For other approaches, refer to Deduplication.
How It Works#
Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:
Computes MinHash signatures over character n-grams
Uses Locality Sensitive Hashing (LSH) to find candidate matches
Builds a graph of duplicate relationships
Identifies groups of near-duplicate documents
Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
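For intuition, the following is a minimal, CPU-only sketch of the MinHash-plus-LSH idea. The helper names (char_ngrams, minhash_signature, lsh_band_keys), the example texts, and the blake2b hashing are illustrative only and are not NeMo Curator APIs; the actual workflow runs GPU-accelerated implementations of these steps at scale:
# Minimal CPU-only illustration of MinHash + LSH banding (not the NeMo Curator API).
import hashlib

def char_ngrams(text: str, n: int = 24) -> set[str]:
    # Set of overlapping character n-grams; near-duplicate texts share most of them.
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def hash64(value: str, seed: int) -> int:
    # Salted 64-bit hash used to simulate independent hash functions.
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(text: str, num_hashes: int = 260, n: int = 24) -> list[int]:
    # The minimum salted hash over the n-gram set approximates Jaccard similarity:
    # two signatures agree at a position with probability equal to the Jaccard score.
    grams = char_ngrams(text, n)
    return [min(hash64(g, seed) for g in grams) for seed in range(num_hashes)]

def lsh_band_keys(signature: list[int], num_bands: int = 20, minhashes_per_band: int = 13) -> set:
    # Split the signature into bands; documents sharing any band key become candidate pairs.
    return {
        (band, tuple(signature[band * minhashes_per_band:(band + 1) * minhashes_per_band]))
        for band in range(num_bands)
    }

doc_a = (
    "Fuzzy deduplication removes near-duplicate documents so that training data contains "
    "fewer redundant examples. Near duplicates often come from mirrored pages, boilerplate "
    "templates, or lightly edited reposts, and they can dominate a crawl if left unchecked. "
    "Removing them early keeps downstream tokenization and training costs proportional to "
    "the amount of genuinely novel text in the corpus, which targets exactly the cases "
    "where the overlap is high but the bytes are not identical."
)
doc_b = doc_a.replace("lightly edited reposts", "lightly edited re-posts")

sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
estimated_jaccard = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
shared_bands = lsh_band_keys(sig_a) & lsh_band_keys(sig_b)
print(f"estimated Jaccard similarity: {estimated_jaccard:.2f}")
print(f"shared LSH band keys: {len(shared_bands)}")
Documents that share at least one band key fall into the same bucket, and the later graph and connected-components stages turn those buckets into groups of near-duplicate documents.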
Before You Start#
Prerequisites:
Ray cluster with GPU support (required for distributed processing)
Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)
Adding Document IDs
If your broader pipeline does not already manage IDs, you can add them with the AddId stage:
from nemo_curator.stages.text.modules import AddId
from nemo_curator.pipeline import Pipeline

pipeline = Pipeline(name="add_ids_for_dedup")
pipeline.add_stage(
    AddId(
        id_field="doc_id",
        id_prefix="corpus"  # Optional prefix
    )
)
For more details, refer to Adding Document IDs.
Quick Start#
Get started with fuzzy deduplication using these examples:
Identify duplicates, then remove them:
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13
)
fuzzy_workflow.run()
# Duplicate IDs saved to ./results/FuzzyDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json"
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/
Identify duplicates only, using the default MinHash and LSH parameters:
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False
)
fuzzy_workflow.run()
Configuration#
Configure fuzzy deduplication using these key parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_path | str \| list[str] | None | Path(s) to input files or directories |
| cache_path | str | Required | Directory to cache intermediate results |
| output_path | str | Required | Directory to write duplicate IDs and ID generator |
| text_field | str | "text" | Name of the text field in input data |
| char_ngrams | int | 24 | Character n-gram size for MinHash (recommended: >= 20) |
| num_bands | int | 20 | Number of LSH bands (affects similarity threshold) |
| minhashes_per_band | int | 13 | Number of hashes per LSH band |
| bands_per_iteration | int | 5 | Bands processed per iteration (memory tuning) |
| use_64_bit_hash | bool | False | Use 64-bit hash (more memory, fewer collisions) |
| seed | int | 42 | Random seed for MinHash permutations |
| input_filetype | str | "parquet" | Input file format ("parquet" or "jsonl") |
| input_blocksize | str \| int | "1GiB" | Size of input blocks for processing |
| perform_removal | bool | False | Reserved; must remain False |
Similarity Threshold#
Control the strictness of matching with num_bands and minhashes_per_band:
Stricter matching (higher required similarity): decrease num_bands or increase minhashes_per_band
Less strict matching (lower required similarity): increase num_bands or decrease minhashes_per_band
A common approximation for the detection threshold is (1/num_bands)^(1/minhashes_per_band). The default (num_bands=20, minhashes_per_band=13) corresponds to a Jaccard similarity of roughly 0.8 and provides a balanced trade-off between recall and precision for many datasets; the exact similarity at which pairs are detected also depends on your data distribution.
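The short script below (illustrative only, not part of the workflow API) uses the standard banding formula 1 - (1 - s^r)^b, with b = num_bands and r = minhashes_per_band, to compare how likely a pair with Jaccard similarity s is to be flagged under the default and a stricter setting:
# Probability that a pair with Jaccard similarity s is flagged as a candidate
# under LSH banding with b = num_bands and r = minhashes_per_band.
def candidate_probability(s: float, num_bands: int, minhashes_per_band: int) -> float:
    return 1.0 - (1.0 - s ** minhashes_per_band) ** num_bands

for s in (0.6, 0.7, 0.8, 0.9):
    default = candidate_probability(s, num_bands=20, minhashes_per_band=13)
    stricter = candidate_probability(s, num_bands=15, minhashes_per_band=15)
    print(f"s={s:.1f}  default={default:.3f}  stricter={stricter:.3f}")
With the stricter setting, pairs around 0.7 similarity are flagged far less often, while pairs near 0.9 are still detected reliably.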
Custom Similarity Threshold
# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... required paths and other parameters as above
    num_bands=15,            # Fewer bands = stricter matching
    minhashes_per_band=15    # More hashes per band = stricter matching
)

# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... required paths and other parameters as above
    num_bands=25,            # More bands = less strict matching
    minhashes_per_band=10    # Fewer hashes per band = less strict matching
)
Removing Duplicates#
After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json"  # Required if IDs were auto-assigned
)
removal_workflow.run()
ID Field Configuration
When IDs were auto-assigned:
id_generator_path is required
Ensures consistent ID mapping between identification and removal stages
Output Format#
The fuzzy deduplication process produces the following directory structure:
cache_path/
├── MinHashStage/ # MinHash signatures
│ └── *.parquet
├── LSHStage/ # LSH buckets
│ └── *.parquet
├── BucketsToEdges/ # Graph edges
│ └── *.parquet
└── ConnectedComponents/ # Connected components
└── *.parquet
output_path/
├── FuzzyDuplicateIds/ # Duplicate identification results
│ └── *.parquet # Parquet files with document IDs to remove
└── fuzzy_id_generator.json # ID generator mapping (if IDs were auto-assigned)
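To spot-check which documents were flagged, you can read the duplicate-ID output directly. A minimal sketch using pandas (assumes pandas and pyarrow are installed; this step is optional and not part of the removal workflow):
import pandas as pd

# Read all parquet files in the duplicate-ID output directory.
duplicate_ids = pd.read_parquet("./results/FuzzyDuplicateIds")
print(f"{len(duplicate_ids)} documents flagged for removal")
print(duplicate_ids.head())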
File Formats#
The workflow produces these output files:
Duplicate IDs (FuzzyDuplicateIds/*.parquet):
Contains document IDs to remove
Format: Parquet files with the column ["_curator_dedup_id"]
Important: Contains only the IDs of documents to remove, not the full document content
ID Generator (fuzzy_id_generator.json):
JSON file containing ID generator state
Required for the removal workflow when IDs were auto-assigned
Ensures consistent ID mapping across workflow stages
Cache Files (cache_path/):
Intermediate results for debugging and analysis
Can be reused when re-running with the same parameters
Clear the cache between runs if parameters change significantly
Performance Considerations
Performance characteristics:
GPU-accelerated MinHash and LSH operations
Scales across multiple GPUs and nodes using Ray
bands_per_iteration controls memory usage
Intermediate results are cached for efficiency
GPU requirements:
NVIDIA GPU with CUDA support
Ray cluster with GPU workers
Performance tuning:
Memory: Adjust bands_per_iteration (lower = less memory, more iterations)
Accuracy: Use char_ngrams >= 20 to reduce false positives
Best practices: Clear the cache between runs; use input_blocksize="1GiB"
Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.
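For example, a memory-constrained run might lower bands_per_iteration and use smaller input blocks; the values below are illustrative only:
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    bands_per_iteration=2,      # Fewer bands held in memory at once, more iterations
    input_blocksize="512MiB"    # Smaller input blocks reduce per-task memory
)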
Advanced Usage
Cloud storage configuration:
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="s3://bucket/input/",
    cache_path="s3://bucket/cache/",
    output_path="s3://bucket/output/",
    read_kwargs={
        "storage_options": {
            "key": "<access_key>",
            "secret": "<secret_key>",
            "endpoint_url": "<endpoint_url>"
        }
    },
    # ... other parameters
)
For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview.