
Fuzzy Duplicate Removal


Find and remove near-duplicate documents (those that differ only by small edits or reformatting) using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above an approximate similarity threshold, at scale, on GPUs.

For other approaches, refer to Deduplication.

How It Works

Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:

  1. Computes MinHash signatures over character n-grams
  2. Uses Locality Sensitive Hashing (LSH) to find candidate matches
  3. Builds a graph of duplicate relationships
  4. Identifies groups of near-duplicate documents

Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
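The MinHash idea behind steps 1 and 2 can be sketched in plain Python. This is an illustrative, framework-free toy, not NeMo Curator's GPU implementation; the hash scheme, n-gram size, and signature length here are arbitrary choices:

```python
import hashlib

def char_ngrams(text: str, n: int = 5) -> set[str]:
    # Slide a character window of length n over the text
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(text: str, num_hashes: int = 16, n: int = 5) -> list[int]:
    # One slot per seeded hash function: the minimum hash over all n-grams
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in char_ngrams(text, n)
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching slots approximates the Jaccard similarity
    # of the two documents' n-gram sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("The quick brown fox jumps over the lazy dog")
b = minhash_signature("The quick brown fox jumped over the lazy dog")
print(estimated_jaccard(a, b))  # estimate for two near-duplicate sentences
```

LSH then groups signatures into bands so that documents agreeing on all hashes in at least one band become candidate pairs, avoiding all-pairs comparison.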

Before You Start

Prerequisites:

  • Ray cluster with GPU support (required for distributed processing)
  • Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

Running in Docker: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that Ray workers can access the GPU. Without GPU access, you may see CUDARuntimeError or AttributeError: 'CUDARuntimeError' object has no attribute 'msg'. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.

Quick Start

Get started with fuzzy deduplication using the following example, which first identifies duplicates and then removes them:

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
fuzzy_workflow.run()
# Duplicate IDs saved to ./results/FuzzyDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json",
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/

Configuration

Configure fuzzy deduplication using these key parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str \| list[str]` | `None` | Path(s) to input files or directories |
| `cache_path` | `str` | Required | Directory to cache intermediate results |
| `output_path` | `str` | Required | Directory to write duplicate IDs and ID generator |
| `text_field` | `str` | `"text"` | Name of the text field in input data |
| `char_ngrams` | `int` | `24` | Character n-gram size for MinHash (recommended: >= 20) |
| `num_bands` | `int` | `20` | Number of LSH bands (affects similarity threshold) |
| `minhashes_per_band` | `int` | `13` | Number of hashes per LSH band |
| `bands_per_iteration` | `int` | `5` | Bands processed per iteration (memory tuning) |
| `use_64_bit_hash` | `bool` | `False` | Use 64-bit hash (more memory, fewer collisions) |
| `seed` | `int` | `42` | Random seed for MinHash permutations |
| `input_filetype` | `str` | `"parquet"` | Input file format (`"parquet"` or `"jsonl"`) |
| `input_blocksize` | `str \| int` | `"1GiB"` | Size of input blocks for processing |
| `perform_removal` | `bool` | `False` | Reserved; must remain `False`. Fuzzy removal is performed with `TextDuplicatesRemovalWorkflow`. |

Similarity Threshold

Control matching strictness with num_bands and minhashes_per_band:

  • More strict matching: Decrease num_bands or increase minhashes_per_band
  • Less strict matching: Increase num_bands or decrease minhashes_per_band

Adding bands gives each document pair more chances to collide in at least one band, so more bands lowers the effective similarity threshold; requiring more matching hashes within a band raises it.

Default (num_bands=20, minhashes_per_band=13) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.

# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... required arguments (input_path, cache_path, output_path) as in Quick Start ...
    num_bands=15,           # Fewer bands = fewer candidate pairs = stricter
    minhashes_per_band=15,  # More hashes per band = stricter
)

# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... required arguments (input_path, cache_path, output_path) as in Quick Start ...
    num_bands=25,           # More bands = more candidate pairs = less strict
    minhashes_per_band=10,  # Fewer hashes per band = less strict
)
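In standard MinHash LSH theory, the similarity at which a pair becomes likely to surface as a candidate can be approximated as t ≈ (1/b)^(1/r), with b bands of r hashes each. A quick sketch for sanity-checking a configuration (the helper function is ours, not part of NeMo Curator):

```python
def lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    # Approximate Jaccard similarity at which the probability of
    # becoming a candidate pair crosses ~50%: t = (1/b)^(1/r)
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

print(round(lsh_threshold(20, 13), 2))  # default config -> 0.79
print(round(lsh_threshold(15, 15), 2))  # fewer bands, more hashes -> 0.83 (stricter)
print(round(lsh_threshold(25, 10), 2))  # more bands, fewer hashes -> 0.72 (less strict)
```

This is only an approximation of the S-curve's midpoint; the pairs actually detected also depend on your data distribution.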

Removing Duplicates

After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:

from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",  # Required if IDs were auto-assigned
)
removal_workflow.run()

When IDs were auto-assigned:

  • id_generator_path is required
  • Ensures consistent ID mapping between identification and removal stages

Output Format

The fuzzy deduplication process produces the following directory structure:

cache_path/
├── MinHashStage/            # MinHash signatures
│   └── *.parquet
├── LSHStage/                # LSH buckets
│   └── *.parquet
├── BucketsToEdges/          # Graph edges
│   └── *.parquet
└── ConnectedComponents/     # Connected components
    └── *.parquet

output_path/
├── FuzzyDuplicateIds/       # Duplicate identification results
│   └── *.parquet            # Parquet files with document IDs to remove
└── fuzzy_id_generator.json  # ID generator mapping (if IDs were auto-assigned)

File Formats

The workflow produces these output files:

  1. Duplicate IDs (FuzzyDuplicateIds/*.parquet):

    • Contains document IDs to remove
    • Format: Parquet files with column: ["_curator_dedup_id"]
    • Important: Contains only the IDs of documents to remove, not the full document content
  2. ID Generator (fuzzy_id_generator.json):

    • JSON file containing ID generator state
    • Required for removal workflow when IDs were auto-assigned
    • Ensures consistent ID mapping across workflow stages
  3. Cache Files (cache_path/):

    • Intermediate results for debugging and analysis
    • Can be reused when re-running with the same or compatible parameters
    • Clear cache between runs if parameters change significantly
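The duplicate-ID Parquet files can be inspected directly, and the removal step amounts to an anti-join on the ID column. A minimal pandas sketch of that semantics (the tiny in-memory tables here are illustrative stand-ins, not workflow output):

```python
import pandas as pd

# Stand-in for the input corpus, keyed by the curator ID column
docs = pd.DataFrame({
    "_curator_dedup_id": [0, 1, 2, 3],
    "text": ["a quick fox", "a quick fox!", "hello world", "lorem ipsum"],
})

# Stand-in for FuzzyDuplicateIds/*.parquet: a single column of IDs to drop
duplicate_ids = pd.DataFrame({"_curator_dedup_id": [1]})

# Removal keeps every document whose ID is not in the removal list
clean = docs[~docs["_curator_dedup_id"].isin(duplicate_ids["_curator_dedup_id"])]
print(len(clean))  # 3 documents remain
```

In practice you would read the real files with `pd.read_parquet` rather than building DataFrames by hand; the workflow performs the same filtering at scale.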

Performance characteristics:

  • GPU-accelerated MinHash and LSH operations
  • Scales across multiple GPUs and nodes using Ray
  • bands_per_iteration controls memory usage
  • Intermediate results are cached for efficiency

GPU requirements:

  • NVIDIA GPU with CUDA support
  • Ray cluster with GPU workers

Performance tuning:

  • Memory: Adjust bands_per_iteration (lower = less memory, more iterations)
  • Accuracy: Use char_ngrams >= 20 to reduce false positives
  • Best practices: Clear cache between runs, use input_blocksize="1GiB"

Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.

For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview .