---
description: >-
  Identify and remove near-duplicate documents using MinHash and LSH with GPU
  acceleration
categories:
  - how-to-guides
tags:
  - fuzzy-dedup
  - minhash
  - lsh
  - gpu
  - ray
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: text-only
---
# Fuzzy Duplicate Removal
Find and remove near-duplicate documents that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a tunable similarity threshold, at scale, on GPU.
For other approaches, refer to [Deduplication](/curate-text/process-data/deduplication).
## How It Works
Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:
1. Computes MinHash signatures over character n-grams
2. Uses Locality Sensitive Hashing (LSH) to find candidate matches
3. Builds a graph of duplicate relationships
4. Identifies groups of near-duplicate documents
Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
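The core of steps 1 and 2 can be sketched in a few lines of pure Python. This is only an illustration of the MinHash idea, not NeMo Curator's GPU implementation; the shingle size is shortened to 5 here so the toy strings overlap (the workflow default is 24):

```python
import hashlib

def char_ngrams(text: str, n: int) -> set[str]:
    """Character n-gram shingles of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles: set[str], num_hashes: int = 128) -> list[int]:
    """Each seeded hash's minimum over the shingles approximates
    the minimum of one random permutation."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        )
        for seed in range(num_hashes)
    ]

doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "The quick brown fox jumped over the lazy dog."  # one small edit

sig_a = minhash_signature(char_ngrams(doc_a, 5))
sig_b = minhash_signature(char_ngrams(doc_b, 5))

# The fraction of matching signature positions estimates Jaccard similarity
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
print(f"Estimated Jaccard similarity: {est:.2f}")
```

LSH (step 2) then splits each signature into bands and hashes each band, so that only pairs agreeing on at least one whole band become candidates, avoiding the all-pairs comparison.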
## Before You Start
**Prerequisites**:
* Ray cluster with GPU support (required for distributed processing)
* Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)
**Running in Docker**: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that Ray workers can access the GPU. Without GPU access, you may see `CUDARuntimeError` or `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
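A typical container launch might look like the following; the image name is a placeholder, so substitute the NeMo Curator image tag you actually use:

```shell
# Start the container with GPU access so Ray workers can see the GPUs
docker run --rm -it --gpus all <nemo-curator-image>:<tag> bash

# Inside the container, activate the virtual environment
source /opt/venv/env.sh
```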
## Quick Start
Get started with fuzzy deduplication by first identifying duplicates, then removing them:
```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
ray_client = RayClient()
ray_client.start()
# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
fuzzy_workflow.run()
# Duplicate IDs saved to ./results/FuzzyDuplicateIds/
# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json",
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/
ray_client.stop()
```
## Configuration
Configure fuzzy deduplication using these key parameters:
| Parameter | Type | Default | Description |
| --------------------- | ----------------- | --------- | ----------------------------------------------------------------------------------------------- |
| `input_path` | str \| list\[str] | None | Path(s) to input files or directories |
| `cache_path` | str | Required | Directory to cache intermediate results |
| `output_path` | str | Required | Directory to write duplicate IDs and ID generator |
| `text_field` | str | "text" | Name of the text field in input data |
| `char_ngrams` | int | 24 | Character n-gram size for MinHash (recommended: >= 20) |
| `num_bands` | int | 20 | Number of LSH bands (affects similarity threshold) |
| `minhashes_per_band` | int | 13 | Number of hashes per LSH band |
| `bands_per_iteration` | int | 5 | Bands processed per iteration (memory tuning) |
| `use_64_bit_hash` | bool | False | Use 64-bit hash (more memory, fewer collisions) |
| `seed` | int | 42 | Random seed for MinHash permutations |
| `input_filetype` | str | "parquet" | Input file format ("parquet" or "jsonl") |
| `input_blocksize` | str \| int | "1GiB" | Size of input blocks for processing |
| `perform_removal` | bool | False | Reserved; must remain `False`. Fuzzy removal is performed with `TextDuplicatesRemovalWorkflow`. |
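To see why the `char_ngrams >= 20` recommendation matters: short shingles are shared by many unrelated texts, while long shingles are shared almost only by true near-duplicates. A toy illustration (the helpers are hypothetical, not part of the NeMo Curator API):

```python
def char_ngrams(text: str, n: int) -> set[str]:
    """Character n-gram shingles (illustrative helper)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(x: set[str], y: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    return len(x & y) / len(x | y)

a = "the cat sat on the mat in the hat"
b = "the dog ran to the park in the sun"  # unrelated content

# Tiny shingles overlap by coincidence; longer shingles do not
print(jaccard(char_ngrams(a, 3), char_ngrams(b, 3)))
print(jaccard(char_ngrams(a, 10), char_ngrams(b, 10)))
```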
### Similarity Threshold
Control matching strictness with `num_bands` and `minhashes_per_band`. With `b` bands of `r` MinHashes each, pairs become candidates at an approximate Jaccard similarity of `(1/b)^(1/r)`:
* **More strict matching** (higher similarity required): Decrease `num_bands` or increase `minhashes_per_band`
* **Less strict matching** (lower similarity required): Increase `num_bands` or decrease `minhashes_per_band`
The default (`num_bands=20`, `minhashes_per_band=13`) corresponds to a threshold of roughly 0.8 Jaccard similarity, a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected also depends on your data distribution.
```python
# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ...other required parameters (input_path, cache_path, output_path)...
    num_bands=15,            # Fewer bands = stricter matching
    minhashes_per_band=15,   # More hashes per band = stricter matching
)

# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ...other required parameters (input_path, cache_path, output_path)...
    num_bands=25,            # More bands = less strict matching
    minhashes_per_band=10,   # Fewer hashes per band = less strict matching
)
```
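The banding parameters map to an approximate detection threshold through the standard S-curve estimate from LSH theory. The helpers below are illustrative, not part of the NeMo Curator API:

```python
def lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Approximate Jaccard similarity at which pairs start becoming
    candidates: t ~= (1/b)^(1/r) for b bands of r MinHashes each."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that a pair with the given Jaccard similarity collides
    in at least one LSH band: 1 - (1 - s**r)**b."""
    return 1.0 - (1.0 - similarity ** minhashes_per_band) ** num_bands

# Default configuration: threshold around 0.79 Jaccard similarity
print(round(lsh_threshold(20, 13), 3))

# The S-curve is steep around the threshold
for s in (0.6, 0.8, 0.9):
    print(s, round(candidate_probability(s, 20, 13), 3))
```

These estimates are useful for picking parameters before an expensive run; the observed behavior on real data still depends on the shingle-size choice and your documents' similarity distribution.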
## Removing Duplicates
After identifying duplicates, use `TextDuplicatesRemovalWorkflow` to remove them:
```python
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",  # Required if IDs were auto-assigned
)
removal_workflow.run()
```
**When IDs were auto-assigned**:
* `id_generator_path` is required
* Ensures consistent ID mapping between identification and removal stages
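Conceptually, the removal stage anti-joins the input documents against the duplicate-ID list. A toy sketch of that operation (not the actual distributed implementation):

```python
# Toy stand-ins for the input documents and the FuzzyDuplicateIds output
docs = [
    {"_curator_dedup_id": 0, "text": "original document"},
    {"_curator_dedup_id": 1, "text": "near-duplicate copy"},
    {"_curator_dedup_id": 2, "text": "unrelated document"},
]
ids_to_remove = {1}  # IDs loaded from FuzzyDuplicateIds/*.parquet

# Anti-join: keep only documents whose ID is not flagged for removal
deduplicated = [d for d in docs if d["_curator_dedup_id"] not in ids_to_remove]
print([d["_curator_dedup_id"] for d in deduplicated])  # → [0, 2]
```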
## Output Format
The fuzzy deduplication process produces the following directory structure:
```text
cache_path/
├── MinHashStage/ # MinHash signatures
│ └── *.parquet
├── LSHStage/ # LSH buckets
│ └── *.parquet
├── BucketsToEdges/ # Graph edges
│ └── *.parquet
└── ConnectedComponents/ # Connected components
└── *.parquet
output_path/
├── FuzzyDuplicateIds/ # Duplicate identification results
│ └── *.parquet # Parquet files with document IDs to remove
└── fuzzy_id_generator.json # ID generator mapping (if IDs were auto-assigned)
```
### File Formats
The workflow produces these output files:
1. **Duplicate IDs** (`FuzzyDuplicateIds/*.parquet`):
* Contains document IDs to remove
* Format: Parquet files with a single column, `_curator_dedup_id`
* **Important**: Contains only the IDs of documents to remove, not the full document content
2. **ID Generator** (`fuzzy_id_generator.json`):
* JSON file containing ID generator state
* Required for removal workflow when IDs were auto-assigned
* Ensures consistent ID mapping across workflow stages
3. **Cache Files** (`cache_path/`):
* Intermediate results for debugging and analysis
* Can be reused if re-running with different parameters
* Clear cache between runs if parameters change significantly
**Performance characteristics**:
* GPU-accelerated MinHash and LSH operations
* Scales across multiple GPUs and nodes using Ray
* `bands_per_iteration` controls memory usage
* Intermediate results are cached for efficiency
**GPU requirements**:
* NVIDIA GPU with CUDA support
* Ray cluster with GPU workers
**Performance tuning**:
* **Memory**: Adjust `bands_per_iteration` (lower = less memory, more iterations)
* **Accuracy**: Use `char_ngrams >= 20` to reduce false positives
* **Best practices**: Clear cache between runs, use `input_blocksize="1GiB"`
**Note**: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as `bands_per_iteration`, `char_ngrams`, and `input_blocksize`.
For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the [Deduplication overview](/curate-text/process-data/deduplication).