---
description: >-
  Identify and remove near-duplicate documents using MinHash and LSH with GPU
  acceleration
categories:
  - how-to-guides
tags:
  - fuzzy-dedup
  - minhash
  - lsh
  - gpu
  - ray
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: text-only
---

# Fuzzy Duplicate Removal

Find and remove near-duplicate documents, such as copies with small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above an approximate similarity threshold at scale on GPUs.

For other approaches, refer to [Deduplication](/curate-text/process-data/deduplication).

## How It Works

Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:

1. Computes MinHash signatures over character n-grams
2. Uses Locality Sensitive Hashing (LSH) to find candidate matches
3. Builds a graph of duplicate relationships
4. Identifies groups of near-duplicate documents

Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
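The four steps above can be sketched in pure Python. This is a toy illustration of the technique, not the GPU implementation: the 5-character shingles, `blake2b`-based hash family, and 20 bands of 4 hashes are deliberately small stand-ins for the workflow's defaults.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def char_ngrams(text: str, n: int = 5) -> set:
    """Character n-gram shingles (the workflow defaults to n=24; 5 suits toy inputs)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles: set, num_hashes: int = 80) -> list:
    """One minimum per seeded hash function; matching minima estimate Jaccard similarity."""
    return [
        min(int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingles)
        for seed in range(num_hashes)
    ]

def lsh_candidates(signatures: dict, num_bands: int = 20) -> set:
    """Split each signature into bands; docs sharing any whole band become candidates."""
    rows = len(next(iter(signatures.values()))) // num_bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(num_bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumped over the lazy dog.",  # near-duplicate: small edit
    "c": "Completely different content about GPUs and Ray.",
}
sigs = {k: minhash_signature(char_ngrams(v)) for k, v in docs.items()}
pairs = lsh_candidates(sigs)
print(pairs)  # expect the near-duplicate pair ('a', 'b') to be flagged, but not 'c'
```

In the real workflow, the candidate pairs feed a graph whose connected components define duplicate groups; all but one document per group are marked for removal.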

## Before You Start

**Prerequisites**:

* Ray cluster with GPU support (required for distributed processing)
* Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

<Note>
  **Running in Docker**: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that Ray workers can access the GPU. Without GPU access, you may see `CUDARuntimeError` or `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
</Note>
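For reference, a typical invocation might look like the following; the image name is a placeholder for whatever NeMo Curator container you use locally.

```shell
# Start the container with GPU access; without --gpus all, Ray workers
# cannot see the GPU and the CUDA errors described above appear.
docker run --rm -it --gpus all <nemo-curator-image> bash

# Inside the container, activate the environment before running workflows.
source /opt/venv/env.sh
```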

## Quick Start

Get started with fuzzy deduplication using the following example, which first identifies duplicates and then removes them:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13
)
fuzzy_workflow.run()
# Duplicate IDs saved to ./results/FuzzyDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json"
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/
```

## Configuration

Configure fuzzy deduplication using these key parameters:

| Parameter             | Type              | Default   | Description                                                                                     |
| --------------------- | ----------------- | --------- | ----------------------------------------------------------------------------------------------- |
| `input_path`          | str \| list\[str] | None      | Path(s) to input files or directories                                                           |
| `cache_path`          | str               | Required  | Directory to cache intermediate results                                                         |
| `output_path`         | str               | Required  | Directory to write duplicate IDs and ID generator                                               |
| `text_field`          | str               | "text"    | Name of the text field in input data                                                            |
| `char_ngrams`         | int               | 24        | Character n-gram size for MinHash (recommended: >= 20)                                          |
| `num_bands`           | int               | 20        | Number of LSH bands (affects similarity threshold)                                              |
| `minhashes_per_band`  | int               | 13        | Number of hashes per LSH band                                                                   |
| `bands_per_iteration` | int               | 5         | Bands processed per iteration (memory tuning)                                                   |
| `use_64_bit_hash`     | bool              | False     | Use 64-bit hash (more memory, fewer collisions)                                                 |
| `seed`                | int               | 42        | Random seed for MinHash permutations                                                            |
| `input_filetype`      | str               | "parquet" | Input file format ("parquet" or "jsonl")                                                        |
| `input_blocksize`     | str \| int        | "1GiB"    | Size of input blocks for processing                                                             |
| `perform_removal`     | bool              | False     | Reserved; must remain `False`. Fuzzy removal is performed with `TextDuplicatesRemovalWorkflow`. |

### Similarity Threshold

Control matching strictness with `num_bands` and `minhashes_per_band`:

* **Stricter matching** (higher required similarity, fewer pairs detected): decrease `num_bands` or increase `minhashes_per_band`
* **Looser matching** (lower required similarity, more pairs detected): increase `num_bands` or decrease `minhashes_per_band`

The default (`num_bands=20`, `minhashes_per_band=13`) corresponds to an approximate Jaccard similarity threshold of about 0.8, a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected also depends on your data distribution.
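The trade-off can be sketched numerically: under the standard LSH banding model, the approximate Jaccard threshold is `(1 / num_bands) ** (1 / minhashes_per_band)`. This is a back-of-envelope estimate, not a NeMo Curator API:

```python
def approx_lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Approximate Jaccard similarity above which pairs are likely flagged."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

# Defaults: 20 bands x 13 minhashes per band.
print(round(approx_lsh_threshold(20, 13), 3))  # ~0.794
```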

<Accordion title="Custom Similarity Threshold">
  ```python
  # Example: stricter matching (higher required similarity, fewer pairs detected)
  fuzzy_workflow = FuzzyDeduplicationWorkflow(
      num_bands=15,            # Fewer bands = stricter matching
      minhashes_per_band=15    # More hashes per band = stricter matching
  )

  # Example: looser matching (lower required similarity, more pairs detected)
  fuzzy_workflow = FuzzyDeduplicationWorkflow(
      num_bands=25,            # More bands = looser matching
      minhashes_per_band=10    # Fewer hashes per band = looser matching
  )
  ```
</Accordion>

## Removing Duplicates

After identifying duplicates, use `TextDuplicatesRemovalWorkflow` to remove them:

```python
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json"  # Required if IDs were auto-assigned
)
removal_workflow.run()
```

<Accordion title="ID Field Configuration">
  **When IDs were auto-assigned**:

  * `id_generator_path` is required
  * Ensures consistent ID mapping between identification and removal stages
</Accordion>

## Output Format

The fuzzy deduplication process produces the following directory structure:

```text
cache_path/
├── MinHashStage/                    # MinHash signatures
│   └── *.parquet
├── LSHStage/                        # LSH buckets
│   └── *.parquet
├── BucketsToEdges/                  # Graph edges
│   └── *.parquet
└── ConnectedComponents/             # Connected components
    └── *.parquet

output_path/
├── FuzzyDuplicateIds/               # Duplicate identification results
│   └── *.parquet                    # Parquet files with document IDs to remove
└── fuzzy_id_generator.json          # ID generator mapping (if IDs were auto-assigned)
```

### File Formats

The workflow produces these output files:

1. **Duplicate IDs** (`FuzzyDuplicateIds/*.parquet`):
   * Contains document IDs to remove
   * Format: Parquet files with a single column, `_curator_dedup_id`
   * **Important**: Contains only the IDs of documents to remove, not the full document content

2. **ID Generator** (`fuzzy_id_generator.json`):
   * JSON file containing ID generator state
   * Required for removal workflow when IDs were auto-assigned
   * Ensures consistent ID mapping across workflow stages

3. **Cache Files** (`cache_path/`):
   * Intermediate results for debugging and analysis
   * Can be reused if re-running with different parameters
   * Clear cache between runs if parameters change significantly

<Accordion title="Performance Considerations">
  **Performance characteristics**:

  * GPU-accelerated MinHash and LSH operations
  * Scales across multiple GPUs and nodes using Ray
  * `bands_per_iteration` controls memory usage
  * Intermediate results are cached for efficiency

  **GPU requirements**:

  * NVIDIA GPU with CUDA support
  * Ray cluster with GPU workers

  **Performance tuning**:

  * **Memory**: Adjust `bands_per_iteration` (lower = less memory, more iterations)
  * **Accuracy**: Use `char_ngrams >= 20` to reduce false positives
  * **Best practices**: Clear cache between runs, use `input_blocksize="1GiB"`

  **Note**: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as `bands_per_iteration`, `char_ngrams`, and `input_blocksize`.
</Accordion>

For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the [Deduplication overview](/curate-text/process-data/deduplication).
