---
description: >-
  Identify and remove near-duplicate documents using MinHash and LSH with GPU
  acceleration
categories:
  - how-to-guides
tags:
  - fuzzy-dedup
  - minhash
  - lsh
  - gpu
  - ray
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: text-only
---

# Fuzzy Duplicate Removal

Find and remove near-duplicate documents that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold at scale on GPUs. For other approaches, refer to [Deduplication](/curate-text/process-data/deduplication).

## How It Works

Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:

1. Computes MinHash signatures over character n-grams
2. Uses Locality Sensitive Hashing (LSH) to find candidate matches
3. Builds a graph of duplicate relationships
4. Identifies groups of near-duplicate documents

It is ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.

## Before You Start

**Prerequisites**:

* Ray cluster with GPU support (required for distributed processing)
* Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

**Running in Docker**: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with `--gpus all` so that Ray workers can access the GPU. Without GPU access, you may see `CUDARuntimeError` or `AttributeError: 'CUDARuntimeError' object has no attribute 'msg'`. Also activate the virtual environment with `source /opt/venv/env.sh` after entering the container.
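The first two steps above (MinHash over character n-grams, then LSH banding) can be sketched in miniature on CPU. This toy is illustrative only: the helper names are this sketch's own, the salted-hash "permutations" stand in for real MinHash permutations, and Curator's actual implementation runs GPU-accelerated over Parquet.

```python
import hashlib
from collections import defaultdict

def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Set of overlapping character n-grams (shingles)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles: set[str], num_hashes: int = 260, seed: int = 42) -> list[int]:
    """One min-hash per simulated permutation; 260 = 20 bands x 13 rows."""
    sig = []
    for h in range(num_hashes):
        salt = f"{seed}:{h}:".encode()
        sig.append(min(
            int.from_bytes(hashlib.blake2b(salt + s.encode(), digest_size=8).digest(), "big")
            for s in shingles
        ))
    return sig

def lsh_candidates(signatures: dict[str, list[int]], num_bands: int = 20, rows: int = 13) -> set[tuple[str, str]]:
    """Pairs of doc IDs whose signatures agree on at least one full band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(num_bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog!",  # near-duplicate of "a"
    "c": "completely different content about something else entirely",
}
sigs = {k: minhash_signature(char_ngrams(v)) for k, v in docs.items()}
print(lsh_candidates(sigs))  # the near-duplicate pair ("a", "b") collides; "c" does not
```

The candidate pairs produced here correspond to the graph edges that the real workflow later resolves into connected components of duplicates.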
## Quick Start

The following example identifies duplicates and then removes them:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
fuzzy_workflow.run()
# Duplicate IDs saved to ./results/FuzzyDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json",
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/
```

## Configuration

Configure fuzzy deduplication using these key parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | str \| list\[str] | None | Path(s) to input files or directories |
| `cache_path` | str | Required | Directory to cache intermediate results |
| `output_path` | str | Required | Directory to write duplicate IDs and ID generator |
| `text_field` | str | "text" | Name of the text field in input data |
| `char_ngrams` | int | 24 | Character n-gram size for MinHash (recommended: >= 20) |
| `num_bands` | int | 20 | Number of LSH bands (affects similarity threshold) |
| `minhashes_per_band` | int | 13 | Number of hashes per LSH band |
| `bands_per_iteration` | int | 5 | Bands processed per iteration (memory tuning) |
| `use_64_bit_hash` | bool | False | Use 64-bit hash (more memory, fewer collisions) |
| `seed` | int | 42 | Random seed for MinHash permutations |
| `input_filetype` | str | "parquet" | Input file format ("parquet" or "jsonl") |
| `input_blocksize` | str \| int | "1GiB" | Size of input blocks for processing |
| `perform_removal` | bool | False | Reserved; must remain `False`. Fuzzy removal is performed with `TextDuplicatesRemovalWorkflow`. |

### Similarity Threshold

Control matching strictness with `num_bands` and `minhashes_per_band`:

* **More strict matching** (higher required similarity): Decrease `num_bands` or increase `minhashes_per_band`
* **Less strict matching** (lower required similarity): Increase `num_bands` or decrease `minhashes_per_band`

Increasing the number of bands gives a pair more chances to collide in some band, and shortening each band makes a full-band match easier, so both loosen matching. The default (`num_bands=20`, `minhashes_per_band=13`) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.
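As a rule of thumb from the standard MinHash/LSH analysis (not a NeMo Curator API), the Jaccard similarity at which a pair becomes likely to be flagged is roughly `(1/num_bands)^(1/minhashes_per_band)`. A small sketch for estimating it:

```python
def approx_lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Approximate similarity at which a pair is likely detected,
    using the standard LSH S-curve rule of thumb: t ~ (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

def candidate_probability(s: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that a pair with Jaccard similarity s shares at least
    one LSH band: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** minhashes_per_band) ** num_bands

# Defaults (num_bands=20, minhashes_per_band=13) target roughly 0.8 similarity.
print(round(approx_lsh_threshold(20, 13), 2))        # ~0.79
print(round(candidate_probability(0.9, 20, 13), 2))  # ~1.0: a 0.9-similar pair is almost surely flagged
```

The S-curve also explains why the two knobs trade off: fewer, longer bands push the threshold up (stricter), while more, shorter bands pull it down (looser).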
```python
# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    num_bands=25,           # More bands = more chances for a pair to collide
    minhashes_per_band=10,  # Shorter bands = easier full-band matches
)

# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    num_bands=15,           # Fewer bands = fewer chances to collide
    minhashes_per_band=15,  # Longer bands = harder full-band matches
)
```

## Removing Duplicates

After identifying duplicates, use `TextDuplicatesRemovalWorkflow` to remove them:

```python
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",  # Required if IDs were auto-assigned
)
removal_workflow.run()
```

**When IDs were auto-assigned**:

* `id_generator_path` is required
* Ensures consistent ID mapping between identification and removal stages

## Output Format

The fuzzy deduplication process produces the following directory structure:

```
cache_path/
├── MinHashStage/            # MinHash signatures
│   └── *.parquet
├── LSHStage/                # LSH buckets
│   └── *.parquet
├── BucketsToEdges/          # Graph edges
│   └── *.parquet
└── ConnectedComponents/     # Connected components
    └── *.parquet

output_path/
├── FuzzyDuplicateIds/       # Duplicate identification results
│   └── *.parquet            # Parquet files with document IDs to remove
└── fuzzy_id_generator.json  # ID generator mapping (if IDs were auto-assigned)
```

### File Formats

The workflow produces these output files:

1. 
**Duplicate IDs** (`FuzzyDuplicateIds/*.parquet`):
   * Contains document IDs to remove
   * Format: Parquet files with column: `["_curator_dedup_id"]`
   * **Important**: Contains only the IDs of documents to remove, not the full document content
2. **ID Generator** (`fuzzy_id_generator.json`):
   * JSON file containing ID generator state
   * Required for the removal workflow when IDs were auto-assigned
   * Ensures consistent ID mapping across workflow stages
3. **Cache Files** (`cache_path/`):
   * Intermediate results for debugging and analysis
   * Can be reused if re-running with different parameters
   * Clear the cache between runs if parameters change significantly

**Performance characteristics**:

* GPU-accelerated MinHash and LSH operations
* Scales across multiple GPUs and nodes using Ray
* `bands_per_iteration` controls memory usage
* Intermediate results are cached for efficiency

**GPU requirements**:

* NVIDIA GPU with CUDA support
* Ray cluster with GPU workers

**Performance tuning**:

* **Memory**: Adjust `bands_per_iteration` (lower = less memory, more iterations)
* **Accuracy**: Use `char_ngrams >= 20` to reduce false positives
* **Best practices**: Clear the cache between runs, use `input_blocksize="1GiB"`

**Note**: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as `bands_per_iteration`, `char_ngrams`, and `input_blocksize`.

For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the [Deduplication overview](/curate-text/process-data/deduplication).
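Conceptually, the removal stage is an anti-join: keep every input row whose `_curator_dedup_id` does not appear in the `FuzzyDuplicateIds` output. A minimal pure-Python sketch of that semantics (not the workflow's actual implementation, which runs distributed over Parquet; the toy rows here are invented):

```python
# Toy documents keyed by _curator_dedup_id, mimicking input Parquet rows.
docs = [
    {"_curator_dedup_id": 0, "text": "the quick brown fox"},
    {"_curator_dedup_id": 1, "text": "the quick brown fox!"},  # near-duplicate of id 0
    {"_curator_dedup_id": 2, "text": "an unrelated document"},
]

# IDs flagged for removal, as read from FuzzyDuplicateIds/*.parquet.
# Within each duplicate group, one representative survives (id 0 here).
ids_to_remove = {1}

# Anti-join: retain only rows whose ID is not flagged.
deduplicated = [d for d in docs if d["_curator_dedup_id"] not in ids_to_remove]
print([d["_curator_dedup_id"] for d in deduplicated])  # [0, 2]
```

This is why the duplicate-ID files contain only IDs, not document content: the removal workflow re-reads the original input and filters it against this ID set.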