
Fuzzy Duplicate Removal


Find and remove near-duplicate documents (those that differ only by small edits or reformatting) using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above an approximate similarity threshold, at scale, on GPUs.

For other approaches, refer to Deduplication.

How It Works

Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:

  1. Computes MinHash signatures over character n-grams
  2. Uses Locality Sensitive Hashing (LSH) to find candidate matches
  3. Builds a graph of duplicate relationships
  4. Identifies groups of near-duplicate documents

Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
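The MinHash idea behind steps 1 and 2 can be sketched in plain Python. This is an illustrative, framework-free toy, not NeMo Curator's GPU implementation; the hash scheme, n-gram size, and signature length here are arbitrary choices:

```python
import hashlib

def char_ngrams(text: str, n: int = 5) -> set[str]:
    # Slide a character window of length n over the text
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(text: str, num_hashes: int = 16, n: int = 5) -> list[int]:
    # One slot per seeded hash function: the minimum hash over all n-grams
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in char_ngrams(text, n)
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching slots approximates the Jaccard similarity
    # of the two documents' n-gram sets
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("The quick brown fox jumps over the lazy dog")
b = minhash_signature("The quick brown fox jumped over the lazy dog")
print(estimated_jaccard(a, b))  # estimate for two near-duplicate sentences
```

LSH then groups signatures into bands so that documents agreeing on all hashes in at least one band become candidate pairs, avoiding all-pairs comparison.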

Before You Start

Prerequisites:

  • Ray cluster with GPU support (required for distributed processing)
  • Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

Running in Docker: When running fuzzy deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that Ray workers can access the GPU. Without GPU access, you may see CUDARuntimeError or AttributeError: 'CUDARuntimeError' object has no attribute 'msg'. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.

Quick Start

Get started with fuzzy deduplication using the following example, which first identifies duplicates and then removes them:

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
fuzzy_workflow.run()
# Duplicate IDs saved to ./results/FuzzyDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json",
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/

Configuration

Configure fuzzy deduplication using these key parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str \| list[str]` | `None` | Path(s) to input files or directories |
| `cache_path` | `str` | Required | Directory to cache intermediate results |
| `output_path` | `str` | Required | Directory to write duplicate IDs and ID generator |
| `text_field` | `str` | `"text"` | Name of the text field in input data |
| `char_ngrams` | `int` | `24` | Character n-gram size for MinHash (recommended: >= 20) |
| `num_bands` | `int` | `20` | Number of LSH bands (affects similarity threshold) |
| `minhashes_per_band` | `int` | `13` | Number of hashes per LSH band |
| `bands_per_iteration` | `int` | `5` | Bands processed per iteration (memory tuning) |
| `use_64_bit_hash` | `bool` | `False` | Use 64-bit hash (more memory, fewer collisions) |
| `seed` | `int` | `42` | Random seed for MinHash permutations |
| `input_filetype` | `str` | `"parquet"` | Input file format (`"parquet"` or `"jsonl"`) |
| `input_blocksize` | `str \| int` | `"1GiB"` | Size of input blocks for processing |
| `perform_removal` | `bool` | `False` | Reserved; must remain `False`. Fuzzy removal is performed with `TextDuplicatesRemovalWorkflow`. |

Similarity Threshold

Control matching strictness with num_bands and minhashes_per_band:

  • More strict matching: Decrease num_bands or increase minhashes_per_band
  • Less strict matching: Increase num_bands or decrease minhashes_per_band

Adding bands gives each document pair more chances to collide in at least one band, so more bands lowers the effective similarity threshold; requiring more matching hashes within a band raises it.

Default (num_bands=20, minhashes_per_band=13) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.

# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... required arguments (input_path, cache_path, output_path) as in Quick Start ...
    num_bands=15,           # Fewer bands = fewer candidate pairs = stricter
    minhashes_per_band=15,  # More hashes per band = stricter
)

# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... required arguments (input_path, cache_path, output_path) as in Quick Start ...
    num_bands=25,           # More bands = more candidate pairs = less strict
    minhashes_per_band=10,  # Fewer hashes per band = less strict
)
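In standard MinHash LSH theory, the similarity at which a pair becomes likely to surface as a candidate can be approximated as t ≈ (1/b)^(1/r), with b bands of r hashes each. A quick sketch for sanity-checking a configuration (the helper function is ours, not part of NeMo Curator):

```python
def lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    # Approximate Jaccard similarity at which the probability of
    # becoming a candidate pair crosses ~50%: t = (1/b)^(1/r)
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

print(round(lsh_threshold(20, 13), 2))  # default config -> 0.79
print(round(lsh_threshold(15, 15), 2))  # fewer bands, more hashes -> 0.83 (stricter)
print(round(lsh_threshold(25, 10), 2))  # more bands, fewer hashes -> 0.72 (less strict)
```

This is only an approximation of the S-curve's midpoint; the pairs actually detected also depend on your data distribution.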

Removing Duplicates

After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:

from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",  # Required if IDs were auto-assigned
)
removal_workflow.run()

When IDs were auto-assigned:

  • id_generator_path is required
  • Ensures consistent ID mapping between identification and removal stages

Output Format

The fuzzy deduplication process produces the following directory structure:

cache_path/
├── MinHashStage/            # MinHash signatures
│   └── *.parquet
├── LSHStage/                # LSH buckets
│   └── *.parquet
├── BucketsToEdges/          # Graph edges
│   └── *.parquet
└── ConnectedComponents/     # Connected components
    └── *.parquet

output_path/
├── FuzzyDuplicateIds/       # Duplicate identification results
│   └── *.parquet            # Parquet files with document IDs to remove
└── fuzzy_id_generator.json  # ID generator mapping (if IDs were auto-assigned)

File Formats

The workflow produces these output files:

  1. Duplicate IDs (FuzzyDuplicateIds/*.parquet):

    • Contains document IDs to remove
    • Format: Parquet files with column: ["_curator_dedup_id"]
    • Important: Contains only the IDs of documents to remove, not the full document content
  2. ID Generator (fuzzy_id_generator.json):

    • JSON file containing ID generator state
    • Required for removal workflow when IDs were auto-assigned
    • Ensures consistent ID mapping across workflow stages
  3. Cache Files (cache_path/):

    • Intermediate results for debugging and analysis
    • Can be reused when re-running with the same or compatible parameters
    • Clear cache between runs if parameters change significantly
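The duplicate-ID Parquet files can be inspected directly, and the removal step amounts to an anti-join on the ID column. A minimal pandas sketch of that semantics (the tiny in-memory tables here are illustrative stand-ins, not workflow output):

```python
import pandas as pd

# Stand-in for the input corpus, keyed by the curator ID column
docs = pd.DataFrame({
    "_curator_dedup_id": [0, 1, 2, 3],
    "text": ["a quick fox", "a quick fox!", "hello world", "lorem ipsum"],
})

# Stand-in for FuzzyDuplicateIds/*.parquet: a single column of IDs to drop
duplicate_ids = pd.DataFrame({"_curator_dedup_id": [1]})

# Removal keeps every document whose ID is not in the removal list
clean = docs[~docs["_curator_dedup_id"].isin(duplicate_ids["_curator_dedup_id"])]
print(len(clean))  # 3 documents remain
```

In practice you would read the real files with `pd.read_parquet` rather than building DataFrames by hand; the workflow performs the same filtering at scale.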

Performance characteristics:

  • GPU-accelerated MinHash and LSH operations
  • Scales across multiple GPUs and nodes using Ray
  • bands_per_iteration controls memory usage
  • Intermediate results are cached for efficiency

GPU requirements:

  • NVIDIA GPU with CUDA support
  • Ray cluster with GPU workers

Performance tuning:

  • Memory: Adjust bands_per_iteration (lower = less memory, more iterations)
  • Accuracy: Use char_ngrams >= 20 to reduce false positives
  • Best practices: Clear cache between runs, use input_blocksize="1GiB"

Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.

For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview .