
Fuzzy Duplicate Removal

Find and remove near-duplicate documents, such as those that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold at scale on GPUs.

For other approaches, refer to Deduplication.

How It Works

Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:

  1. Computes MinHash signatures over character n-grams
  2. Uses Locality Sensitive Hashing (LSH) to find candidate matches
  3. Builds a graph of duplicate relationships
  4. Identifies groups of near-duplicate documents

Ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
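The steps above can be sketched in plain Python. This is a toy illustration of the MinHash-plus-banding idea, not the GPU implementation: the n-gram size, hash count, band count, and helper names here are made up for the example.

```python
import hashlib

def char_ngrams(text, n=5):
    """All character n-grams of the text, as a set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(ngrams, num_hashes=8):
    """One value per seeded hash function: the minimum hash over all n-grams."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in ngrams)
        for seed in range(num_hashes)
    ]

def lsh_bands(sig, num_bands=4):
    """Split a signature into bands; documents sharing any band become candidate pairs."""
    rows = len(sig) // num_bands
    return [tuple(sig[b * rows:(b + 1) * rows]) for b in range(num_bands)]

sig_a = minhash_signature(char_ngrams("The quick brown fox jumps over the lazy dog."))
sig_b = minhash_signature(char_ngrams("The quick brown fox jumps over the lazy dog!"))
# Near-identical texts share most n-grams, so most min-hashes agree and
# at least one band is likely to collide, flagging the pair as candidates.
shared_bands = sum(1 for x, y in zip(lsh_bands(sig_a), lsh_bands(sig_b)) if x == y)
```

Candidate pairs found this way are then linked into a graph, and each connected component becomes one group of near-duplicates.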

Before You Start

Prerequisites:

  • Ray cluster with GPU support (required for distributed processing)
  • Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

Quick Start

Get started with fuzzy deduplication using the following example, which first identifies duplicates and then removes them:

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
fuzzy_workflow.run()
# Duplicate IDs saved to ./results/FuzzyDuplicateIds/

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json",
)
removal_workflow.run()
# Clean dataset saved to ./deduplicated/

Configuration

Configure fuzzy deduplication using these key parameters:

Key Configuration Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input_path | str \| list[str] | None | Path or list of paths to the input data |
| cache_path | str | Required | Directory to cache intermediate results |
| output_path | str | Required | Directory to write duplicate IDs and ID generator |
| text_field | str | "text" | Name of the text field in input data |
| char_ngrams | int | 24 | Character n-gram size for MinHash (recommended: >= 20) |
| num_bands | int | 20 | Number of LSH bands (affects similarity threshold) |
| minhashes_per_band | int | 13 | Number of hashes per LSH band |
| bands_per_iteration | int | 5 | Bands processed per iteration (memory tuning) |
| use_64_bit_hash | bool | False | Use 64-bit hash (more memory, fewer collisions) |
| seed | int | 42 | Random seed for MinHash permutations |
| input_filetype | str | "parquet" | Input file format ("parquet" or "jsonl") |
| input_blocksize | str \| int | "1GiB" | Target size of the input blocks read per task |
| perform_removal | bool | False | Reserved; must remain False. Fuzzy removal is performed with TextDuplicatesRemovalWorkflow. |

Similarity Threshold

Control matching strictness with num_bands and minhashes_per_band:

  • More strict matching (higher similarity required): Decrease num_bands or increase minhashes_per_band
  • Less strict matching (lower similarity required): Increase num_bands or decrease minhashes_per_band

Default (num_bands=20, minhashes_per_band=13) provides a balanced trade-off between recall and precision for many datasets. The exact similarity at which pairs are detected depends on your data distribution.
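As a rule of thumb from MinHash LSH theory, the midpoint of the detection curve sits near (1/num_bands)^(1/minhashes_per_band). A quick sketch (the helper name is ours, not part of the library):

```python
def lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Approximate Jaccard similarity at which a pair has ~50% detection probability."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)

print(round(lsh_threshold(20, 13), 2))  # default configuration -> ~0.79
```

The curve is steep but not a hard cutoff: pairs somewhat below the midpoint may still be detected, and pairs somewhat above it occasionally missed.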

# Example: stricter matching (fewer pairs detected, higher required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... other required parameters (input_path, cache_path, output_path) ...
    num_bands=15,           # Fewer bands = stricter matching
    minhashes_per_band=15,  # More hashes per band = stricter matching
)

# Example: less strict matching (more pairs detected, lower required similarity)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    # ... other required parameters (input_path, cache_path, output_path) ...
    num_bands=25,           # More bands = less strict matching
    minhashes_per_band=10,  # Fewer hashes per band = less strict matching
)

Removing Duplicates

After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:

from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",  # Required if IDs were auto-assigned
)
removal_workflow.run()

When IDs were auto-assigned:

  • id_generator_path is required
  • Ensures consistent ID mapping between identification and removal stages

Output Format

The fuzzy deduplication process produces the following directory structure:

cache_path/
├── MinHashStage/            # MinHash signatures
│   └── *.parquet
├── LSHStage/                # LSH buckets
│   └── *.parquet
├── BucketsToEdges/          # Graph edges
│   └── *.parquet
└── ConnectedComponents/     # Connected components
    └── *.parquet

output_path/
├── FuzzyDuplicateIds/       # Duplicate identification results
│   └── *.parquet            # Parquet files with document IDs to remove
└── fuzzy_id_generator.json  # ID generator mapping (if IDs were auto-assigned)

File Formats

The workflow produces these output files:

  1. Duplicate IDs (FuzzyDuplicateIds/*.parquet):

    • Contains document IDs to remove
    • Format: Parquet files with column: ["_curator_dedup_id"]
    • Important: Contains only the IDs of documents to remove, not the full document content
  2. ID Generator (fuzzy_id_generator.json):

    • JSON file containing ID generator state
    • Required for removal workflow when IDs were auto-assigned
    • Ensures consistent ID mapping across workflow stages
  3. Cache Files (cache_path/):

    • Intermediate results for debugging and analysis
    • Can be reused if re-running with different parameters
    • Clear cache between runs if parameters change significantly
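To make the ID-only output format concrete, here is how it is typically consumed: an anti-join that drops every document whose ID appears in the duplicate list. This pandas sketch uses toy in-memory data; in practice you would read the parquet files listed above, and TextDuplicatesRemovalWorkflow performs this filtering for you at scale.

```python
import pandas as pd

# Toy stand-ins for the workflow outputs (in practice, read the parquet files
# under output_path/FuzzyDuplicateIds/ and your input directory).
duplicate_ids = pd.DataFrame({"_curator_dedup_id": [2, 4]})
docs = pd.DataFrame({
    "_curator_dedup_id": [1, 2, 3, 4],
    "text": ["keep", "near-dupe", "keep", "near-dupe"],
})

# Anti-join: keep only documents whose ID is not flagged for removal.
clean = docs[~docs["_curator_dedup_id"].isin(duplicate_ids["_curator_dedup_id"])]
print(list(clean["_curator_dedup_id"]))  # [1, 3]
```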

Performance characteristics:

  • GPU-accelerated MinHash and LSH operations
  • Scales across multiple GPUs and nodes using Ray
  • bands_per_iteration controls memory usage
  • Intermediate results are cached for efficiency

GPU requirements:

  • NVIDIA GPU with CUDA support
  • Ray cluster with GPU workers

Performance tuning:

  • Memory: Adjust bands_per_iteration (lower = less memory, more iterations)
  • Accuracy: Use char_ngrams >= 20 to reduce false positives
  • Best practices: Clear cache between runs, use input_blocksize="1GiB"

Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.

For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview.