Fuzzy Duplicate Removal#
Find near-duplicate documents that differ only by small edits or reformatting using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a configurable similarity threshold at scale on GPUs.
For other approaches, refer to Deduplication.
How It Works#
File partitioning for scalable, distributed processing
MinHash signatures over character n-grams
LSH banding to find candidate matches (see the sketch after this list)
Graph construction and connected components
Select one document per duplicate group and emit IDs to remove
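Conceptually, MinHash turns each document's set of character n-grams into a short signature whose agreement approximates Jaccard similarity, and LSH banding hashes slices of that signature so that similar documents collide in at least one band. The sketch below is a minimal, self-contained illustration of that mechanism on toy strings; the n-gram size, hash functions, and helper names are illustrative assumptions and do not reflect the workflow's GPU implementation.

import hashlib

def char_ngram_set(text: str, n: int = 5) -> set[str]:
    # Character n-grams of a document (the workflow defaults to larger n).
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(shingles: set[str], num_hashes: int = 12) -> list[int]:
    # One minimum hash value per seeded hash function.
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature

def lsh_bands(signature: list[int], num_bands: int = 4) -> list[tuple]:
    # Split the signature into bands; documents sharing any band become candidates.
    rows = len(signature) // num_bands
    return [tuple(signature[i * rows:(i + 1) * rows]) for i in range(num_bands)]

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"  # near duplicate
bands_a = lsh_bands(minhash_signature(char_ngram_set(doc_a)))
bands_b = lsh_bands(minhash_signature(char_ngram_set(doc_b)))
print("candidate pair:", any(a == b for a, b in zip(bands_a, bands_b)))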
Usage#
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Basic fuzzy duplicate identification
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    input_blocksize="1GiB",  # Default block size for fuzzy dedup
    # MinHash parameters
    seed=42,
    char_ngrams=24,  # Character n-gram size for MinHash
    # LSH parameters
    num_bands=20,
    minhashes_per_band=13,
    use_64_bit_hash=False,
    # Performance tuning
    bands_per_iteration=5,
)
fuzzy_workflow.run()
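The num_bands and minhashes_per_band values together determine the effective Jaccard similarity threshold. By the standard LSH analysis, a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b (r = minhashes per band, b = bands), and the S-curve's inflection point sits near (1/b)^(1/r). The back-of-the-envelope sketch below applies that heuristic to the values above, giving roughly a 0.8 threshold; it is an estimate, not a value the workflow reports.

# Approximate LSH candidate threshold for the configuration above.
# This is the standard MinHash/LSH heuristic, not output from the workflow.
num_bands = 20           # b: number of LSH bands
minhashes_per_band = 13  # r: minhashes (rows) per band

def candidate_probability(s: float, b: int = num_bands, r: int = minhashes_per_band) -> float:
    # Probability that a pair with Jaccard similarity s collides in at least one band.
    return 1.0 - (1.0 - s**r) ** b

# Effective similarity threshold (inflection point of the S-curve).
threshold = (1.0 / num_bands) ** (1.0 / minhashes_per_band)
print(f"approximate threshold: {threshold:.2f}")  # ~0.79
print(f"P(candidate | s=0.9): {candidate_probability(0.9):.3f}")
print(f"P(candidate | s=0.5): {candidate_probability(0.5):.3f}")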
# Advanced configuration (I/O and storage options)
fuzzy_workflow_advanced = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    input_filetype="parquet",  # "parquet" or "jsonl"
    input_blocksize="1GiB",
    input_file_extensions=[".parquet"],
    read_kwargs={"storage_options": {"key": "<access_key>", "secret": "<secret_key>"}},
    cache_kwargs={"storage_options": {"key": "<access_key>", "secret": "<secret_key>"}},
    write_kwargs={"storage_options": {"key": "<access_key>", "secret": "<secret_key>"}},
    text_field="content",
    perform_removal=False,
    seed=123,
    char_ngrams=20,
    num_bands=25,
    minhashes_per_band=10,
    use_64_bit_hash=True,
    bands_per_iteration=3,
    env_vars={"CUDA_VISIBLE_DEVICES": "0,1,2,3"},
)
fuzzy_workflow_advanced.run()
Note
Removal is currently not implemented in the fuzzy workflow (perform_removal=True raises an error). Use the duplicate ID outputs with the Text Duplicates Removal workflow. Refer to Common Operations for removal and outputs.
Performance Recommendations#
Use char_ngrams >= 20 to reduce false positives
Adjust bands_per_iteration based on available GPU memory
Requires a Ray-based distributed GPU execution environment
Clear the cache and output directories between runs to avoid conflicts
Output Structure#
Cache directory: MinHashStage/, LSHStage/, BucketsToEdges/, ConnectedComponents/
Output directory:
FuzzyDuplicateIds/: Parquet files with document IDs to remove
fuzzy_id_generator.json: ID generator mapping
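As a quick sanity check, the identification output can be inspected with pandas before running removal. The snippet below is a minimal sketch under assumptions: it presumes the duplicate-ID Parquet files expose an "id" column and that your corpus carries the same ID column, which may not match your schema. For actual removal, use the Text Duplicates Removal workflow noted above.

import pandas as pd

# Hypothetical inspection/removal sketch; adjust paths, formats, and the
# "id" column name to match your data and the generated document IDs.
duplicate_ids = pd.read_parquet("/path/to/output/FuzzyDuplicateIds")
corpus = pd.read_parquet("/path/to/input/data")

ids_to_remove = set(duplicate_ids["id"])
deduped = corpus[~corpus["id"].isin(ids_to_remove)]
print(f"removed {len(corpus) - len(deduped)} of {len(corpus)} documents")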
Workflow Stages#
File Partitioning
MinHash
LSH
Buckets to Edges
Connected Components
Identify Duplicates