Hash-Based Duplicate Removal#

Remove duplicate and near-duplicate documents from your text datasets using NeMo Curator’s hash-based deduplication modules with optional GPU acceleration.

How It Works#

These modules use hash-based algorithms to process large datasets efficiently and support two primary methods: exact and fuzzy duplicate removal. Fuzzy deduplication leverages RAPIDS for GPU acceleration.

Table 6 Hash-Based Duplicate Removal Methods#

| Method       | Exact Duplicate Removal                                                                                | Fuzzy Duplicate Removal                                                                                                        |
|--------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------|
| Purpose      | Removes identical documents                                                                              | Removes similar documents based on content                                                                                       |
| Process      | 1. Hash document content 2. Keep one document per unique hash 3. Works on CPU or GPU (GPU recommended)   | 1. Compute MinHash signatures 2. Group via LSH buckets 3. Optional similarity verification 4. Keep one doc per similar group     |
| GPU Required | Optional                                                                                                 | Required                                                                                                                          |
| Best For     | Finding exact copies                                                                                     | Finding near-duplicates and variants                                                                                              |
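The exact-removal process in the table reduces to "hash each document, keep the first document per hash". The following minimal sketch illustrates that idea with plain pandas and MD5; it is a conceptual illustration, not NeMo Curator's internal implementation, and the toy column names are assumptions.

import hashlib

import pandas as pd

# Toy corpus; "text" is the field to be hashed (column names are illustrative).
df = pd.DataFrame({
    "my_id": ["doc-0", "doc-1", "doc-2"],
    "text": ["hello world", "hello world", "something else"],
})

# 1. Hash document content (MD5, mirroring ExactDuplicates' hash_method).
df["_hashes"] = df["text"].apply(lambda t: hashlib.md5(t.encode("utf-8")).hexdigest())

# 2. Keep one document per unique hash.
deduplicated = df.drop_duplicates(subset="_hashes").drop(columns="_hashes")
print(deduplicated)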

Removing duplicates improves language model training by preventing overrepresentation of repeated content. For more information, see research by Muennighoff et al. (2023) and Tirumala et al. (2023).


Understanding Operational Modes#

Both ExactDuplicates and FuzzyDuplicates support two operational modes controlled by the perform_removal parameter:

Table 7 Operational Modes#

| Mode         | perform_removal=False (Default)                                                                            | perform_removal=True                                               |
|--------------|-------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
| Return Value | Dataset with duplicate IDs/groups                                                                             | Deduplicated dataset                                                 |
| Workflow     | 1. Call module(dataset) or module.identify_duplicates(dataset) 2. Call module.remove(dataset, duplicates)     | 1. Call module(dataset) 2. Returns the final deduplicated dataset    |
| Use Case     | When you want to inspect duplicates first                                                                     | When you want direct deduplication                                   |

Important Notes:

  • Exact deduplication: Returns documents with _hashes field when perform_removal=False

  • Fuzzy deduplication: Returns documents with group field when perform_removal=False

  • Always check if the result is None (no duplicates found) before calling .remove()
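To make the two-step mode concrete, the sketch below identifies duplicates, inspects them, and only then removes them. It assumes a module constructed with perform_removal=False and a DocumentDataset named dataset; the .df attribute is used here to peek at the underlying dataframe.

# Two-step workflow sketch (assumes `exact_duplicates` was created with
# perform_removal=False and `dataset` is a DocumentDataset).
duplicates = exact_duplicates.identify_duplicates(dataset)

if duplicates is None:
    print("No duplicates found")
    deduplicated_dataset = dataset
else:
    # Inspect before removing: for exact dedup the result carries a `_hashes`
    # column (for fuzzy dedup it carries a `group` column instead).
    print(duplicates.df.head())
    deduplicated_dataset = exact_duplicates.remove(dataset, duplicates)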


Usage#

Exact Duplicate Removal#

from nemo_curator import ExactDuplicates, AddId
from nemo_curator.datasets import DocumentDataset

# Load the dataset
dataset = DocumentDataset.read_parquet(
    input_files="/path/to/parquet/data",
    backend="cudf",  # or "pandas" for CPU
)

# Add unique IDs if needed
add_id = AddId(id_field="my_id", id_prefix="doc_prefix")
id_dataset = add_id(dataset)

# Set up duplicate removal
exact_duplicates = ExactDuplicates(
  id_field="my_id",
  text_field="text",
  hash_method="md5",  # Currently only "md5" is supported
  perform_removal=True,  # If True, returns deduplicated dataset; if False, returns duplicate IDs
  cache_dir="/path/to/dedup_outputs",
)

# Process the dataset
deduplicated_dataset = exact_duplicates(id_dataset)

# Alternative workflow when perform_removal=False:
# exact_duplicates = ExactDuplicates(
#     id_field="my_id",
#     text_field="text", 
#     hash_method="md5",
#     perform_removal=False,  # Returns duplicate IDs only
#     cache_dir="/path/to/dedup_outputs",
# )
# duplicates = exact_duplicates(id_dataset)  # Get duplicate IDs
# if duplicates is not None:
#     deduplicated_dataset = exact_duplicates.remove(id_dataset, duplicates)  # Remove duplicates
# else:
#     print("No duplicates found")
#     deduplicated_dataset = id_dataset

For a complete example, see examples/exact_deduplication.py.

# Add IDs if needed
add_id \
  --id-field-name="my_id" \
  --input-data-dir=/path/to/data \
  --id-prefix="doc_prefix"

# Remove exact duplicates
gpu_exact_dups \
  --input-data-dirs /path/to/jsonl/dir1 /path/to/jsonl/dir2 \
  --output-dir /path/to/output_dir \
  --input-json-text-field text \
  --input-json-id-field my_id \
  --log-dir ./

The CLI utilities only work with JSONL datasets and GPU-based backends. For other formats, use the Python API.

Fuzzy Duplicate Removal#

from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig
from nemo_curator.datasets import DocumentDataset

# Configure the duplicate removal
config = FuzzyDuplicatesConfig(
    cache_dir="/path/to/dedup_outputs",
    id_field="my_id",
    text_field="text",
    perform_removal=True,  # If True, returns deduplicated dataset; if False, returns duplicate IDs
    seed=42,
    char_ngrams=24,
    num_buckets=20,
    hashes_per_bucket=13,
    use_64_bit_hash=False,  # Set to True for 64-bit hashes
    false_positive_check=False,  # Set to True for higher accuracy but slower processing
)

# Initialize and run
fuzzy_duplicates = FuzzyDuplicates(
    config=config,
    logger="./",  # Optional: path to log directory or existing logger
)
dataset = DocumentDataset.read_json(
    input_files="/path/to/jsonl/data",
    backend="cudf",  # Fuzzy deduplication requires cuDF backend
)
deduplicated_dataset = fuzzy_duplicates(dataset)

# Alternative workflow when perform_removal=False:
# config.perform_removal = False
# fuzzy_duplicates = FuzzyDuplicates(config=config)
# duplicates = fuzzy_duplicates.identify_duplicates(dataset)  # Get duplicate groups
# if duplicates is not None:
#     deduplicated_dataset = fuzzy_duplicates.remove(dataset, duplicates)  # Remove duplicates
# else:
#     print("No duplicates found")
#     deduplicated_dataset = dataset

For best performance:

  • Set false_positive_check=False for faster processing (may have ~5% false positives)

  • The default parameters target approximately 0.8 Jaccard similarity (see the sketch after this list)

  • Use buckets_per_shuffle=1 for memory-constrained environments

  • Clear the cache directory between runs to avoid conflicts

  • Use GPU backend (backend="cudf") for optimal performance
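To make the similarity target above concrete, the sketch below computes the Jaccard similarity between the character n-gram sets of two documents; pairs scoring near the 0.8 mark are what the default fuzzy parameters are tuned to catch. The helper functions are illustrative only and are not part of the NeMo Curator API.

def char_ngrams(text: str, n: int = 24) -> set:
    """Set of character n-grams of length n (24 matches char_ngrams above)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: str, b: str, n: int = 24) -> float:
    """Jaccard similarity of the two documents' character n-gram sets."""
    sa, sb = char_ngrams(a, n), char_ngrams(b, n)
    return len(sa & sb) / len(sa | sb)

doc_a = "The quick brown fox jumps over the lazy dog near the quiet river bank."
doc_b = "The quick brown fox jumps over the lazy dog near the quiet river bend."
print(f"Jaccard similarity: {jaccard(doc_a, doc_b):.2f}")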

For a complete example, see examples/fuzzy_deduplication.py.

Fuzzy duplicate removal via the CLI involves several sequential steps:

# 1. Compute MinHash signatures
gpu_compute_minhashes \
  --input-data-dirs /path/to/jsonl/dir \
  --output-minhash-dir /path/to/output_minhashes \
  --input-json-text-field text \
  --input-json-id-field my_id \
  --minhash-length 256 \
  --char-ngram 24 \
  --seed 42

# 2. Generate LSH buckets
minhash_buckets \
  --input-data-dirs /path/to/output_minhashes \
  --output-bucket-dir /path/to/dedup_output \
  --input-minhash-field _minhash_signature \
  --input-json-id-field my_id \
  --num-bands 20

# 3. Generate edges from buckets
buckets_to_edges \
  --input-bucket-dir /path/to/dedup_output/_buckets.parquet \
  --output-dir /path/to/dedup_output \
  --input-json-id-field my_id

# 4. Find connected components
gpu_connected_component \
  --jaccard-pairs-path /path/to/dedup_output/_edges.parquet \
  --output-dir /path/to/dedup_output \
  --cache-dir /path/to/cc_cache \
  --input-json-id-field my_id

For more advanced configurations including similarity verification, refer to the full documentation.
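The banding parameters control how aggressively LSH pairs documents up. Under the standard LSH banding model, with b bands of r hashes each, two documents with Jaccard similarity s land in the same bucket for at least one band with probability 1 - (1 - s^r)^b, and the curve's midpoint is roughly (1/b)^(1/r), which is about 0.79 for 20 bands of 13 hashes and is where the "approximately 0.8" target comes from. The sketch below evaluates that curve, assuming --num-bands/num_buckets corresponds to b and hashes_per_bucket to r.

# Standard LSH banding estimate: probability that two documents with Jaccard
# similarity s share at least one LSH bucket, given b bands of r hashes each.
def candidate_probability(s: float, b: int = 20, r: int = 13) -> float:
    return 1.0 - (1.0 - s**r) ** b

for s in (0.5, 0.7, 0.8, 0.9):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s):.3f}")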

Incremental Processing#

For new data additions, you don’t need to reprocess existing documents:

  1. Organize new data in separate directories

  2. Compute MinHash signatures only for new data

  3. Run subsequent steps on all data (existing and new MinHash signatures)

gpu_compute_minhashes \
  --input-data-dirs /input/new_data \
  --output-minhash-dir /output/ \
  --input-json-text-field text \
  --input-json-id-field my_id \
  --minhash-length 256 \
  --char-ngram 24

Then proceed with the remaining steps as usual on the combined MinHash directories.

Performance and GPU Requirements#

GPU Acceleration Overview#

  • Exact Deduplication:

    • Backend Support: Both CPU (pandas) and GPU (cudf)

    • GPU Benefits: Significant speedup for large datasets through optimized hashing

    • Recommendation: Use GPU for datasets with >1M documents

  • Fuzzy Deduplication:

    • Backend Support: GPU only (cudf required)

    • GPU Benefits: Essential for MinHash and LSH operations

    • Memory: Requires sufficient GPU memory for dataset processing
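A simple way to honor these backend constraints at runtime is to fall back to pandas when cuDF is not installed; fuzzy deduplication still requires the cuDF path. This is a convenience sketch, not a NeMo Curator API.

from nemo_curator.datasets import DocumentDataset

# Pick a dataframe backend based on what is installed (sketch only).
try:
    import cudf  # noqa: F401  # available only in GPU installs
    backend = "cudf"
except ImportError:
    backend = "pandas"  # fine for exact dedup; fuzzy dedup requires "cudf"

dataset = DocumentDataset.read_json("/path/to/jsonl/data", backend=backend)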

Performance Characteristics#

Table 8 Performance Comparison#

| Method      | Small Datasets (<100K docs) | Medium Datasets (100K-1M docs) | Large Datasets (>1M docs) |
|-------------|-----------------------------|--------------------------------|---------------------------|
| Exact (CPU) | Fast                        | Moderate                       | Slow                      |
| Exact (GPU) | Fast                        | Fast                           | Fast                      |
| Fuzzy (GPU) | Fast                        | Fast                           | Fast                      |

Hardware Recommendations#

  • CPU-only environments: Use exact deduplication with backend="pandas"

  • GPU environments: Use both exact and fuzzy deduplication with backend="cudf"

  • Memory considerations: GPU memory should be >2x the dataset size in memory

  • Distributed processing: Use Dask for datasets that exceed single GPU memory (see the sketch below)
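NeMo Curator's modules operate on Dask-backed datasets, so attaching a Dask CUDA cluster before loading data spreads the work across multiple GPUs. The sketch below assumes the dask_cuda package is installed; adapt the device list to your machine.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per listed GPU; the deduplication modules then run
# across all workers instead of a single device.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)

# ...load the DocumentDataset with backend="cudf" and run the modules as shown above.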

Error Handling and Validation#

When working with deduplication modules, consider these common scenarios:

# Check for empty results
duplicates = exact_duplicates.identify_duplicates(dataset)
if duplicates is None or len(duplicates) == 0:
    print("No duplicates found")
    deduplicated_dataset = dataset
else:
    deduplicated_dataset = exact_duplicates.remove(dataset, duplicates)

# Validate backend compatibility
try:
    # Fuzzy deduplication requires cuDF backend
    fuzzy_duplicates = FuzzyDuplicates(config=config)
    result = fuzzy_duplicates(dataset)
except ValueError as e:
    print(f"Backend error: {e}")
    # Convert to cuDF backend if needed
    dataset = dataset.to_backend("cudf")
    result = fuzzy_duplicates(dataset)

# Handle cache directory issues
import os
if os.path.exists(config.cache_dir):
    print(f"Warning: Cache directory {config.cache_dir} exists and will be reused")
    # Clear if needed: import shutil; shutil.rmtree(config.cache_dir)