Deduplication#

Remove duplicate and near-duplicate documents efficiently from your text datasets using NeMo Curator’s GPU-accelerated exact, fuzzy, and semantic deduplication modules.

Removing duplicates improves language model training by preventing overrepresentation of repeated content. NeMo Curator provides multiple approaches to deduplication, from exact hash-based matching to semantic similarity detection using embeddings.

How It Works#

NeMo Curator offers three main approaches to deduplication:

  1. Exact Deduplication: Uses document hashing to identify identical content

  2. Fuzzy Deduplication: Uses MinHash and LSH to find near-duplicate content

  3. Semantic Deduplication: Uses embeddings to identify semantically similar content

Each approach serves different use cases and offers different trade-offs between speed, accuracy, and the types of duplicates detected.
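
Each method targets a different similarity signal. The conceptual sketch below (plain Python, not NeMo Curator internals) contrasts the first two: an exact content hash versus the Jaccard similarity of word shingles, which is the quantity that MinHash and LSH approximate at scale; semantic deduplication replaces shingles with embedding-space distance.

import hashlib

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"

# Exact deduplication keys on a content hash: any change breaks the match
hash_a = hashlib.md5(doc_a.encode()).hexdigest()
hash_b = hashlib.md5(doc_b.encode()).hexdigest()
print(hash_a == hash_b)  # False: a single changed word defeats exact matching

# Fuzzy deduplication targets Jaccard similarity over word shingles;
# MinHash and LSH approximate this score without comparing every pair
def shingles(text, n=3):
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

a, b = shingles(doc_a), shingles(doc_b)
print(len(a & b) / len(a | b))  # 0.4: real overlap that exact hashing cannot see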


Deduplication Methods#

  • Hash-Based Deduplication: Remove exact and fuzzy duplicates using hashing algorithms (covered in Hash-Based Duplicate Removal)

  • Semantic Deduplication: Remove semantically similar documents using embeddings

Usage#

Here’s a quick comparison of the different deduplication approaches:

Table 5 Deduplication Method Comparison#

| Method | Best For | Speed | Duplicate Types Detected | GPU Required |
|---|---|---|---|---|
| Exact Deduplication | Identical copies | Very Fast | Character-for-character matches | Optional |
| Fuzzy Deduplication | Near-duplicates with small changes | Fast | Content with minor edits, reformatting | Required |
| Semantic Deduplication | Similar meaning, different words | Moderate | Paraphrases, translations, rewrites | Required |

Quick Start Example#

from nemo_curator import ExactDuplicates, FuzzyDuplicates, SemDedup
from nemo_curator.datasets import DocumentDataset

# Load your dataset
# Note: Use "cudf" backend for GPU acceleration, "pandas" for CPU
dataset = DocumentDataset.read_json("input_data/*.jsonl", backend="cudf")

# Option 1: Exact deduplication (CPU/GPU flexible)
exact_dedup = ExactDuplicates(
    id_field="doc_id",
    text_field="text",
    perform_removal=True
)
# Works with both "cudf" (GPU) and "pandas" (CPU) backends
deduplicated = exact_dedup(dataset)

# Option 2: Fuzzy deduplication (requires GPU)
from nemo_curator import FuzzyDuplicatesConfig
fuzzy_config = FuzzyDuplicatesConfig(
    cache_dir="./fuzzy_cache",
    id_field="doc_id", 
    text_field="text",
    perform_removal=True
)
fuzzy_dedup = FuzzyDuplicates(config=fuzzy_config)
# Requires cudf backend (GPU)
deduplicated = fuzzy_dedup(dataset)

# Option 3: Semantic deduplication (requires GPU)
from nemo_curator import SemDedupConfig
sem_config = SemDedupConfig(
    cache_dir="./sem_cache",
    embedding_model_name_or_path="sentence-transformers/all-MiniLM-L6-v2"
)
sem_dedup = SemDedup(config=sem_config, id_column="doc_id", perform_removal=True)
# Requires cudf backend (GPU)
deduplicated = sem_dedup(dataset)
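
Each option returns a DocumentDataset, so the result can be written back out with the dataset’s writers. A minimal sketch, assuming the to_json writer and a placeholder output directory; note that with perform_removal=False the modules instead return the duplicate document IDs for you to remove yourself.

# Persist the deduplicated dataset as JSONL files (output path is a placeholder)
deduplicated.to_json("output_data/", write_to_filename=False)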

Performance Considerations#

GPU Acceleration#

  • Exact deduplication: Supports both CPU (pandas) and GPU (cudf) backends. GPU provides a significant speedup on large datasets through optimized hashing operations; backend selection is sketched after this list

  • Fuzzy deduplication: Requires GPU backend for MinHash and LSH operations. GPU acceleration is essential for processing large datasets efficiently

  • Semantic deduplication: Requires GPU backend for embedding generation and clustering operations. GPU acceleration is critical for feasible processing times
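
As a concrete illustration of these backend constraints, the sketch below (paths and field names are placeholders) loads data once per backend: the pandas-backed dataset is enough for exact deduplication on CPU, while FuzzyDuplicates and SemDedup expect the cuDF-backed one.

from nemo_curator import ExactDuplicates
from nemo_curator.datasets import DocumentDataset

# CPU path: the pandas backend is sufficient for exact deduplication
cpu_dataset = DocumentDataset.read_json("input_data/*.jsonl", backend="pandas")
exact_dedup = ExactDuplicates(id_field="doc_id", text_field="text", perform_removal=True)
cpu_result = exact_dedup(cpu_dataset)

# GPU path: the cudf backend is required before calling FuzzyDuplicates or SemDedup
gpu_dataset = DocumentDataset.read_json("input_data/*.jsonl", backend="cudf")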

Hardware Requirements#

  • CPU-only workflows: Only exact deduplication is available

  • GPU workflows: All three methods available. Recommended for large-scale data processing

  • Memory considerations: GPU memory requirements scale with dataset size and embedding dimensions

For very large datasets (TB-scale), consider running deduplication on distributed GPU clusters.
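
NeMo Curator modules run on whichever Dask client is active, so scaling out is largely a matter of where that client points. A minimal sketch, assuming the get_client helper from nemo_curator.utils.distributed_utils; the scheduler address is a placeholder for your own cluster:

from nemo_curator.utils.distributed_utils import get_client

# Local machine: start a Dask cluster with one worker per visible GPU
client = get_client(cluster_type="gpu")

# Existing multi-node GPU cluster: attach to its scheduler instead (address is hypothetical)
# client = get_client(scheduler_address="tcp://scheduler-host:8786")

# ... run the deduplication modules as usual, then shut the client down
client.close()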