Semantic Deduplication#
Detect and remove semantically redundant data from your large text datasets using NeMo Curator.
Unlike exact or fuzzy deduplication, which focus on textual similarity, semantic deduplication leverages the meaning of content to identify duplicates. This approach can significantly reduce dataset size while maintaining or even improving model performance.
Semantic deduplication is particularly effective for large, uncurated web-scale datasets, where it can remove up to 50% of the data with minimal performance impact. The technique uses embeddings to identify “semantic duplicates” - content pairs that convey similar meaning despite using different words.
Note
GPU Acceleration: Semantic deduplication requires GPU acceleration for both embedding generation and clustering operations. This method uses cuDF for GPU-accelerated dataframe operations and PyTorch models on GPU for optimal performance.
How It Works#
The SemDeDup algorithm consists of the following main steps:
Embedding Generation: Each document is embedded using a pre-trained model
Clustering: The embeddings are grouped into k clusters using k-means clustering
Similarity Computation: Within each cluster, pairwise cosine similarities are computed
Duplicate Identification: Document pairs with cosine similarity above a threshold are considered semantic duplicates
Duplicate Removal: From each group of semantic duplicates within a cluster, one representative document is kept (typically the one with the lowest cosine similarity to the cluster centroid) and the rest are removed
Note
NeMo Curator implements methods based on the paper SemDeDup: Data-efficient learning at web-scale through semantic deduplication by Abbas et al.
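The sketch below illustrates these steps end to end on a small in-memory array, using scikit-learn and NumPy rather than NeMo Curator's GPU implementation. The function name, the toy data, and the simplified keep-one-per-pair rule are assumptions for illustration; the real workflow selects representatives based on distance to the cluster centroid (see the which_to_keep parameter below).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def semdedup_sketch(embeddings: np.ndarray, n_clusters: int = 8, eps: float = 0.01) -> set[int]:
    """Return indices of documents to drop, following the steps described above."""
    # Step 2: group embeddings into k clusters
    labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(embeddings)

    to_remove: set[int] = set()
    for cluster in range(n_clusters):
        members = np.where(labels == cluster)[0]
        if len(members) < 2:
            continue
        # Step 3: pairwise cosine similarities within the cluster
        sims = cosine_similarity(embeddings[members])
        np.fill_diagonal(sims, 0.0)
        # Step 4: pairs above the threshold are semantic duplicates
        dup_i, dup_j = np.where(sims >= 1.0 - eps)
        # Step 5 (simplified): keep one document per duplicate pair, drop the other
        for i, j in zip(dup_i, dup_j):
            if i < j and members[i] not in to_remove and members[j] not in to_remove:
                to_remove.add(int(members[j]))
    return to_remove

# Toy run on random vectors (real usage operates on model embeddings)
print(semdedup_sketch(np.random.rand(200, 32), n_clusters=8, eps=0.05))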
Before You Start#
Before running semantic deduplication, ensure that each document in your dataset has a unique identifier. You can use the AddId stage from NeMo Curator if needed:
from nemo_curator.stages.text.modules import AddId
from nemo_curator.pipeline import Pipeline

# Create pipeline with ID generation
pipeline = Pipeline(name="add_ids_for_dedup")

# Add ID generation stage
pipeline.add_stage(
    AddId(
        id_field="doc_id",
        id_prefix="corpus"  # Optional prefix for meaningful IDs
    )
)
For more details on using AddId, refer to the Adding Document IDs documentation.
TextSemanticDeduplicationWorkflow Interface#
The TextSemanticDeduplicationWorkflow class provides a comprehensive end-to-end interface for semantic deduplication in NeMo Curator:
Key Parameters#
input_path: Path(s) to input files containing text data
output_path: Directory to write deduplicated output
cache_path: Directory to cache intermediate results (embeddings, kmeans, pairwise, etc.)
perform_removal: Whether to perform duplicate removal (True) or just identify duplicates (False)
text_field: Name of the text field in input data (default: "text")
id_field: Name of the ID field in the data
model_identifier: HuggingFace model identifier for embeddings
n_clusters: Number of clusters for K-means
eps: Epsilon value for duplicate identification
Usage Modes#
Mode 1: Two-step process (perform_removal=False)
# Step 1: Identify duplicates only
workflow = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="results/",
    perform_removal=False,  # Only identify duplicates
    eps=0.01
)
results = workflow.run(executor)
# Duplicates are saved to output_path/duplicates/
Mode 2: One-step process (perform_removal=True)
# Returns deduplicated dataset directly
workflow = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="results/",
    perform_removal=True,  # Complete deduplication
    eps=0.01
)
results = workflow.run(executor)
# Clean dataset saved to output_path/deduplicated/
Quick Start#
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow
from nemo_curator.backends import RayDataExecutor

# Option 1: Two-step process (more control)
workflow = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="./results",
    cache_path="./sem_cache",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    eps=0.07,  # Similarity threshold
    id_field="doc_id",
    perform_removal=False  # Only identify duplicates
)

# Run workflow
executor = RayDataExecutor()
results = workflow.run(executor)
# Duplicate IDs saved to ./results/duplicates/

# Option 2: One-step process (simpler)
workflow_simple = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="./results",
    cache_path="./sem_cache",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    eps=0.07,
    id_field="doc_id",
    perform_removal=True  # Complete deduplication
)
results = workflow_simple.run(executor)
# Clean dataset saved to ./results/deduplicated/
Configuration#
Semantic deduplication in NeMo Curator is configured through the TextSemanticDeduplicationWorkflow dataclass parameters. Here's how to configure the workflow:
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

# Configure workflow with parameters
workflow = TextSemanticDeduplicationWorkflow(
    # Input/Output configuration
    input_path="input_data/",
    output_path="results/",
    cache_path="semdedup_cache",  # Directory for intermediate files
    perform_removal=True,

    # Embedding generation parameters
    text_field="text",
    embedding_field="embeddings",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    embedding_max_seq_length=512,
    embedding_pooling="mean_pooling",
    embedding_model_inference_batch_size=256,

    # Semantic deduplication parameters
    n_clusters=100,  # Number of clusters for K-means
    id_field="id",
    distance_metric="cosine",
    which_to_keep="hard",
    eps=0.01,  # Similarity threshold

    # K-means clustering parameters
    kmeans_max_iter=300,
    kmeans_tol=1e-4,
    kmeans_random_state=42,
    kmeans_init="k-means||",
    pairwise_batch_size=1024,

    # I/O parameters
    input_filetype="jsonl",
    output_filetype="parquet",
    verbose=True
)
You can customize these parameters to suit your specific needs and dataset characteristics.
Note
Configuration Parameters: The above configuration shows the most commonly used parameters. For advanced use cases, additional parameters like embedding_max_chars (to control text truncation), kmeans_oversampling_factor (for K-means optimization), and further I/O parameters are available. See the complete parameter table below for all options.
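As a quick illustration, the sketch below adds these advanced parameters to the configuration shown above. The specific values are assumptions for illustration only, not tuned defaults, so check the parameter table and your NeMo Curator version before relying on them.
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

# Sketch: passing the advanced parameters mentioned in the note above.
# The values here are illustrative assumptions, not recommended defaults.
workflow = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="results/",
    cache_path="semdedup_cache",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    eps=0.01,
    embedding_max_chars=20_000,      # truncate very long documents before embedding
    kmeans_oversampling_factor=2.0,  # oversampling used by the k-means|| initialization
)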
Embedding Models#
You can choose alternative pre-trained models for embedding generation by changing the model_identifier parameter.
Sentence transformers are ideal for text-based semantic similarity tasks.
workflow = TextSemanticDeduplicationWorkflow(
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    # ... other parameters
)

workflow = TextSemanticDeduplicationWorkflow(
    model_identifier="facebook/opt-125m",
    # ... other parameters
)
You can also use your own pre-trained custom models by specifying the path (see the sketch after this list).
When changing the model, ensure that:
The model is compatible with the data type you're working with
You adjust the embedding_model_inference_batch_size parameter for your model's memory requirements
The chosen model is appropriate for the language or domain of your dataset
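For example, a minimal sketch that points the workflow at a local model directory; the path and batch size are placeholders, and this assumes your NeMo Curator version accepts a local directory for model_identifier:
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

workflow = TextSemanticDeduplicationWorkflow(
    model_identifier="/path/to/your/custom-embedding-model",  # local directory (placeholder path)
    embedding_model_inference_batch_size=64,  # smaller batches for larger, memory-hungry models
    # ... other parameters
)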
Deduplication Threshold#
The semantic deduplication process is controlled by the similarity threshold parameter:
workflow = TextSemanticDeduplicationWorkflow(
    # ... other parameters
    eps=0.01  # Similarity threshold
)
eps: The similarity threshold used for identifying duplicates. This value determines how similar documents need to be to be considered duplicates. Lower values are more strict, requiring higher similarity between documents before they are flagged as duplicates.
When choosing an appropriate threshold:
Lower thresholds (for example, 0.001): More strict, resulting in less deduplication but higher confidence in the identified duplicates
Higher thresholds (for example, 0.1): Less strict, leading to more aggressive deduplication but potentially removing documents that are only somewhat similar
We recommend experimenting with different threshold values to find the optimal balance between data reduction and maintaining dataset diversity and quality. The impact of this threshold can vary depending on the nature and size of your dataset.
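As a mental model, eps behaves like a cosine-distance cutoff: a pair is treated as a duplicate when its similarity reaches at least 1 - eps. The helper below only illustrates that relationship (an assumption about the exact comparison, not NeMo Curator's internal code):
def is_semantic_duplicate(cosine_similarity: float, eps: float) -> bool:
    # eps=0.01 requires similarity >= 0.99; eps=0.1 only requires >= 0.9
    return cosine_similarity >= 1.0 - eps

print(is_semantic_duplicate(0.95, eps=0.01))  # False: strict threshold, pair is kept
print(is_semantic_duplicate(0.95, eps=0.10))  # True: looser threshold, pair is flagged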
Usage#
You can use the TextSemanticDeduplicationWorkflow class to perform all steps:
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow
from nemo_curator.backends import RayDataExecutor

# Initialize workflow with configuration
workflow = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="results/",
    cache_path="cache/",
    text_field="text",
    id_field="doc_id",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=100,
    eps=0.01,
    perform_removal=False,  # Two-step process
    verbose=True
)

# Create executor
executor = RayDataExecutor()

# Two-step semantic deduplication process
# Step 1: Identify duplicates (saves duplicate IDs to output_path/duplicates/)
results = workflow.run(executor)

# Step 2: For manual removal, set perform_removal=True and re-run
workflow.perform_removal = True
final_results = workflow.run(executor)
# Clean dataset saved to output_path/deduplicated/

# Alternative: One-step process
workflow_onestep = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="results/",
    cache_path="cache/",
    id_field="doc_id",
    perform_removal=True  # Complete deduplication
)
results = workflow_onestep.run(executor)
This approach allows for easy experimentation with different configurations and models without changing the core code.
Tip
Flexible Interface: The TextSemanticDeduplicationWorkflow class supports both one-step and two-step workflows:
Use perform_removal=True for direct deduplication (saves the clean dataset)
Use perform_removal=False for manual control over the removal process (saves duplicate IDs only)
This interface provides comprehensive end-to-end semantic deduplication capabilities.
For advanced users who need fine-grained control over each stage, semantic deduplication can be broken down into separate components:
import os

from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
# RankingStrategy and FilePartitioningStage are used below; their import paths may vary by NeMo Curator version
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow, IdentifyDuplicatesStage, RankingStrategy
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

# Example paths for this walkthrough; adjust to your environment
input_path = "input_data/"
output_path = "results/"
embedding_output_path = os.path.join(output_path, "embeddings")
semantic_workflow_path = os.path.join(output_path, "semantic")
duplicates_output_path = os.path.join(output_path, "duplicates")
id_generator_path = "semantic_id_generator.json"  # persisted via write_id_generator_to_disk (see Advanced Configuration)

# Step 1: Create ID generator for consistent tracking
create_id_generator_actor()
# Step 2: Generate embeddings separately
embedding_pipeline = Pipeline(
    name="embedding_pipeline",
    stages=[
        ParquetReader(file_paths=input_path, files_per_partition=1, fields=["text"], _generate_ids=True),
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text",
            max_seq_length=None,
            max_chars=None,
            embedding_pooling="mean_pooling",
            model_inference_batch_size=256,
        ),
        ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"]),
    ],
)
embedding_out = embedding_pipeline.run()
# Step 3: Run clustering and pairwise similarity (without duplicate identification)
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=semantic_workflow_path,
    n_clusters=100,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
    ranking_strategy=RankingStrategy(metadata_cols=["cosine_dist_to_cent"], ascending=True),
    eps=None,  # Skip duplicate identification for analysis
)
semantic_out = semantic_workflow.run()
# Step 4: Analyze similarity distribution to choose eps
import pandas as pd
import numpy as np
from collections import Counter
from functools import reduce
pairwise_path = os.path.join(semantic_workflow_path, "pairwise_results")
def get_bins(df: pd.DataFrame, num_bins: int = 1_000) -> dict[float, int]:
    bins = np.linspace(0, 1.01, num_bins)
    return Counter(
        pd.cut(df["cosine_sim_score"], bins=bins, labels=bins[1:], retbins=False, include_lowest=True, right=True)
        .value_counts()
        .to_dict()
    )
# Analyze similarity distribution across all clusters
similarity_across_dataset = reduce(
    lambda x, y: x + y,
    [
        get_bins(pd.read_parquet(os.path.join(pairwise_path, f), columns=["cosine_sim_score"]), num_bins=1000)
        for f in os.listdir(pairwise_path)
    ],
)
# Plot distribution to choose appropriate eps
import matplotlib.pyplot as plt
plt.ecdf(x=list(similarity_across_dataset.keys()), weights=list(similarity_across_dataset.values()))
plt.xlabel("Cosine Similarity")
plt.ylabel("Ratio of dataset below the similarity score")
plt.show()
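# (Illustrative addition, not a NeMo Curator API.) One way to turn the distribution
# above into a candidate eps: find the similarity score exceeded by the top
# `target_frac` most-similar pairs and set eps = 1 - that score. This assumes
# duplicates are pairs with cosine_sim_score >= 1 - eps.
target_frac = 0.05  # assumed fraction of pairs to treat as duplicates
total_pairs = sum(similarity_across_dataset.values())
cumulative = 0
for sim in sorted(similarity_across_dataset, reverse=True):
    cumulative += similarity_across_dataset[sim]
    if cumulative / total_pairs >= target_frac:
        print(f"Candidate eps (illustrative): {1 - float(sim):.4f}")
        break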
# Step 5: Identify duplicates with chosen eps
duplicates_pipeline = Pipeline(
    name="identify_duplicates_pipeline",
    stages=[
        FilePartitioningStage(file_paths=pairwise_path, files_per_partition=1),
        IdentifyDuplicatesStage(output_path=duplicates_output_path, eps=0.1),
    ],
)
identify_duplicates_out = duplicates_pipeline.run()
# Step 6: Remove duplicates from original dataset
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_output_path,
    output_path=os.path.join(output_path, "deduplicated"),
    input_filetype="parquet",
    input_fields=["text"],
    input_files_per_partition=1,
    output_filetype="parquet",
    output_fields=["text", "_curator_dedup_id"],
    ids_to_remove_duplicate_id_field="id",
    id_generator_path=id_generator_path,
)
removal_out = removal_workflow.run()
This step-by-step approach provides:
Analysis capabilities: Inspect intermediate results (embeddings, clusters, pairwise similarities)
Parameter tuning: Choose an optimal eps based on similarity distribution analysis
Flexibility: Run different stages on different machines or with different configurations
Debugging: Isolate issues to specific stages
Embedding Creation:
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends import RayDataExecutor

# Create pipeline for embedding generation
pipeline = Pipeline(name="embedding_generation")

# Add embedding stage
embedding_stage = EmbeddingCreatorStage(
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    text_field="text",
    embedding_field="embeddings",
    model_inference_batch_size=256
)
pipeline.add_stage(embedding_stage)

# Run pipeline
executor = RayDataExecutor()
results = pipeline.run(executor)
Semantic Deduplication Workflow:
from nemo_curator.stages.deduplication.semantic.workflow import SemanticDeduplicationWorkflow

# Perform semantic deduplication on pre-generated embeddings
workflow = SemanticDeduplicationWorkflow(
    input_path="path/to/embeddings/",  # Directory with embedding parquet files
    output_path="path/to/output/",
    n_clusters=100,
    id_field="doc_id",
    embedding_field="embeddings",
    which_to_keep="hard",
    pairwise_batch_size=1024,
    eps=0.01,  # Similarity threshold for duplicate identification
    verbose=True
)

# Run the workflow
results = workflow.run(pairwise_executor=executor)

# Results contain timing information and duplicate counts
print(f"Total duplicates identified: {results['total_duplicates_identified']}")
print(f"Execution time: {results['total_execution_time']:.2f} seconds")
# Duplicate IDs are saved to output_path/duplicates/
Comparison with Other Deduplication Methods#
| Method | Return Value Options | perform_removal Parameter | Workflow |
|---|---|---|---|
| ExactDuplicates | Duplicates or Clean Dataset | ✅ Available | One-step or two-step |
| FuzzyDuplicates | Duplicates or Clean Dataset | ✅ Available | One-step or two-step |
| TextSemanticDeduplicationWorkflow | Duplicates or Clean Dataset | ✅ Available | One-step or two-step |
Key Parameters#
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_identifier | str | "sentence-transformers/all-MiniLM-L6-v2" | Pre-trained model for embedding generation |
| embedding_model_inference_batch_size | int | 256 | Number of samples per embedding batch |
| n_clusters | int | 100 | Number of clusters for k-means clustering |
| kmeans_max_iter | int | 300 | Maximum iterations for clustering |
| eps | float | 0.01 | Threshold for deduplication (higher = more aggressive) |
| which_to_keep | str | "hard" | Strategy for keeping duplicates ("hard"/"easy"/"random") |
| pairwise_batch_size | int | 1024 | Batch size for similarity computation |
| distance_metric | str | "cosine" | Distance metric for similarity ("cosine" or "l2") |
| embedding_pooling | str | "mean_pooling" | Pooling strategy ("mean_pooling" or "last_token") |
| perform_removal | bool | True | Whether to perform duplicate removal |
| text_field | str | "text" | Name of the text field in input data |
| id_field | str | "id" | Name of the ID field in the data |
Output Format#
The semantic deduplication process produces the following directory structures under your configured cache_path and output_path:
cache_path/
├── embeddings/                          # Embedding outputs
│   └── *.parquet                        # Parquet files containing document embeddings
└── semantic_dedup/                      # Semantic deduplication cache
    ├── kmeans_results/                  # K-means clustering outputs
    │   ├── kmeans_centroids.npy         # Cluster centroids
    │   └── embs_by_nearest_center/      # Embeddings organized by cluster
    │       └── nearest_cent={0..n-1}/   # Subdirectories for each cluster
    │           └── *.parquet            # Cluster member embeddings
    └── pairwise_results/                # Pairwise similarity results
        └── *.parquet                    # Similarity scores by cluster

output_path/
├── duplicates/                          # Duplicate identification results
│   └── *.parquet                        # Document IDs to remove
└── deduplicated/                        # Final clean dataset (if perform_removal=True)
    └── *.parquet                        # Deduplicated documents
File Formats#
Document Embeddings (embeddings/*.parquet):
Contains document IDs and their vector embeddings
Format: Parquet files with columns: [id_column, embedding_column]

Cluster Assignments (kmeans_results/):
kmeans_centroids.npy: NumPy array of cluster centers
embs_by_nearest_center/: Parquet files containing cluster members
Format: Parquet files with columns: [id_column, embedding_column, cluster_id]

Duplicate Identification Results (output_path/duplicates/*.parquet):
Final output containing document IDs to remove after deduplication
Format: Parquet files with columns: ["id"]
Important: Contains only the IDs of documents to remove, not the full document content
When perform_removal=True, the clean dataset is saved to output_path/deduplicated/
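Before removing anything, you can sanity-check the duplicate IDs by reading the parquet files directly. A minimal sketch, assuming duplicates were written to results/duplicates/ with the ["id"] schema described above:
import pandas as pd

# Load every duplicate-ID parquet file written by the workflow
duplicate_ids = pd.read_parquet("results/duplicates/")  # assumed output location from the examples above

print(f"Documents flagged for removal: {len(duplicate_ids)}")
print(duplicate_ids["id"].head())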
Typically, semantic deduplication reduces dataset size by 20–50% while maintaining or improving model performance.
Advanced Configuration#
ID Generator for Large-Scale Operations#
For large-scale duplicate removal operations, the ID Generator is essential for consistent document tracking across workflow stages:
from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    write_id_generator_to_disk,
    kill_id_generator_actor
)
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

# Create and manage ID generator
create_id_generator_actor()

# Persist ID generator state for later use
id_generator_path = "semantic_id_generator.json"
write_id_generator_to_disk(id_generator_path)
kill_id_generator_actor()

# Use persisted ID generator in removal workflow
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_path,
    output_path=output_path,
    id_generator_path=id_generator_path,
    # Important: Match the same partitioning as embedding generation
    input_files_per_partition=1,
    # ... other parameters
)
Critical Requirements:
The same input configuration (file paths, partitioning) must be used across all stages
ID consistency is maintained by hashing filenames in each task
Mismatched partitioning between stages will cause ID lookup failures
Ray Backend Configuration#
For optimal performance with large datasets, configure the Ray backend appropriately:
from nemo_curator.core.client import RayClient

# Configure Ray cluster for semantic deduplication
client = RayClient(
    num_cpus=64,  # Adjust based on available cores
    num_gpus=4    # Total GPU memory should be roughly 2x the size of the embeddings
)
client.start()

try:
    # Run your semantic deduplication workflow
    workflow = TextSemanticDeduplicationWorkflow(
        input_path=input_path,
        output_path=output_path,
        cache_path=cache_path,
        # ... other parameters
    )
    results = workflow.run()
finally:
    client.stop()
The Ray backend provides:
Distributed processing across multiple GPUs
Memory management for large embedding matrices
Fault tolerance for long-running workflows
Performance Considerations#
Semantic deduplication is computationally intensive, especially for large datasets. However, the benefits in terms of reduced training time and improved model performance often outweigh the upfront cost:
Use GPU acceleration for faster embedding generation and clustering
Adjust the number of clusters (n_clusters) based on your dataset size and available resources
The eps parameter controls the trade-off between dataset size reduction and potential information loss
Using batched cosine similarity significantly reduces memory requirements for large datasets (see the sketch after this list)
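To show why batching matters, here is a small, self-contained sketch of batched cosine similarity in PyTorch. It illustrates the general technique only (the function name, shapes, and batch size are arbitrary) and is not NeMo Curator's internal implementation:
import torch

def max_pairwise_similarity(embeddings: torch.Tensor, batch_size: int = 1024) -> torch.Tensor:
    """For each document, the highest cosine similarity to any other document."""
    normalized = torch.nn.functional.normalize(embeddings, dim=1)
    n = normalized.shape[0]
    max_sims = torch.empty(n, device=embeddings.device)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # Only a (batch_size x N) block of scores is in memory at a time,
        # instead of the full N x N similarity matrix.
        scores = normalized[start:end] @ normalized.T
        scores[torch.arange(end - start), torch.arange(start, end)] = -1.0  # ignore self-similarity
        max_sims[start:end] = scores.max(dim=1).values
    return max_sims

# Toy run: 4,096 random "embeddings" of dimension 384
print(max_pairwise_similarity(torch.randn(4096, 384), batch_size=1024)[:5])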
GPU Requirements#
Hardware Prerequisites:
NVIDIA GPU with CUDA support
Sufficient GPU memory (recommended: >8GB for medium datasets)
RAPIDS libraries (cuDF) for GPU-accelerated dataframe operations
Backend Requirements:
Required: cuDF for GPU-accelerated dataframe operations
Not supported: CPU-only processing (use hash-based deduplication instead)
Performance Characteristics:
Embedding Generation: GPU-accelerated using PyTorch models
Clustering: GPU-accelerated k-means clustering
Similarity Computation: Batched GPU operations for cosine similarity
| Dataset Size | GPU Memory Required | Processing Time | Recommended GPUs |
|---|---|---|---|
| <100K docs | 4-8 GB | 1-2 hours | RTX 3080, A100 |
| 100K-1M docs | 8-16 GB | 2-8 hours | RTX 4090, A100 |
| >1M docs | 16+ GB | 8+ hours | A100, H100 |
For very large datasets, consider distributed processing across multiple GPUs or use incremental processing approaches.
For more details on the algorithm and its performance implications, refer to the original paper: SemDeDup: Data-efficient learning at web-scale through semantic deduplication by Abbas et al.