Semantic Deduplication
Detect and remove semantically redundant data from your large text datasets using NeMo Curator.
Unlike exact or fuzzy deduplication, which focus on textual similarity, semantic deduplication leverages the meaning of content to identify duplicates. This approach can significantly reduce dataset size while maintaining or even improving model performance.
The technique uses embeddings to identify “semantic duplicates” - content pairs that convey similar meaning despite using different words.
GPU Acceleration: Semantic deduplication requires GPU acceleration for both embedding generation and clustering operations. This method uses cuDF for GPU-accelerated dataframe operations and PyTorch models on GPU for optimal performance.
How It Works
Semantic deduplication identifies meaning-based duplicates using embeddings:
- Generates embeddings for each document using transformer models
- Clusters embeddings using K-means
- Computes pairwise cosine similarities within clusters
- Identifies semantic duplicates based on similarity threshold
- Removes duplicates, keeping one representative per group
Based on SemDeDup: Data-efficient learning at web-scale through semantic deduplication by Abbas et al.
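The steps above can be sketched in plain NumPy. This is an illustrative toy, not the NeMo Curator implementation: the embeddings are hand-built stand-ins for transformer output, and the clustering is a minimal Lloyd's iteration with a fixed initialization.

```python
import numpy as np

# Toy embeddings for five documents (stand-ins for transformer output).
# Docs 1 and 3 are near-duplicates of docs 0 and 2 respectively.
embeddings = np.array([
    [1.0,   0.0,   0.0],
    [0.999, 0.02,  0.0],   # semantically ~= doc 0
    [0.0,   1.0,   0.0],
    [0.0,   0.999, 0.03],  # semantically ~= doc 2
    [0.0,   0.0,   1.0],
])
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Step 2: K-means (minimal Lloyd iteration, fixed init for reproducibility)
k = 3
centroids = embeddings[[0, 2, 4]].copy()
for _ in range(10):
    labels = np.argmax(embeddings @ centroids.T, axis=1)  # nearest centroid by cosine
    for c in range(k):
        members = embeddings[labels == c]
        if len(members):
            centroid = members.mean(axis=0)
            centroids[c] = centroid / np.linalg.norm(centroid)

# Steps 3-5: within each cluster, flag later documents whose cosine
# similarity to a kept document is at least 1 - eps, keeping one
# representative per duplicate group.
eps = 0.05
duplicates = set()
for c in range(k):
    idx = np.flatnonzero(labels == c)
    sims = embeddings[idx] @ embeddings[idx].T  # pairwise cosine similarities
    for i in range(len(idx)):
        if int(idx[i]) in duplicates:
            continue  # already marked for removal
        for j in range(i + 1, len(idx)):
            if sims[i, j] >= 1 - eps:
                duplicates.add(int(idx[j]))

print(sorted(duplicates))  # → [1, 3]
```

Docs 1 and 3 are flagged for removal while one representative of each pair (docs 0 and 2) and the unique doc 4 survive.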
Before You Start
Prerequisites:
- GPU acceleration (required for embedding generation and clustering)
- Stable document identifiers for removal (either existing IDs or IDs managed by the workflow and removal stages)
Running in Docker: When running semantic deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that CUDA GPUs are available. Without this flag, you will see RuntimeError: No CUDA GPUs are available. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.
Quick Start
Get started by running the end-to-end workflow, which identifies semantic duplicates and removes them in a single step.
Configuration
Configure semantic deduplication using these key parameters:
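As a sketch, the key parameters can be grouped as below. The names come from this page (cache_path, output_path, id_column, n_clusters, eps, embedding_model_inference_batch_size, perform_removal); exact spellings and defaults may differ between NeMo Curator versions, so treat this as illustrative rather than a verbatim API call.

```python
# Illustrative parameter set for semantic deduplication; the names are
# taken from this page, not guaranteed to match a specific API version.
semdedup_params = {
    "cache_path": "/data/semdedup_cache",         # intermediate embeddings and cluster files
    "output_path": "/data/semdedup_output",       # duplicates/ and deduplicated/ outputs
    "id_column": "id",                            # stable document identifier column
    "n_clusters": 1000,                           # K-means cluster count; scale with dataset size
    "eps": 0.01,                                  # duplicate threshold: cosine sim >= 1 - eps
    "embedding_model_inference_batch_size": 128,  # lower this if GPU memory is tight
    "perform_removal": True,                      # also write the deduplicated dataset
}

for key, value in semdedup_params.items():
    print(f"{key} = {value}")
```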
Step-by-Step Workflow
For fine-grained control, break semantic deduplication into separate stages:
This approach enables analysis of intermediate results and parameter tuning.
Comparison with Other Deduplication Methods
Compare semantic deduplication with other methods:

| Method | What it matches | How it works |
|---|---|---|
| Exact | Byte-identical documents | Hashing of document text |
| Fuzzy | Near-identical text | MinHash signatures with locality-sensitive hashing |
| Semantic | Similar meaning, even with different wording | Embeddings, clustering, and cosine similarity |
Key Parameters
Similarity Threshold
Control deduplication aggressiveness with eps:
- Lower values (e.g., 0.001): stricter matching, less deduplication, higher confidence that flagged pairs are true duplicates
- Higher values (e.g., 0.1): looser matching, more aggressive deduplication
Experiment with different values to balance data reduction and dataset diversity.
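To see how eps maps onto cosine similarity, here is a small NumPy illustration with made-up pair similarities; a pair counts as a duplicate when its similarity is at least 1 - eps.

```python
import numpy as np

# Made-up cosine similarities for three candidate pairs in one cluster
pair_sims = np.array([0.9995, 0.9920, 0.9100])

for eps in (0.001, 0.1):
    flagged = int(np.sum(pair_sims >= 1 - eps))
    print(f"eps={eps}: {flagged}/{len(pair_sims)} pairs flagged")
# eps=0.001 flags only the 0.9995 pair; eps=0.1 flags all three
```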
Embedding Models
Sentence Transformers (recommended for text): models from the sentence-transformers family are trained to produce embeddings that compare meaningfully under cosine similarity.
HuggingFace Models: other Hugging Face encoder models can also be used, provided each document is reduced to a single embedding vector.
When choosing a model:
- Ensure compatibility with your data type
- Adjust embedding_model_inference_batch_size for memory requirements
- Choose models appropriate for your language or domain
- Avoid generic decoder-only LLMs (e.g., OPT/GPT) for embeddings; prefer models trained for sentence embeddings (e.g., E5/BGE/SBERT)
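As background on what these models compute: SBERT/E5-style encoders typically mean-pool the transformer's token states (ignoring padding) and L2-normalize the result into one vector per document. A minimal NumPy sketch with random stand-in hidden states:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a transformer's final hidden states: 7 tokens, 16 dims,
# with an attention mask marking 5 real tokens and 2 padding tokens.
hidden_states = rng.normal(size=(7, 16))
attention_mask = np.array([1, 1, 1, 1, 1, 0, 0], dtype=float)

# Mean pooling over non-padding tokens, then L2 normalization: the usual
# recipe for turning token states into a single sentence embedding.
summed = (hidden_states * attention_mask[:, None]).sum(axis=0)
embedding = summed / attention_mask.sum()
embedding /= np.linalg.norm(embedding)

print(embedding.shape)  # one unit-length vector per document
```

Because the vectors are unit length, cosine similarity reduces to a dot product, which is what the clustering and pairwise-similarity stages rely on.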
Output Format
The semantic deduplication process produces the following directory structure in your configured cache_path:
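Reconstructed from the file descriptions in this section, the layout looks roughly like this (illustrative; exact names depend on your configuration):

```text
cache_path/
├── embeddings/                     # *.parquet: [id_column, embedding_column]
└── semantic_dedup/
    └── kmeans_results/
        ├── kmeans_centroids.npy
        └── embs_by_nearest_center/ # *.parquet: [id_column, embedding_column, cluster_id]
output_path/
├── duplicates/                     # *.parquet: ["id"]
└── deduplicated/                   # only when perform_removal=True
```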
File Formats
The workflow produces these output files:
- Document Embeddings (embeddings/*.parquet):
  - Contains document IDs and their vector embeddings
  - Format: Parquet files with columns [id_column, embedding_column]
- Cluster Assignments (semantic_dedup/kmeans_results/):
  - kmeans_centroids.npy: NumPy array of cluster centers
  - embs_by_nearest_center/: Parquet files containing cluster members
  - Format: Parquet files with columns [id_column, embedding_column, cluster_id]
- Duplicate IDs (output_path/duplicates/*.parquet):
  - IDs of documents identified as duplicates for removal
  - Format: Parquet file with columns ["id"]
  - Important: contains only the IDs of documents to remove, not the full document content
- When perform_removal=True, the clean dataset is saved to output_path/deduplicated/
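Applying the duplicates output is a simple anti-join on IDs. A minimal pure-Python sketch (the duplicate IDs would normally be read from the duplicates Parquet output; here they are inlined for illustration):

```python
# Keep only rows whose ID is not in the duplicates list.
documents = [
    {"id": "doc-0", "text": "GPUs accelerate deep learning."},
    {"id": "doc-1", "text": "Deep learning is accelerated by GPUs."},
    {"id": "doc-2", "text": "Bread recipes for beginners."},
]
duplicate_ids = {"doc-1"}  # stand-in for the "id" column of the duplicates output

deduplicated = [d for d in documents if d["id"] not in duplicate_ids]
print([d["id"] for d in deduplicated])  # → ['doc-0', 'doc-2']
```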
Performance Considerations
Performance characteristics:
- Computationally intensive, especially for large datasets
- GPU acceleration required for embedding generation and clustering
- Benefits often outweigh upfront cost (reduced training time, improved model performance)
GPU requirements:
- NVIDIA GPU with CUDA support
- Sufficient GPU memory (recommended: >8GB for medium datasets)
- RAPIDS libraries (cuDF) for GPU-accelerated dataframe operations
- CPU-only processing not supported
Performance tuning:
- Adjust n_clusters based on dataset size and available resources
- Use batched cosine similarity to reduce memory requirements
- Consider distributed processing for very large datasets
For more details, see the SemDeDup paper by Abbas et al.
Advanced Configuration
ID Generator for large-scale operations:
Critical requirements:
- Use the same input configuration (file paths, partitioning) across all stages
- ID consistency maintained by hashing filenames in each task
- Mismatched partitioning causes ID lookup failures
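A minimal sketch of filename-based stable IDs, as described above (illustrative; stable_id is a hypothetical helper, not the NeMo Curator ID Generator API): the same file partition always yields the same IDs, so lookups only break if the partitioning changes between stages.

```python
import hashlib

def stable_id(filename: str, row_index: int) -> str:
    """Derive a deterministic document ID from the source filename and row.

    Hypothetical helper: hashing the filename means re-running any stage
    over the same partitioning reproduces identical IDs.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()[:12]
    return f"{digest}-{row_index}"

a = stable_id("shard_00042.jsonl", 7)
b = stable_id("shard_00042.jsonl", 7)  # same file, same row -> same ID
c = stable_id("shard_00043.jsonl", 7)  # different file -> different ID
print(a == b, a == c)  # → True False
```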
Ray backend configuration:
Provides distributed processing, memory management, and fault tolerance.