Detect and remove semantically redundant data from your large text datasets using NeMo Curator.
Unlike exact or fuzzy deduplication, which focus on textual similarity, semantic deduplication leverages the meaning of content to identify duplicates. This approach can significantly reduce dataset size while maintaining or even improving model performance.
The technique uses embeddings to identify “semantic duplicates”: content pairs that convey similar meaning despite using different words.
GPU Acceleration: Semantic deduplication requires GPU acceleration for both embedding generation and clustering operations. This method uses cuDF for GPU-accelerated dataframe operations and PyTorch models on GPU for optimal performance.
Semantic deduplication identifies meaning-based duplicates using embeddings:
Based on SemDeDup: Data-efficient learning at web-scale through semantic deduplication by Abbas et al.
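The idea can be illustrated with a minimal, pure-Python sketch. It is not NeMo Curator's implementation: the function names, the toy 2-D embeddings, and the fixed centroids are all hypothetical. The greedy "keep the first document seen" rule is a simplification of the selection strategy in the SemDeDup paper. Documents are assigned to their nearest centroid, and within each cluster a document is flagged when its cosine similarity to an already-kept document exceeds 1 - eps.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_centroid(vec, centroids):
    """Index of the centroid most similar to vec."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


def semantic_duplicates(embeddings, centroids, eps):
    """Return the IDs to drop.

    Only documents in the same cluster are compared with each other.
    A document is dropped when it is within eps (in cosine distance)
    of a document that was already kept in that cluster.
    """
    clusters = defaultdict(list)
    for doc_id, vec in embeddings.items():
        clusters[nearest_centroid(vec, centroids)].append(doc_id)

    to_remove = set()
    for members in clusters.values():
        kept = []
        for doc_id in members:
            if any(cosine(embeddings[doc_id], embeddings[k]) > 1.0 - eps for k in kept):
                to_remove.add(doc_id)
            else:
                kept.append(doc_id)
    return to_remove


# Toy data (hypothetical): "a" and "b" point in almost the same direction.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.999, 0.04],
    "c": [0.0, 1.0],
    "d": [0.6, 0.8],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]  # assumed to come from a prior k-means run

duplicates = semantic_duplicates(embeddings, centroids, eps=0.01)
print(duplicates)
```

Restricting comparisons to documents in the same cluster is what keeps the pairwise step tractable on large datasets.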
Prerequisites:
Running in Docker: When running semantic deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that CUDA GPUs are available. Without this flag, you will see RuntimeError: No CUDA GPUs are available. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.
Get started with semantic deduplication using the following example, which identifies duplicates and removes them in one step:
Configure semantic deduplication using these key parameters:
For fine-grained control, break semantic deduplication into separate stages:
This approach enables analysis of intermediate results and parameter tuning.
Compare semantic deduplication with other methods:
Control deduplication aggressiveness with eps:
Experiment with different values to balance data reduction and dataset diversity.
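To make the effect of eps concrete, here is a small sketch that assumes the SemDeDup convention: two items count as duplicates when their cosine similarity exceeds 1 - eps. The helper names and the toy vectors are hypothetical, and the greedy keep-first rule is a simplification. The sketch does not call NeMo Curator.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def count_removed(vectors, eps):
    """Count vectors dropped because they are within eps of a kept vector."""
    kept = []
    removed = 0
    for vec in vectors:
        if any(cosine(vec, k) > 1.0 - eps for k in kept):
            removed += 1
        else:
            kept.append(vec)
    return removed


# Toy embeddings (hypothetical) with decreasing similarity to the first one.
vectors = [
    [1.0, 0.0],
    [0.999, 0.04],  # very close to the first vector
    [0.95, 0.31],   # moderately close to the first vector
    [0.0, 1.0],     # unrelated
]

# A larger eps lowers the similarity threshold, so more items are removed.
removed_by_eps = {eps: count_removed(vectors, eps) for eps in (0.0001, 0.01, 0.1)}
print(removed_by_eps)
```

On this toy input the removed count grows as eps grows, which is the trade-off between data reduction and diversity described above.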
Embedding generation uses vLLM as the inference backend. The default model is google/embeddinggemma-300m.
Default (vLLM):
Custom model with vLLM options:
vLLM Embedder (recommended for large models):
For large embedding models, you can generate embeddings separately using VLLMEmbeddingModelStage before running the deduplication workflow. This provides better GPU utilization and throughput for models with 500M+ parameters. See vLLM Embedder for details.
Generate embeddings with VLLMEmbeddingModelStage using the vLLM Embedder pipeline, then pass the output to SemanticDeduplicationWorkflow:
When choosing a model:
Set embedding_pretokenize=True for models that benefit from explicit tokenization control
Pass custom vLLM options with embedding_vllm_init_kwargs
The semantic deduplication process produces the following directory structure in your configured cache_path:
The workflow produces these output files:
Document Embeddings (embeddings/*.parquet):
Columns: [id_column, embedding_column]
Cluster Assignments (semantic_dedup/kmeans_results/):
kmeans_centroids.npy: NumPy array of cluster centers
embs_by_nearest_center/: Parquet files containing cluster members
Columns: [id_column, embedding_column, cluster_id]
Duplicate IDs (output_path/duplicates/*.parquet):
IDs of documents identified as duplicates for removal
Format: Parquet file with columns: ["id"]
Important: Contains only the IDs of documents to remove, not the full document content
When perform_removal=True, the clean dataset is saved to output_path/deduplicated/.
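Because the duplicates output holds only IDs, removing them by hand amounts to an anti-join against the original documents. The sketch below shows that logic on in-memory records. The remove_duplicates helper and the sample records are hypothetical. In practice the IDs would be loaded from the Parquet files under output_path/duplicates/ with a Parquet reader, which is omitted here.

```python
def remove_duplicates(documents, duplicate_ids, id_field="id"):
    """Keep only the documents whose ID is not in the duplicate list."""
    drop = set(duplicate_ids)  # set lookup keeps the filter linear in dataset size
    return [doc for doc in documents if doc[id_field] not in drop]


# Hypothetical records standing in for the original dataset.
documents = [
    {"id": 1, "text": "The cat sat on the mat."},
    {"id": 2, "text": "A cat was sitting on the mat."},
    {"id": 3, "text": "Stock markets closed higher today."},
]

# Hypothetical contents of the "id" column in the duplicates output.
duplicate_ids = [2]

clean = remove_duplicates(documents, duplicate_ids)
print([doc["id"] for doc in clean])
```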
Performance characteristics:
GPU requirements:
Performance tuning:
Adjust n_clusters based on dataset size and available resources
For more details, see the SemDeDup paper by Abbas et al.
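As a rough guide to why n_clusters matters, the sketch below estimates how many pairwise similarity computations remain once comparisons are restricted to documents in the same cluster. It assumes equally sized clusters, which real k-means output will not match exactly. The pairwise_comparisons helper is hypothetical and not part of NeMo Curator.

```python
def pairwise_comparisons(n_docs, n_clusters):
    """Approximate within-cluster pair count, assuming equal-size clusters."""
    per_cluster = n_docs / n_clusters
    return n_clusters * per_cluster * (per_cluster - 1) / 2


n_docs = 1_000_000
estimates = {k: pairwise_comparisons(n_docs, k) for k in (1, 100, 1000)}
for k, pairs in estimates.items():
    print(f"n_clusters={k}: ~{pairs:.3e} pairwise comparisons")
```

More clusters mean fewer comparisons per cluster, but near-duplicates that land in different clusters are never compared, so the value is a trade-off rather than something to maximize.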
ID Generator for large-scale operations:
Critical requirements:
Ray backend configuration:
Provides distributed processing, memory management, and fault tolerance.