Semantic Deduplication

Detect and remove semantically redundant data from your large text datasets using NeMo Curator.

Unlike exact or fuzzy deduplication, which focus on textual similarity, semantic deduplication leverages the meaning of content to identify duplicates. This approach can significantly reduce dataset size while maintaining or even improving model performance.

The technique uses embeddings to identify “semantic duplicates”: content pairs that convey similar meaning despite using different words.

GPU Acceleration: Semantic deduplication requires GPU acceleration for both embedding generation and clustering operations. This method uses cuDF for GPU-accelerated dataframe operations and PyTorch models on GPU for optimal performance.

How It Works

Semantic deduplication identifies meaning-based duplicates using embeddings:

  1. Generates embeddings for each document using transformer models
  2. Clusters embeddings using K-means
  3. Computes pairwise cosine similarities within clusters
  4. Identifies semantic duplicates based on similarity threshold
  5. Removes duplicates, keeping one representative per group
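The steps above can be sketched end-to-end in plain NumPy. This is an illustrative toy, not the NeMo Curator implementation: a minimal CPU k-means stands in for the GPU-accelerated clustering, and the embeddings are assumed to come from a transformer model.

```python
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    # Minimal k-means: random init, then alternate assign/update steps
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(axis=0)
    return labels

def semantic_dedup(embeddings, n_clusters=2, eps=0.07):
    # Step 1 (embedding generation) is assumed done; normalize so dot product = cosine similarity
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = kmeans_labels(embs, n_clusters)       # Step 2: cluster embeddings
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        sims = embs[idx] @ embs[idx].T             # Step 3: pairwise cosine similarities in cluster
        removed = set()
        for i in range(len(idx)):
            if i in removed:
                continue
            keep.append(int(idx[i]))               # Step 5: keep one representative per group
            for j in range(i + 1, len(idx)):
                if 1.0 - sims[i, j] < eps:         # Step 4: cosine distance below eps => duplicate
                    removed.add(j)
    return sorted(keep)

# Two near-identical vectors and one distinct vector
X = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
```

Running `semantic_dedup(X)` keeps one of the two near-duplicates plus the distinct vector; lowering `eps` far enough keeps all three.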

Before You Start

Prerequisites:

  • GPU acceleration (required for embedding generation and clustering)
  • Stable document identifiers for removal (either existing IDs or IDs managed by the workflow and removal stages)

Running in Docker: When running semantic deduplication inside the NeMo Curator container, ensure the container is started with --gpus all so that CUDA GPUs are available. Without this flag, you will see RuntimeError: No CUDA GPUs are available. Also activate the virtual environment with source /opt/venv/env.sh after entering the container.

Quick Start

Get started with semantic deduplication using the following example, which identifies duplicates and removes them in one step:

from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

# Default: uses vLLM with google/embeddinggemma-300m
workflow = TextSemanticDeduplicationWorkflow(
    input_path="input_data/",
    output_path="./results",
    cache_path="./sem_cache",
    n_clusters=100,
    eps=0.07,  # Similarity threshold
    id_field="doc_id",
    perform_removal=True,  # Complete deduplication
)

results = workflow.run()
# Clean dataset saved to ./results/deduplicated/

Configuration

For fine-grained control, break semantic deduplication into separate stages:

from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

# Placeholder paths; replace with your own locations
input_path = "input_data/"
embedding_output_path = "./embeddings"
semantic_workflow_path = "./sem_workflow"

# Step 1: Create ID generator
create_id_generator_actor()

# Step 2: Generate embeddings separately (using vLLM)
embedding_pipeline = Pipeline(
    name="embedding_pipeline",
    stages=[
        ParquetReader(file_paths=input_path, files_per_partition=1, fields=["text"], _generate_ids=True),
        # VLLMEmbeddingModelStage uses shorter parameter names than the workflow wrapper:
        # pretokenize (not embedding_pretokenize), vllm_init_kwargs (not embedding_vllm_init_kwargs),
        # max_chars (not embedding_max_chars), cache_dir (not model_cache_dir)
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
        ),
        ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"]),
    ],
)
embedding_out = embedding_pipeline.run()

# Step 3: Run clustering and pairwise similarity (without duplicate identification)
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=semantic_workflow_path,
    n_clusters=100,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
    eps=None,  # Skip duplicate identification for analysis
)
result = semantic_workflow.run()
# result.metadata contains: total_time, num_duplicates, kmeans_time, pairwise_time

# Step 4: Analyze similarity distribution to choose eps
# Step 5: Identify duplicates with chosen eps
# Step 6: Remove duplicates from original dataset

This approach enables analysis of intermediate results and parameter tuning.
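For example, Step 4 above (choosing `eps` from the similarity distribution) can be sketched on synthetic scores. This is a hedged illustration: the real scores live in the pairwise results written by the workflow, whereas here a toy array stands in for them.

```python
import numpy as np

# Synthetic max-similarity scores, standing in for the pairwise results output
rng = np.random.default_rng(0)
scores = np.concatenate([
    rng.uniform(0.2, 0.8, 900),   # mostly distinct documents
    rng.uniform(0.97, 1.0, 100),  # a pocket of near-duplicates
])

# Pick eps so that roughly the top 10% most-similar documents count as duplicates
eps = 1.0 - float(np.quantile(scores, 0.90))
frac_flagged = float((scores > 1.0 - eps).mean())
```

Inspecting quantiles like this lets you predict how much of the dataset a given `eps` will remove before running the (expensive) identification step.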

Comparison with Other Deduplication Methods

Compare semantic deduplication with other methods:

| Method | Return Value Options | perform_removal Parameter | Workflow |
|---|---|---|---|
| ExactDuplicates | Duplicates (ID list only) | ❌ Not supported (must remain False; use TextDuplicatesRemovalWorkflow) | Two-step (identification + removal workflow) |
| FuzzyDuplicates | Duplicates (ID list only) | ❌ Not supported (must remain False; use TextDuplicatesRemovalWorkflow) | Two-step (identification + removal workflow) |
| TextSemanticDeduplicationWorkflow | Duplicates or Clean Dataset | ✅ Available | One-step or two-step |

Key Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_identifier | str | "google/embeddinggemma-300m" | Pre-trained model for embedding generation (vLLM backend) |
| embedding_pretokenize | bool | False | Whether to pre-tokenize input before passing to vLLM |
| embedding_vllm_init_kwargs | dict | None | Additional keyword arguments passed to the vLLM LLM initializer |
| embedding_max_chars | int | None | Maximum number of characters for text truncation |
| model_cache_dir | str | None | Directory to cache model weights |
| n_clusters | int | 100 | Number of clusters for k-means clustering |
| kmeans_max_iter | int | 300 | Maximum iterations for clustering |
| eps | float | 0.01 | Threshold for deduplication (higher = more aggressive) |
| which_to_keep | str | "hard" | Strategy for keeping duplicates ("hard", "easy", or "random") |
| pairwise_batch_size | int | 1024 | Batch size for similarity computation |
| distance_metric | str | "cosine" | Distance metric for similarity ("cosine" or "l2") |
| perform_removal | bool | True | Whether to perform duplicate removal |
| text_field | str | "text" | Name of the text field in input data |
| id_field | str | "_curator_dedup_id" | Name of the ID field in the data |

Similarity Threshold

Control deduplication aggressiveness with eps:

  • Lower values (such as 0.001): More strict, less deduplication, higher confidence
  • Higher values (such as 0.1): Less strict, more aggressive deduplication

Experiment with different values to balance data reduction and dataset diversity.
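Concretely, `eps` is compared against cosine distance (1 minus cosine similarity). A small self-contained sketch (the example vectors are made up for illustration):

```python
import numpy as np

def cosine_distance(a, b):
    # eps is compared against this quantity: 1 - cosine similarity
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

near = cosine_distance([1.0, 0.0], [0.999, 0.01])  # paraphrase-like pair: tiny distance
far = cosine_distance([1.0, 0.0], [0.5, 0.9])      # unrelated pair: large distance
```

The near pair is flagged even at a strict `eps` like 0.001; the far pair survives even an aggressive `eps` like 0.1.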

Embedding generation uses vLLM as the inference backend. The default model is google/embeddinggemma-300m.

Default (vLLM):

workflow = TextSemanticDeduplicationWorkflow(
    # Uses google/embeddinggemma-300m by default
    input_path="input_data/",
    output_path="./results",
    cache_path="./sem_cache",
)

Custom model with vLLM options:

workflow = TextSemanticDeduplicationWorkflow(
    model_identifier="google/embeddinggemma-300m",
    embedding_pretokenize=True,
    embedding_vllm_init_kwargs={"enforce_eager": True, "max_model_len": 2048},
    # ... other parameters
)

vLLM Embedder (recommended for large models):

For large embedding models, you can generate embeddings separately using VLLMEmbeddingModelStage before running the deduplication workflow. This provides better GPU utilization and throughput for models with 500M+ parameters. See vLLM Embedder for details.

Generate embeddings with VLLMEmbeddingModelStage using the vLLM Embedder pipeline, then pass the output to SemanticDeduplicationWorkflow:

from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow

# After generating embeddings to embedding_output_path using VLLMEmbeddingModelStage
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=output_path,
    n_clusters=100,
    eps=0.07,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
)
semantic_workflow.run()

# Then filter the original text dataset using the IDs to remove
# See TextDuplicatesRemovalWorkflow for the removal step

When choosing a model:

  • Use models that support vLLM pooling (embedding) mode
  • Choose models appropriate for your language or domain
  • Prefer models trained for sentence embeddings (for example, EmbeddingGemma, E5, BGE, or SBERT)
  • Use embedding_pretokenize=True for models that benefit from explicit tokenization control
  • Pass additional vLLM configuration through embedding_vllm_init_kwargs
  • For more control over the embedding process, consider using VLLMEmbeddingModelStage separately
Full configuration example:

workflow = TextSemanticDeduplicationWorkflow(
    # I/O
    input_path="input_data/",
    output_path="results/",
    cache_path="semdedup_cache",

    # Embedding generation (vLLM backend)
    text_field="text",
    model_identifier="google/embeddinggemma-300m",
    embedding_pretokenize=False,
    embedding_max_chars=None,
    model_cache_dir=None,

    # Deduplication
    n_clusters=100,
    eps=0.01,  # Similarity threshold
    distance_metric="cosine",
    which_to_keep="hard",

    # K-means
    kmeans_max_iter=300,
    kmeans_tol=1e-4,
    pairwise_batch_size=1024,

    perform_removal=True,
)

Output Format

The semantic deduplication process produces the following directory structures in your configured cache_path and output_path:

cache_path/
├── embeddings/                         # Embedding outputs
│   └── *.parquet                       # Parquet files containing document embeddings
└── semantic_dedup/                     # Semantic deduplication cache
    ├── kmeans_results/                 # K-means clustering outputs
    │   ├── kmeans_centroids.npy        # Cluster centroids
    │   └── embs_by_nearest_center/     # Embeddings organized by cluster
    │       └── nearest_cent={0..n-1}/  # Subdirectories for each cluster
    │           └── *.parquet           # Cluster member embeddings
    └── pairwise_results/               # Pairwise similarity results
        └── *.parquet                   # Similarity scores by cluster

output_path/
├── duplicates/                         # Duplicate identification results
│   └── *.parquet                       # Document IDs to remove
└── deduplicated/                       # Final clean dataset (if perform_removal=True)
    └── *.parquet                       # Deduplicated documents

File Formats

The workflow produces these output files:

  1. Document Embeddings (embeddings/*.parquet):

    • Contains document IDs and their vector embeddings
    • Format: Parquet files with columns: [id_column, embedding_column]
  2. Cluster Assignments (semantic_dedup/kmeans_results/):

    • kmeans_centroids.npy: NumPy array of cluster centers
    • embs_by_nearest_center/: Parquet files containing cluster members
    • Format: Parquet files with columns: [id_column, embedding_column, cluster_id]
  3. Duplicate IDs (output_path/duplicates/*.parquet):

    • IDs of documents identified as duplicates for removal

    • Format: Parquet file with columns: ["id"]

    • Important: Contains only the IDs of documents to remove, not the full document content

    • When perform_removal=True, clean dataset is saved to output_path/deduplicated/
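Because the duplicates output contains only IDs, the removal step is effectively an anti-join against the original dataset. A minimal pandas sketch (the toy data and the doc_id column name are assumptions; in practice TextDuplicatesRemovalWorkflow handles this):

```python
import pandas as pd

# Toy stand-ins for the original dataset and the duplicates output
docs = pd.DataFrame({
    "doc_id": [1, 2, 3, 4],
    "text": ["the cat sat", "a cat was sitting", "hello world", "goodbye"],
})
ids_to_remove = pd.DataFrame({"id": [2]})  # same shape as the duplicates parquet: IDs only

# Keep every document whose ID is not flagged for removal
clean = docs[~docs["doc_id"].isin(ids_to_remove["id"])]
```

Here the near-duplicate with `doc_id == 2` is dropped and the other three documents survive.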

Performance characteristics:

  • Computationally intensive, especially for large datasets
  • GPU acceleration required for embedding generation and clustering
  • Benefits often outweigh upfront cost (reduced training time, improved model performance)

GPU requirements:

  • NVIDIA GPU with CUDA support
  • Sufficient GPU memory (recommended: >8GB for medium datasets)
  • RAPIDS libraries (cuDF) for GPU-accelerated dataframe operations
  • CPU-only processing not supported

Performance tuning:

  • Adjust n_clusters based on dataset size and available resources
  • Use batched cosine similarity to reduce memory requirements
  • Consider distributed processing for very large datasets

| Dataset Size | GPU Memory | Processing Time | Recommended GPUs |
|---|---|---|---|
| <100K docs | 4-8 GB | 1-2 hours | RTX 3080, A100 |
| 100K-1M docs | 8-16 GB | 2-8 hours | RTX 4090, A100 |
| >1M docs | >16 GB | 8+ hours | A100, H100 |
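When tuning n_clusters, one generic rule of thumb (an assumption, not official NeMo Curator guidance) is to scale it with dataset size, for example k ≈ √(N/2), clamped to a practical range:

```python
import math

def suggest_n_clusters(num_docs: int, floor: int = 100, cap: int = 50_000) -> int:
    # Rule of thumb: k ~ sqrt(N / 2), clamped so tiny or huge datasets stay practical
    return max(floor, min(cap, int(math.sqrt(num_docs / 2))))
```

For example, this suggests a few hundred clusters for ~100K documents and a few thousand for multi-million-document datasets; treat any such value as a starting point to refine empirically.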

For more details, see the SemDeDup paper by Abbas et al.

ID Generator for large-scale operations:

from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    write_id_generator_to_disk,
    kill_id_generator_actor,
)
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

create_id_generator_actor()
id_generator_path = "semantic_id_generator.json"
write_id_generator_to_disk(id_generator_path)
kill_id_generator_actor()

# Use persisted ID generator in removal workflow
# (input_path, duplicates_path, and output_path are defined elsewhere)
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_path,
    output_path=output_path,
    id_generator_path=id_generator_path,
    input_files_per_partition=1,  # Match the partitioning used during embedding generation
    # ... other parameters
)

Critical requirements:

  • Use the same input configuration (file paths, partitioning) across all stages
  • ID consistency maintained by hashing filenames in each task
  • Mismatched partitioning causes ID lookup failures
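The filename-hashing idea behind ID consistency can be illustrated with a toy sketch (the actual ID generator actor's scheme is internal to NeMo Curator; the function names here are made up):

```python
import hashlib

def base_id_for_file(filename: str) -> int:
    # The same filename always hashes to the same base offset,
    # so re-reading a file yields the same IDs across pipeline stages
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return int(digest[:12], 16)

def assign_ids(filename: str, num_rows: int) -> list[int]:
    base = base_id_for_file(filename)
    return [base + i for i in range(num_rows)]
```

This also shows why mismatched partitioning breaks lookups: a different file split changes which rows fall under which filename, so the hash-derived IDs no longer line up.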

Ray backend configuration:

from nemo_curator.core.client import RayClient

client = RayClient(
    num_cpus=64,  # Adjust based on available cores
    num_gpus=4,   # Adjust based on available GPUs
)
client.start()

try:
    workflow = TextSemanticDeduplicationWorkflow(
        input_path=input_path,
        output_path=output_path,
        cache_path=cache_path,
        # ... other parameters
    )
    result = workflow.run()
    # result.metadata contains: total_time, num_duplicates, num_duplicates_removed,
    # embedding_time, identification_time, removal_time, final_output_path
finally:
    client.stop()

Provides distributed processing, memory management, and fault tolerance.