Semantic Deduplication#

Detect and remove semantically redundant data from your large text datasets using NeMo Curator.

Unlike exact and fuzzy deduplication, which focus on textual similarity, semantic deduplication uses the meaning of content to identify duplicates. This approach can significantly reduce dataset size while maintaining or even improving model performance.

Semantic deduplication is particularly effective for large, uncurated web-scale datasets, where it can remove up to 50% of the data with minimal performance impact. The technique uses embeddings to identify “semantic duplicates” - content pairs that convey similar meaning despite using different words.

Note

GPU Acceleration: Semantic deduplication requires GPU acceleration for both embedding generation and clustering operations. This method uses the cudf backend and PyTorch models on GPU for optimal performance.

How It Works#

The SemDeDup algorithm consists of the following main steps (a minimal sketch follows the list):

  1. Embedding Generation: Each document is embedded using a pre-trained model

  2. Clustering: The embeddings are grouped into k clusters using k-means clustering

  3. Similarity Computation: Within each cluster, pairwise cosine similarities are computed

  4. Duplicate Identification: Document pairs with cosine similarity above a threshold are considered semantic duplicates

  5. Duplicate Removal: From each group of semantic duplicates within a cluster, one representative document is kept (typically the one with the lowest cosine similarity to the cluster centroid) and the rest are removed
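
The following is a minimal CPU sketch of these steps, intended only to illustrate the algorithm; it is not the NeMo Curator implementation, which runs on GPU with cuDF, cuML, and PyTorch. The model name, cluster count, and similarity threshold are illustrative, and sentence-transformers and scikit-learn are assumed to be installed.

# Minimal CPU sketch of the five steps above (illustrative only)
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

docs = [
    "NeMo Curator removes redundant documents.",
    "Redundant documents are removed by NeMo Curator.",
    "The weather is sunny today.",
]

# 1. Embedding generation (unit-normalized vectors)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(docs, normalize_embeddings=True)

# 2. Clustering (real datasets use thousands of clusters)
n_clusters = 2
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(emb)

# 3-5. Pairwise cosine similarity within each cluster; keep one representative
# from every group of near-identical documents and drop the rest
threshold = 0.9  # example similarity cutoff
to_remove = set()
for c in range(n_clusters):
    idx = np.where(labels == c)[0]
    sims = emb[idx] @ emb[idx].T  # cosine similarity, since vectors are unit norm
    for a in range(len(idx)):
        if int(idx[a]) in to_remove:
            continue  # already dropped as another document's duplicate
        for b in range(a + 1, len(idx)):
            if sims[a, b] >= threshold:
                to_remove.add(int(idx[b]))  # keep idx[a] as the representative

kept_docs = [d for i, d in enumerate(docs) if i not in to_remove]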

Note

NeMo Curator implements methods based on the paper SemDeDup: Data-efficient learning at web-scale through semantic deduplication by Abbas et al.

Before You Start#

Before running semantic deduplication, ensure that each document in your dataset has a unique identifier. You can use the AddId module from NeMo Curator if needed:

from nemo_curator import AddId
from nemo_curator.datasets import DocumentDataset

# Load the dataset and add a unique ID to each document
dataset = DocumentDataset.read_json("input_data/*.jsonl")
add_id = AddId(id_field="doc_id", id_field_type="int")
dataset_with_ids = add_id(dataset)

SemDedup Interface#

The SemDedup class provides a flexible interface similar to other deduplication modules in NeMo Curator:

Constructor Parameters#

  • config: SemDedupConfig object containing embedding and clustering settings

  • input_column: Column name containing text data (default: “text”)

  • id_column: Column name containing document IDs (default: “id”)

  • perform_removal: Boolean flag controlling return behavior (default: False)

  • logger: Logger instance or path to log directory
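
For reference, a construction that sets these parameters explicitly might look like the following sketch; the cache directory, column names, and log path are placeholders, and the remaining SemDedupConfig fields are left at their defaults.

from nemo_curator import SemDedup, SemDedupConfig

# Placeholder cache directory; other SemDedupConfig fields keep their defaults
config = SemDedupConfig(cache_dir="./sem_cache")

sem_dedup = SemDedup(
    config=config,
    input_column="text",     # column containing the document text
    id_column="doc_id",      # column containing unique document IDs
    perform_removal=False,   # return duplicate IDs rather than a cleaned dataset
    logger="./logs",         # logger instance or path to a log directory
)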

Usage Modes#

Mode 1: Two-step process (perform_removal=False)

# Step 1: Identify duplicates
duplicates = sem_dedup(dataset)
# Step 2: Remove duplicates manually
deduplicated_dataset = sem_dedup.remove(dataset, duplicates)

Mode 2: One-step process (perform_removal=True)

# Returns deduplicated dataset directly
deduplicated_dataset = sem_dedup(dataset)

Quick Start#

from nemo_curator import SemDedup, SemDedupConfig
from nemo_curator.datasets import DocumentDataset

# Load your dataset (requires cudf backend for GPU acceleration)
dataset = DocumentDataset.read_json("input_data/*.jsonl", backend="cudf")

# Configure semantic deduplication
config = SemDedupConfig(
    cache_dir="./sem_cache",
    embedding_model_name_or_path="sentence-transformers/all-MiniLM-L6-v2",
    n_clusters=10000,
    eps_to_extract=0.07  # Similarity threshold
)

# Option 1: Two-step process (more control)
sem_dedup = SemDedup(
    config=config,
    id_column="doc_id",
    perform_removal=False  # Returns duplicate IDs
)
duplicates = sem_dedup(dataset)
deduplicated_dataset = sem_dedup.remove(dataset, duplicates)

# Option 2: One-step process (simpler)
sem_dedup_simple = SemDedup(
    config=config,
    id_column="doc_id", 
    perform_removal=True  # Returns deduplicated dataset directly
)
deduplicated_dataset = sem_dedup_simple(dataset)

Configuration#

Semantic deduplication in NeMo Curator can be configured using a YAML file. Here’s an example sem_dedup_config.yaml:

# Configuration file for semantic dedup
cache_dir: "semdedup_cache"
num_files: -1
profile_dir: null  # Optional directory for Dask profiling

# Embeddings configuration
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embeddings_save_loc: "embeddings"
embedding_max_mem_gb: null  # Auto-detected: GPU memory - 4GB
embedding_pooling_strategy: "mean_pooling"
embedding_column: "embeddings"
write_embeddings_to_disk: true
write_to_filename: false

# Clustering configuration
max_iter: 100
n_clusters: 1000
clustering_save_loc: "clustering_results"
random_state: 1234
sim_metric: "cosine"
which_to_keep: "hard"
batched_cosine_similarity: 1024
sort_clusters: true
kmeans_with_cos_dist: false
clustering_input_partition_size: "2gb"

# Extract dedup configuration
eps_thresholds: [0.01, 0.001]  # List of thresholds to compute
eps_to_extract: 0.01

You can customize this configuration file to suit your specific needs and dataset characteristics.

Note

Configuration Parameters: The above configuration shows the most commonly used parameters. For advanced use cases, additional parameters like profile_dir (for Dask profiling), embedding_max_mem_gb (to control GPU memory usage), and clustering optimization parameters are available. See the complete parameter table below for all options.

Embedding Models#

You can choose alternative pre-trained models for embedding generation by modifying the embedding_model_name_or_path parameter in the configuration file.

Sentence transformers are ideal for text-based semantic similarity tasks.

# Default: a sentence transformer tuned for semantic similarity
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"

# Alternative: any Hugging Face model, for example
embedding_model_name_or_path: "facebook/opt-125m"

You can also use your own pre-trained custom models by specifying the path.
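
For example, a sketch of pointing the configuration at a local model and adjusting the batch size (both values below are placeholders):

config = SemDedupConfig(
    cache_dir="./sem_cache",
    embedding_model_name_or_path="/path/to/your/custom/model",  # placeholder local model directory
    embedding_batch_size=64,  # smaller batches for larger models or limited GPU memory
)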

When changing the model, ensure that:

  1. The model is compatible with the data type you’re working with

  2. You adjust the embedding_batch_size parameter for your model’s memory requirements

  3. The chosen model is appropriate for the language or domain of your dataset

Deduplication Threshold#

The semantic deduplication process is controlled by the similarity threshold parameter:

eps_to_extract: 0.01

eps_to_extract: The similarity threshold used for extracting deduplicated data. It determines how similar two documents must be before they are treated as duplicates: lower values are stricter, requiring higher similarity between documents.

When choosing an appropriate threshold:

  • Lower thresholds (for example, 0.001): More strict, resulting in less deduplication but higher confidence in the identified duplicates

  • Higher thresholds (for example, 0.1): Less strict, leading to more aggressive deduplication but potentially removing documents that are only somewhat similar

We recommend experimenting with different threshold values to find the optimal balance between data reduction and maintaining dataset diversity and quality. The impact of this threshold can vary depending on the nature and size of your dataset.
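
As a rough illustration of how the threshold behaves, assuming eps is applied as a distance-style cutoff, so that a pair counts as duplicates when its cosine similarity exceeds 1 - eps (consistent with lower eps values being stricter):

# Illustration only: treats eps as a distance cutoff (1 - cosine similarity)
for eps in (0.001, 0.01, 0.07, 0.1):
    min_similarity = 1.0 - eps
    print(f"eps={eps}: pairs with cosine similarity above {min_similarity:.3f} are flagged as duplicates")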

Usage#

You can use the SemDedup class to perform all steps:

from nemo_curator import SemDedup, SemDedupConfig
import yaml

# Load configuration from YAML file
with open("sem_dedup_config.yaml", "r") as config_file:
    config_dict = yaml.safe_load(config_file)

# Create SemDedupConfig object
config = SemDedupConfig(**config_dict)

# Initialize SemDedup with the configuration
sem_dedup = SemDedup(
    config=config,
    input_column="text",
    id_column="doc_id",
    perform_removal=False,  # Two-step process
    logger="path/to/log/dir",
)

# Two-step semantic deduplication process
# Step 1: Identify duplicates (returns duplicate IDs)
duplicates = sem_dedup(dataset)

# Step 2: Remove duplicates from original dataset
deduplicated_dataset = sem_dedup.remove(dataset, duplicates)

# Alternative: One-step process
# sem_dedup_onestep = SemDedup(config=config, perform_removal=True)
# deduplicated_dataset = sem_dedup_onestep(dataset)

This approach allows for easy experimentation with different configurations and models without changing the core code.

Tip

Flexible Interface: The SemDedup class supports both one-step and two-step workflows:

  • Use perform_removal=True for direct deduplication (returns clean dataset)

  • Use perform_removal=False for manual control over the removal process (returns duplicate IDs, then call .remove())

This interface matches the behavior of other deduplication modules in NeMo Curator.

Alternatively, you can run the individual pipeline components step by step.

Embedding Creation:

from nemo_curator import EmbeddingCreator

# Generate embeddings for each document
embedding_creator = EmbeddingCreator(
    embedding_model_name_or_path="path/to/pretrained/model",
    embedding_batch_size=128,
    embedding_output_dir="path/to/output/embeddings",
    input_column="text",
    logger="path/to/log/dir",
)
embeddings_dataset = embedding_creator(dataset)

Clustering:

from nemo_curator import ClusteringModel

# Cluster the embeddings
clustering_model = ClusteringModel(
    id_column="doc_id",
    max_iter=100,
    n_clusters=50000,
    clustering_output_dir="path/to/output/clusters",
    logger="path/to/log/dir"
)
clustered_dataset = clustering_model(embeddings_dataset)

Semantic Deduplication:

from nemo_curator import SemanticClusterLevelDedup

# Perform semantic deduplication
semantic_dedup = SemanticClusterLevelDedup(
    n_clusters=50000,
    emb_by_clust_dir="path/to/embeddings/by/cluster",
    id_column="doc_id",
    which_to_keep="hard",
    batched_cosine_similarity=1024,
    output_dir="path/to/output/deduped",
    logger="path/to/log/dir"
)
semantic_dedup.compute_semantic_match_dfs()
# Returns dataset containing unique document IDs after deduplication
unique_document_ids = semantic_dedup.extract_dedup_data(eps_to_extract=0.07)

# Note: When using individual components, you need to filter manually
# The SemDedup class handles this filtering automatically when perform_removal=True
kept_ids = unique_document_ids.df["doc_id"].compute()  
deduplicated_dataset = original_dataset.df[original_dataset.df["doc_id"].isin(kept_ids)]

Comparison with Other Deduplication Methods#

Table 9 Deduplication Method Behavior Comparison#

| Method | Return Value Options | perform_removal Parameter | Workflow |
| --- | --- | --- | --- |
| ExactDuplicates | Duplicates or Clean Dataset | ✅ Available | One-step or two-step |
| FuzzyDuplicates | Duplicates or Clean Dataset | ✅ Available | One-step or two-step |
| SemDedup | Duplicates or Clean Dataset | ✅ Available | One-step or two-step |

Key Parameters#

Table 10 Key Configuration Parameters#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| embedding_model_name_or_path | str | "sentence-transformers/all-MiniLM-L6-v2" | Pre-trained model for embedding generation |
| embedding_batch_size | int | 128 | Number of samples per embedding batch |
| embedding_max_mem_gb | int | null | Maximum GPU memory for embeddings (auto-detected if null) |
| n_clusters | int | 1000 | Number of clusters for k-means clustering |
| max_iter | int | 100 | Maximum iterations for clustering |
| eps_to_extract | float | 0.01 | Threshold for deduplication (higher = more aggressive) |
| eps_thresholds | list | [0.01, 0.001] | List of similarity thresholds to compute |
| which_to_keep | str | "hard" | Strategy for keeping duplicates ("hard"/"easy"/"random") |
| batched_cosine_similarity | int | 1024 | Batch size for similarity computation |
| clustering_input_partition_size | str | "2gb" | Size of each data partition for k-means |
| sort_clusters | bool | true | Whether to sort clusters during processing |
| kmeans_with_cos_dist | bool | false | Whether to use cosine distance for k-means |
| write_embeddings_to_disk | bool | true | Whether to save embeddings to disk |
| write_to_filename | bool | false | Whether to save embeddings under the same filename as the input |

Output Format#

The semantic deduplication process produces the following directory structure in your configured cache_dir:

cache_dir/
├── embeddings/                          # Embedding outputs
│   └── *.parquet                        # Parquet files containing document embeddings
├── clustering_results/                  # Clustering outputs
│   ├── kmeans_centroids.npy             # Cluster centroids
│   ├── embs_by_nearest_center/          # Embeddings organized by cluster
│   │   └── nearest_cent={0..n-1}/       # Subdirectories for each cluster
│   │       └── *.parquet                # Cluster member embeddings
│   └── unique_ids_{eps}.parquet         # Final deduplicated document IDs
└── *.log                                # Process logs

File Formats#

  1. Document Embeddings (embeddings/*.parquet):

    • Contains document IDs and their vector embeddings

    • Format: Parquet files with columns: [id_column, embedding_column]

  2. Cluster Assignments (clustering_results/):

    • kmeans_centroids.npy: NumPy array of cluster centers

    • embs_by_nearest_center/: Parquet files containing cluster members

    • Format: Parquet files with columns: [id_column, embedding_column, cluster_id]

  3. Deduplicated Results (clustering_results/unique_ids_{eps}.parquet):

    • Final output containing unique document IDs after deduplication

    • One file per deduplication threshold (eps) from eps_thresholds

    • Format: Parquet file with columns: [id_column, "dist", "cluster"]

    • Important: Contains only the IDs of documents to keep, not the full document content

    • Use these IDs to filter your original dataset to obtain the deduplicated content
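
As a sketch of that final filtering step, assuming a cuDF-backed dataset, the doc_id column used earlier, and a threshold of 0.01 (the exact filename depends on your cache_dir and configured thresholds):

import dask_cudf

# Load the IDs kept at eps = 0.01 (filename pattern from the structure above)
unique_ids = dask_cudf.read_parquet("sem_cache/clustering_results/unique_ids_0.01.parquet")
kept_ids = unique_ids["doc_id"].compute()

# Keep only the surviving documents in the original dataset loaded earlier
deduplicated_df = dataset.df[dataset.df["doc_id"].isin(kept_ids)]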

Typically, semantic deduplication reduces dataset size by 20–50% while maintaining or improving model performance.

Performance Considerations#

Semantic deduplication is computationally intensive, especially for large datasets. However, the benefits in terms of reduced training time and improved model performance often outweigh the upfront cost:

  • Use GPU acceleration for faster embedding generation and clustering

  • Adjust the number of clusters (n_clusters) based on your dataset size and available resources

  • The eps_to_extract parameter controls the trade-off between dataset size reduction and potential information loss

  • Using batched cosine similarity significantly reduces memory requirements for large datasets
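
For intuition, here is a minimal PyTorch sketch of the batching idea (not NeMo Curator's internal implementation): the similarity matrix is computed one row block at a time, so only a batch_size × N slice is ever resident in memory.

import torch

def max_pairwise_similarity(embeddings: torch.Tensor, batch_size: int = 1024) -> torch.Tensor:
    """Return each document's highest cosine similarity to any other document,
    computing the similarity matrix in row blocks to bound peak memory."""
    emb = torch.nn.functional.normalize(embeddings, dim=1)
    n = emb.shape[0]
    max_sim = torch.full((n,), -1.0, device=emb.device)
    for start in range(0, n, batch_size):
        block = emb[start:start + batch_size] @ emb.T   # (batch, n) slice of the full matrix
        rows = torch.arange(block.shape[0], device=emb.device)
        block[rows, rows + start] = -1.0                # mask self-similarity on the diagonal
        max_sim[start:start + block.shape[0]] = block.max(dim=1).values
    return max_sim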

GPU Requirements#

Hardware Prerequisites:

  • NVIDIA GPU with CUDA support

  • Sufficient GPU memory (recommended: >8GB for medium datasets)

  • RAPIDS libraries (cuDF, cuML) for GPU-accelerated operations

Backend Requirements:

  • Required: cudf backend for GPU acceleration

  • Not supported: CPU-only processing (use hash-based deduplication instead)

Performance Characteristics:

  • Embedding Generation: GPU-accelerated using PyTorch models

  • Clustering: GPU-accelerated k-means clustering

  • Similarity Computation: Batched GPU operations for cosine similarity

Table 11 Performance Scaling#

| Dataset Size | GPU Memory Required | Processing Time | Recommended GPUs |
| --- | --- | --- | --- |
| <100K docs | 4-8 GB | 1-2 hours | RTX 3080, A100 |
| 100K-1M docs | 8-16 GB | 2-8 hours | RTX 4090, A100 |
| >1M docs | 16+ GB | 8+ hours | A100, H100 |

For very large datasets, consider distributed processing across multiple GPUs or use incremental processing approaches.
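
For example, one way to spread the Dask workload across all GPUs on a single node is Dask-CUDA; this is a sketch, and cluster settings depend on your environment:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Start one Dask worker per visible GPU on this node
cluster = LocalCUDACluster()
client = Client(cluster)

# Subsequent operations on cudf-backed datasets are distributed across this cluster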

For more details on the algorithm and its performance implications, refer to the original paper: SemDeDup: Data-efficient learning at web-scale through semantic deduplication by Abbas et al.