stages.text.deduplication.semantic#

Monolithic Text Semantic Deduplication Workflow.

This module contains a complete end-to-end workflow for text semantic deduplication:

  1. Embedding generation from text data

  2. Semantic deduplication using clustering and pairwise similarity

  3. Optional duplicate removal based on identified duplicates

Module Contents#

Classes#

TextSemanticDeduplicationWorkflow

Monolithic workflow for end-to-end text semantic deduplication.

API#

class stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow#

Monolithic workflow for end-to-end text semantic deduplication.

This workflow combines:

  1. Text embedding generation (configurable executor)

  2. Semantic deduplication (configurable executor for pairwise stage)

  3. Duplicate removal (configurable executor)

Supports flexible executor configuration: a single executor can be used for all stages, or different executors can be assigned to different phases.
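A minimal quick-start sketch, assuming the class is importable as `nemo_curator.stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow` (the package prefix, the paths, and the default-executor behavior noted in the comments are assumptions, not confirmed by this page):

```python
from nemo_curator.stages.text.deduplication.semantic import (
    TextSemanticDeduplicationWorkflow,
)

# Configure the three phases: embedding generation, semantic dedup, removal.
workflow = TextSemanticDeduplicationWorkflow(
    input_path="/data/corpus",       # placeholder input location
    output_path="/data/deduped",     # placeholder output location
    cache_path="/data/dedup_cache",  # placeholder cache for intermediate results
)

# Run end to end. Passing no executor relies on whatever default the
# implementation provides (an assumption; see run() below for explicit executors).
results = workflow.run()
```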

cache_kwargs: dict[str, Any]#

‘field(…)’

cache_path: str | None#

None

clear_output: bool#

True

Initialize the text semantic deduplication workflow.

Args:

input_path: Path(s) to input files containing text data
output_path: Directory to write the deduplicated output (or the IDs to remove)
cache_path: Directory to cache intermediate results (embeddings, kmeans, pairwise, etc.)
perform_removal: Whether to perform duplicate removal (True) or just identify duplicates (False)

# Embedding generation parameters
text_field: Name of the text field in input data
embedding_field: Name of the embedding field to create
model_identifier: HuggingFace model identifier for embeddings
embedding_max_seq_length: Maximum sequence length for tokenization
embedding_max_chars: Maximum number of characters for tokenization
embedding_padding_side: Padding side for tokenization
embedding_pooling: Pooling strategy for embeddings
embedding_model_inference_batch_size: Batch size for model inference
hf_token: HuggingFace token for private models

# Semantic deduplication parameters
n_clusters: Number of clusters for K-means
id_field: Name of the ID field in the data
embedding_dim: Embedding dimension (for memory estimation)
metadata_fields: List of metadata field names to preserve
distance_metric: Distance metric for similarity ("cosine" or "l2")
which_to_keep: Strategy for ranking within clusters ("hard", "easy", "random")
eps: Epsilon value for duplicate identification (None to skip)
kmeans_max_iter: Maximum number of iterations for K-means clustering
kmeans_tol: Tolerance for K-means convergence
kmeans_random_state: Random state for K-means (None for random)
kmeans_init: Initialization method for K-means centroids
kmeans_n_init: Number of K-means initialization runs
kmeans_oversampling_factor: Oversampling factor for K-means
kmeans_max_samples_per_batch: Maximum samples per batch for K-means
ranking_strategy: Custom ranking strategy for documents within clusters (None uses which_to_keep/distance_metric)
pairwise_batch_size: Batch size for pairwise similarity computation
_duplicates_num_row_groups_hint: Hint for number of row groups in duplicates output

# ID generator parameters
use_id_generator: Whether to use ID generator for document IDs
id_generator_state_file: Path to save/load ID generator state (auto-generated if None)

# I/O parameters
input_files_per_partition: Number of files per partition for reading
input_blocksize: Blocksize for reading files
input_filetype: Type of input files ("jsonl" or "parquet")
input_file_extensions: List of file extensions to process
output_filetype: Type of output files ("jsonl" or "parquet")
output_file_extension: File extension for output files (None for default)
output_fields: List of fields to include in final output (None for all fields)
read_kwargs: Keyword arguments for reading files
cache_kwargs: Keyword arguments for cache operations and storage
write_kwargs: Keyword arguments for writing files

# Execution parameters
verbose: Enable verbose output
clear_output: Clear output directory before running
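
A configuration sketch grouping the parameters above. Every value shown is illustrative (most mirror the documented defaults); the import path and file paths are assumptions:

```python
from nemo_curator.stages.text.deduplication.semantic import (
    TextSemanticDeduplicationWorkflow,
)

workflow = TextSemanticDeduplicationWorkflow(
    # I/O parameters
    input_path=["/data/shard_00", "/data/shard_01"],  # placeholder shards
    input_filetype="parquet",
    output_path="/data/deduped",
    output_filetype="parquet",
    cache_path="/data/dedup_cache",
    # Embedding generation parameters
    text_field="text",
    embedding_field="embeddings",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    embedding_max_seq_length=512,
    embedding_pooling="mean_pooling",
    embedding_model_inference_batch_size=256,
    # Semantic deduplication parameters
    n_clusters=100,
    distance_metric="cosine",
    which_to_keep="hard",
    eps=0.01,
    kmeans_max_iter=300,
    pairwise_batch_size=1024,
    # Execution parameters
    perform_removal=True,
    verbose=True,
    clear_output=True,
)
```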
distance_metric: Literal[cosine, l2]#

‘cosine’

embedding_dim: int | None#

None

embedding_field: str#

‘embeddings’

embedding_max_chars: int | None#

None

embedding_max_seq_length: int#

512

embedding_model_inference_batch_size: int#

256

embedding_padding_side: Literal[left, right]#

‘right’

embedding_pooling: Literal[mean_pooling, last_token]#

‘mean_pooling’

eps: float | None#

0.01

hf_token: str | None#

None

id_field: str#

None

id_generator_state_file: str | None#

None

input_blocksize: int | None#

None

input_file_extensions: list[str] | None#

None

input_files_per_partition: int | None#

None

input_filetype: Literal[jsonl, parquet]#

‘parquet’

input_path: str | list[str]#

None

kmeans_init: str#

‘k-means||’

kmeans_max_iter: int#

300

kmeans_max_samples_per_batch: int#

None

kmeans_n_init: int | Literal[auto]#

1

kmeans_oversampling_factor: float#

2.0

kmeans_random_state: int#

42

kmeans_tol: float#

0.0001

metadata_fields: list[str] | None#

None

model_identifier: str#

‘sentence-transformers/all-MiniLM-L6-v2’

n_clusters: int#

100

output_fields: list[str] | None#

None

output_file_extension: str | None#

None

output_filetype: Literal[jsonl, parquet]#

‘parquet’

output_path: str#

None

pairwise_batch_size: int#

1024

perform_removal: bool#

True

ranking_strategy: nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy | None#

None

read_kwargs: dict[str, Any]#

‘field(…)’

run(
executor: nemo_curator.backends.base.BaseExecutor | tuple[nemo_curator.backends.base.BaseExecutor, nemo_curator.backends.base.BaseExecutor, nemo_curator.backends.base.BaseExecutor] | None = None,
) -> dict[str, Any]#

Run the complete text semantic deduplication workflow.

Returns: Dictionary with results and timing information from all stages
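
A sketch of the two ways the executor argument can be supplied, per the signature above. The helper functions and the phase ordering of the tuple are assumptions for illustration; any concrete `BaseExecutor` subclass available in your installation would stand in for the arguments:

```python
from typing import Any

from nemo_curator.backends.base import BaseExecutor
from nemo_curator.stages.text.deduplication.semantic import (
    TextSemanticDeduplicationWorkflow,
)


def run_with_single_executor(
    workflow: TextSemanticDeduplicationWorkflow, executor: BaseExecutor
) -> dict[str, Any]:
    # One executor drives every phase: embedding, pairwise dedup, removal.
    return workflow.run(executor)


def run_with_per_phase_executors(
    workflow: TextSemanticDeduplicationWorkflow,
    embedding_executor: BaseExecutor,
    pairwise_executor: BaseExecutor,
    removal_executor: BaseExecutor,
) -> dict[str, Any]:
    # A 3-tuple assigns one executor per phase; the ordering shown here
    # (embedding, pairwise, removal) is an assumption based on the class docs.
    return workflow.run((embedding_executor, pairwise_executor, removal_executor))
```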

text_field: str#

‘text’

use_id_generator: bool#

False

verbose: bool#

True

which_to_keep: Literal[hard, easy, random]#

‘hard’

write_kwargs: dict[str, Any]#

‘field(…)’