stages.text.deduplication.semantic#

Monolithic Text Semantic Deduplication Workflow.

This module contains a complete end-to-end workflow for text semantic deduplication:

  1. Embedding generation from text data

  2. Semantic deduplication using clustering and pairwise similarity

  3. Optional duplicate removal based on identified duplicates

Module Contents#

Classes#

TextSemanticDeduplicationWorkflow

Monolithic workflow for end-to-end text semantic deduplication.

API#

class stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow#

Monolithic workflow for end-to-end text semantic deduplication.

This workflow combines:

  1. Text embedding generation (configurable executor)

  2. Semantic deduplication (configurable executor for pairwise stage)

  3. Duplicate removal (configurable executor)

Supports flexible executor configuration: a single executor can be used for all stages, or different executors can be assigned to different phases.
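A minimal quick-start sketch, assuming the class is importable as `nemo_curator.stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow` (the package prefix, the paths, and the default-executor behavior noted in the comments are assumptions, not confirmed by this page):

```python
from nemo_curator.stages.text.deduplication.semantic import (
    TextSemanticDeduplicationWorkflow,
)

# Configure the three phases: embedding generation, semantic dedup, removal.
workflow = TextSemanticDeduplicationWorkflow(
    input_path="/data/corpus",       # placeholder input location
    output_path="/data/deduped",     # placeholder output location
    cache_path="/data/dedup_cache",  # placeholder cache for intermediate results
)

# Run end to end. Passing no executor relies on whatever default the
# implementation provides (an assumption; see run() below for explicit executors).
results = workflow.run()
```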

cache_kwargs: dict[str, Any]#

‘field(…)’

cache_path: str | None#

None

clear_output: bool#

True

Initialize the text semantic deduplication workflow.

Args:

input_path: Path(s) to input files containing text data
output_path: Directory to write the deduplicated output (or the IDs to remove)
cache_path: Directory to cache intermediate results (embeddings, kmeans, pairwise, etc.)
perform_removal: Whether to perform duplicate removal (True) or just identify duplicates (False)

# Embedding generation parameters
text_field: Name of the text field in input data
embedding_field: Name of the embedding field to create
model_identifier: HuggingFace model identifier for embeddings
embedding_max_seq_length: Maximum sequence length for tokenization
embedding_max_chars: Maximum number of characters for tokenization
embedding_padding_side: Padding side for tokenization
embedding_pooling: Pooling strategy for embeddings
embedding_model_inference_batch_size: Batch size for model inference
hf_token: HuggingFace token for private models

# Semantic deduplication parameters
n_clusters: Number of clusters for K-means
id_field: Name of the ID field in the data
embedding_dim: Embedding dimension (for memory estimation)
metadata_fields: List of metadata field names to preserve
distance_metric: Distance metric for similarity ("cosine" or "l2")
which_to_keep: Strategy for ranking within clusters ("hard", "easy", "random")
eps: Epsilon value for duplicate identification (None to skip)
kmeans_max_iter: Maximum number of iterations for K-means clustering
kmeans_tol: Tolerance for K-means convergence
kmeans_random_state: Random state for K-means (None for random)
kmeans_init: Initialization method for K-means centroids
kmeans_n_init: Number of K-means initialization runs
kmeans_oversampling_factor: Oversampling factor for K-means
kmeans_max_samples_per_batch: Maximum samples per batch for K-means
ranking_strategy: Custom ranking strategy for documents within clusters (None uses which_to_keep/distance_metric)
pairwise_batch_size: Batch size for pairwise similarity computation
_duplicates_num_row_groups_hint: Hint for number of row groups in duplicates output

# ID generator parameters
use_id_generator: Whether to use ID generator for document IDs
id_generator_state_file: Path to save/load ID generator state (auto-generated if None)

# I/O parameters
input_files_per_partition: Number of files per partition for reading
input_blocksize: Blocksize for reading files
input_filetype: Type of input files ("jsonl" or "parquet")
input_file_extensions: List of file extensions to process
output_filetype: Type of output files ("jsonl" or "parquet")
output_file_extension: File extension for output files (None for default)
output_fields: List of fields to include in final output (None for all fields)
read_kwargs: Keyword arguments for reading files
cache_kwargs: Keyword arguments for cache operations and storage
write_kwargs: Keyword arguments for writing files

# Execution parameters
verbose: Enable verbose output
clear_output: Clear output directory before running
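
A configuration sketch grouping the parameters above. Every value shown is illustrative (most mirror the documented defaults); the import path and file paths are assumptions:

```python
from nemo_curator.stages.text.deduplication.semantic import (
    TextSemanticDeduplicationWorkflow,
)

workflow = TextSemanticDeduplicationWorkflow(
    # I/O parameters
    input_path=["/data/shard_00", "/data/shard_01"],  # placeholder shards
    input_filetype="parquet",
    output_path="/data/deduped",
    output_filetype="parquet",
    cache_path="/data/dedup_cache",
    # Embedding generation parameters
    text_field="text",
    embedding_field="embeddings",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",
    embedding_max_seq_length=512,
    embedding_pooling="mean_pooling",
    embedding_model_inference_batch_size=256,
    # Semantic deduplication parameters
    n_clusters=100,
    distance_metric="cosine",
    which_to_keep="hard",
    eps=0.01,
    kmeans_max_iter=300,
    pairwise_batch_size=1024,
    # Execution parameters
    perform_removal=True,
    verbose=True,
    clear_output=True,
)
```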
distance_metric: Literal[cosine, l2]#

‘cosine’

embedding_dim: int | None#

None

embedding_field: str#

‘embeddings’

embedding_max_chars: int | None#

None

embedding_max_seq_length: int#

512

embedding_model_inference_batch_size: int#

256

embedding_padding_side: Literal[left, right]#

‘right’

embedding_pooling: Literal[mean_pooling, last_token]#

‘mean_pooling’

eps: float | None#

0.01

hf_token: str | None#

None

id_field: str#

None

id_generator_state_file: str | None#

None

input_blocksize: int | None#

None

input_file_extensions: list[str] | None#

None

input_files_per_partition: int | None#

None

input_filetype: Literal[jsonl, parquet]#

‘parquet’

input_path: str | list[str]#

None

kmeans_init: str#

‘k-means||’

kmeans_max_iter: int#

300

kmeans_max_samples_per_batch: int#

None

kmeans_n_init: int | Literal[auto]#

1

kmeans_oversampling_factor: float#

2.0

kmeans_random_state: int#

42

kmeans_tol: float#

0.0001

metadata_fields: list[str] | None#

None

model_identifier: str#

‘sentence-transformers/all-MiniLM-L6-v2’

n_clusters: int#

100

output_fields: list[str] | None#

None

output_file_extension: str | None#

None

output_filetype: Literal[jsonl, parquet]#

‘parquet’

output_path: str#

None

pairwise_batch_size: int#

1024

perform_removal: bool#

True

ranking_strategy: nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy | None#

None

read_kwargs: dict[str, Any]#

‘field(…)’

run(
executor: nemo_curator.backends.base.BaseExecutor | tuple[nemo_curator.backends.base.BaseExecutor, nemo_curator.backends.base.BaseExecutor, nemo_curator.backends.base.BaseExecutor] | None = None,
) -> dict[str, Any]#

Run the complete text semantic deduplication workflow.

Returns: Dictionary with results and timing information from all stages
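
A sketch of the two ways the executor argument can be supplied, per the signature above. The helper functions and the phase ordering of the tuple are assumptions for illustration; any concrete `BaseExecutor` subclass available in your installation would stand in for the arguments:

```python
from typing import Any

from nemo_curator.backends.base import BaseExecutor
from nemo_curator.stages.text.deduplication.semantic import (
    TextSemanticDeduplicationWorkflow,
)


def run_with_single_executor(
    workflow: TextSemanticDeduplicationWorkflow, executor: BaseExecutor
) -> dict[str, Any]:
    # One executor drives every phase: embedding, pairwise dedup, removal.
    return workflow.run(executor)


def run_with_per_phase_executors(
    workflow: TextSemanticDeduplicationWorkflow,
    embedding_executor: BaseExecutor,
    pairwise_executor: BaseExecutor,
    removal_executor: BaseExecutor,
) -> dict[str, Any]:
    # A 3-tuple assigns one executor per phase; the ordering shown here
    # (embedding, pairwise, removal) is an assumption based on the class docs.
    return workflow.run((embedding_executor, pairwise_executor, removal_executor))
```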

text_field: str#

‘text’

use_id_generator: bool#

False

verbose: bool#

True

which_to_keep: Literal[hard, easy, random]#

‘hard’

write_kwargs: dict[str, Any]#

‘field(…)’