stages.text.deduplication.semantic#
Monolithic Text Semantic Deduplication Workflow.
This module contains a complete end-to-end workflow for text semantic deduplication:

- Embedding generation from text data
- Semantic deduplication using clustering and pairwise similarity
- Optional duplicate removal based on identified duplicates
Module Contents#
Classes#
| Class | Description |
|---|---|
| TextSemanticDeduplicationWorkflow | Monolithic workflow for end-to-end text semantic deduplication. |
API#
- class stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow#
Monolithic workflow for end-to-end text semantic deduplication.
This workflow combines:

- Text embedding generation (configurable executor)
- Semantic deduplication (configurable executor for the pairwise stage)
- Duplicate removal (configurable executor)
Supports flexible executor configuration: a single executor can drive all stages, or different executors can be assigned to different phases (see the sketch below).
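A minimal usage sketch of the two executor configurations. The paths are placeholders, the executors are stand-ins for any concrete `nemo_curator.backends.base.BaseExecutor` subclass available in your installation, and the per-phase tuple order (embedding, pairwise, removal) is an assumption based on the phase list above.

```python
from nemo_curator.stages.text.deduplication.semantic import (
    TextSemanticDeduplicationWorkflow,
)

# Stand-ins: substitute any concrete BaseExecutor subclass
# from nemo_curator.backends available in your installation.
embed_executor = pairwise_executor = removal_executor = ...

workflow = TextSemanticDeduplicationWorkflow(
    input_path="/data/corpus/",           # file(s) containing text data
    output_path="/data/corpus_deduped/",  # deduplicated output (or IDs to remove)
    cache_path="/data/semdedup_cache/",   # intermediate embeddings/kmeans/pairwise
)

# Option 1: one executor drives all stages.
results = workflow.run(executor=embed_executor)

# Option 2: one executor per phase. The order (embedding generation,
# pairwise deduplication, duplicate removal) is assumed from the
# phase list above, not confirmed by this reference.
results = workflow.run(
    executor=(embed_executor, pairwise_executor, removal_executor)
)
```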
- cache_kwargs: dict[str, Any]#
‘field(…)’
- cache_path: str | None#
None
- clear_output: bool#
True
Initialize the text semantic deduplication workflow.
Args:

- input_path: Path(s) to input files containing text data
- output_path: Directory to write the deduplicated output (or the IDs to remove)
- cache_path: Directory to cache intermediate results (embeddings, kmeans, pairwise, etc.)
- perform_removal: Whether to perform duplicate removal (True) or only identify duplicates (False)

Embedding generation parameters:

- text_field: Name of the text field in input data
- embedding_field: Name of the embedding field to create
- model_identifier: HuggingFace model identifier for embeddings
- embedding_max_seq_length: Maximum sequence length for tokenization
- embedding_max_chars: Maximum number of characters for tokenization
- embedding_padding_side: Padding side for tokenization
- embedding_pooling: Pooling strategy for embeddings
- embedding_model_inference_batch_size: Batch size for model inference
- hf_token: HuggingFace token for private models

Semantic deduplication parameters:

- n_clusters: Number of clusters for K-means
- id_field: Name of the ID field in the data
- embedding_dim: Embedding dimension (used for memory estimation)
- metadata_fields: List of metadata field names to preserve
- distance_metric: Distance metric for similarity ("cosine" or "l2")
- which_to_keep: Strategy for ranking within clusters ("hard", "easy", or "random")
- eps: Epsilon value for duplicate identification (None to skip)
- kmeans_max_iter: Maximum number of iterations for K-means clustering
- kmeans_tol: Tolerance for K-means convergence
- kmeans_random_state: Random state for K-means (None for random)
- kmeans_init: Initialization method for K-means centroids
- kmeans_n_init: Number of K-means initialization runs
- kmeans_oversampling_factor: Oversampling factor for K-means
- kmeans_max_samples_per_batch: Maximum samples per batch for K-means
- ranking_strategy: Custom ranking strategy for documents within clusters (None uses which_to_keep/distance_metric)
- pairwise_batch_size: Batch size for pairwise similarity computation
- _duplicates_num_row_groups_hint: Hint for the number of row groups in the duplicates output

ID generator parameters:

- use_id_generator: Whether to use an ID generator for document IDs
- id_generator_state_file: Path to save/load ID generator state (auto-generated if None)

I/O parameters:

- input_files_per_partition: Number of files per partition for reading
- input_blocksize: Blocksize for reading files
- input_filetype: Type of input files ("jsonl" or "parquet")
- input_file_extensions: List of file extensions to process
- output_filetype: Type of output files ("jsonl" or "parquet")
- output_file_extension: File extension for output files (None for the default)
- output_fields: List of fields to include in the final output (None for all fields)
- read_kwargs: Keyword arguments for reading files
- cache_kwargs: Keyword arguments for cache operations and storage
- write_kwargs: Keyword arguments for writing files

Execution parameters:

- verbose: Enable verbose output
- clear_output: Clear the output directory before running
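An illustrative instantiation touching the main parameter groups above. All paths and values are example choices, not recommended defaults; where shown, defaults match those documented on the attributes below.

```python
from nemo_curator.stages.text.deduplication.semantic import (
    TextSemanticDeduplicationWorkflow,
)

workflow = TextSemanticDeduplicationWorkflow(
    input_path="/data/corpus/",             # jsonl or parquet shards
    output_path="/data/corpus_deduped/",
    cache_path="/data/semdedup_cache/",
    perform_removal=True,                   # False: only write the IDs to remove
    # Embedding generation
    text_field="text",
    model_identifier="sentence-transformers/all-MiniLM-L6-v2",  # documented default
    embedding_model_inference_batch_size=256,
    # Semantic deduplication
    n_clusters=1000,                        # example value; documented default is 100
    distance_metric="cosine",
    which_to_keep="hard",
    eps=0.01,                               # epsilon for duplicate identification
    # I/O
    input_filetype="parquet",
    output_filetype="parquet",
    verbose=True,
)
```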
- distance_metric: Literal[cosine, l2]#
‘cosine’
- embedding_dim: int | None#
None
- embedding_field: str#
‘embeddings’
- embedding_max_chars: int | None#
None
- embedding_max_seq_length: int#
512
- embedding_model_inference_batch_size: int#
256
- embedding_padding_side: Literal[left, right]#
‘right’
- embedding_pooling: Literal[mean_pooling, last_token]#
‘mean_pooling’
- eps: float | None#
0.01
- hf_token: str | None#
None
- id_field: str#
None
- id_generator_state_file: str | None#
None
- input_blocksize: int | None#
None
- input_file_extensions: list[str] | None#
None
- input_files_per_partition: int | None#
None
- input_filetype: Literal[jsonl, parquet]#
‘parquet’
- input_path: str | list[str]#
None
- kmeans_init: str#
‘k-means||’
- kmeans_max_iter: int#
300
- kmeans_max_samples_per_batch: int#
None
- kmeans_n_init: int | Literal[auto]#
1
- kmeans_oversampling_factor: float#
2.0
- kmeans_random_state: int#
42
- kmeans_tol: float#
0.0001
- metadata_fields: list[str] | None#
None
- model_identifier: str#
‘sentence-transformers/all-MiniLM-L6-v2’
- n_clusters: int#
100
- output_fields: list[str] | None#
None
- output_file_extension: str | None#
None
- output_filetype: Literal[jsonl, parquet]#
‘parquet’
- output_path: str#
None
- pairwise_batch_size: int#
1024
- perform_removal: bool#
True
- ranking_strategy: nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy | None#
None
- read_kwargs: dict[str, Any]#
‘field(…)’
- run(executor: nemo_curator.backends.base.BaseExecutor | tuple[nemo_curator.backends.base.BaseExecutor, nemo_curator.backends.base.BaseExecutor, nemo_curator.backends.base.BaseExecutor] | None = None)#
Run the complete text semantic deduplication workflow.
Returns: Dictionary with results and timing information from all stages
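A short sketch of invoking `run()`, continuing from the instantiation above. Since the exact keys of the returned dictionary are not documented here, the example iterates over whatever is returned rather than assuming specific entry names.

```python
# Run with the default executor handling (executor=None) and print
# the per-stage results/timing entries the workflow returns.
results = workflow.run()

for stage, info in results.items():
    print(f"{stage}: {info}")
```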
- text_field: str#
‘text’
- use_id_generator: bool#
False
- verbose: bool#
True
- which_to_keep: Literal[hard, easy, random]#
‘hard’
- write_kwargs: dict[str, Any]#
‘field(…)’