stages.deduplication.semantic.pairwise#
Module Contents#
Classes#
| Class | Description |
|---|---|
| PairwiseCosineSimilarityStage | Pairwise cosine similarity stage that computes similarity within clusters. |
| PairwiseStage | Pairwise similarity stage for semantic deduplication. |
Functions#
| Function | Description |
|---|---|
| pairwise_cosine_similarity_batched | Computes pairwise cosine similarity between cluster items in batches, replacing the diagonal with zeros to ignore self-similarity. Useful for large clusters whose full pairwise similarity matrix does not fit into memory. |
API#
- class stages.deduplication.semantic.pairwise.PairwiseCosineSimilarityStage(
- id_field: str,
- embedding_field: str,
- output_path: str,
- ranking_strategy: stages.deduplication.semantic.ranking.RankingStrategy,
- pairwise_batch_size: int = 1024,
- verbose: bool = False,
- embedding_dim: int | None = None,
- read_kwargs: dict[str, Any] | None = None,
- write_kwargs: dict[str, Any] | None = None,
- )#
Bases:
nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask], nemo_curator.stages.deduplication.io_utils.DeduplicationIO
Pairwise cosine similarity stage that computes similarity within clusters.
Initialization
Initialize the pairwise cosine similarity stage.
Args:
- id_field: The column name of the ID column.
- embedding_field: The column name of the embedding column.
- output_path: The path to the output directory.
- ranking_strategy: Strategy for ranking/sorting clusters before similarity computation.
- pairwise_batch_size: Batch size for pairwise similarity computation.
- verbose: Whether to print verbose output.
- embedding_dim: Embedding dimension for memory estimation.
- read_kwargs: Kwargs for reading Parquet files.
- write_kwargs: Kwargs for writing Parquet files.
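A minimal construction sketch for this stage; the import paths mirror the module path above, while the output directory and the RankingStrategy configuration are placeholders rather than values taken from this page.

```python
# Illustrative only: import paths follow this page's module path, and the
# RankingStrategy call is a placeholder (its real constructor may need arguments).
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseCosineSimilarityStage
from nemo_curator.stages.deduplication.semantic.ranking import RankingStrategy

pairwise_stage = PairwiseCosineSimilarityStage(
    id_field="id",
    embedding_field="embeddings",
    output_path="/path/to/pairwise_output",  # hypothetical output directory
    ranking_strategy=RankingStrategy(),       # placeholder configuration
    pairwise_batch_size=1024,
    verbose=False,
)
```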
- process(
- task: nemo_curator.tasks.FileGroupTask,
- )#
Process a PairwiseFileGroupTask to compute pairwise similarities.
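A hedged call sketch, continuing from the construction example above; `pairwise_task` stands in for a FileGroupTask produced by an upstream clustering stage and is an assumption, not an object documented here.

```python
# pairwise_task: assumed nemo_curator.tasks.FileGroupTask pointing at one
# cluster's embedding files. Per the base class ProcessingStage[FileGroupTask,
# FileGroupTask], process() returns a FileGroupTask referencing the stage output.
output_task = pairwise_stage.process(pairwise_task)
```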
- class stages.deduplication.semantic.pairwise.PairwiseStage#
Bases:
nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.FileGroupTask]

Pairwise similarity stage for semantic deduplication.
Initialization
- decompose() list[nemo_curator.stages.base.ProcessingStage] #
Decompose into execution stages.
This method must be implemented by composite stages to define what low-level stages they represent.
Returns (list[ProcessingStage]): List of execution stages that will actually run.
- embedding_dim: int | None#
None
- embedding_field: str#
None
- id_field: str#
None
- input_path: str#
None
- output_path: str#
None
- pairwise_batch_size: int#
1024
- random_seed: int#
42
- ranking_strategy: stages.deduplication.semantic.ranking.RankingStrategy | None#
None
- read_kwargs: dict[str, Any] | None#
None
- sim_metric: Literal[cosine, l2]#
'cosine'
- verbose: bool#
False
- which_to_keep: Literal[hard, easy, random]#
'hard'
- write_kwargs: dict[str, Any] | None#
None
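A hedged construction sketch for the composite stage, using the field names documented above; the input and output paths are placeholders, and the real constructor may require or validate additional arguments.

```python
# Illustrative only: field names come from the attribute list above; the paths
# are hypothetical placeholders.
from nemo_curator.stages.deduplication.semantic.pairwise import PairwiseStage

stage = PairwiseStage(
    id_field="id",
    embedding_field="embeddings",
    input_path="/path/to/kmeans_clusters",   # hypothetical input directory
    output_path="/path/to/pairwise_output",  # hypothetical output directory
    pairwise_batch_size=1024,
    sim_metric="cosine",
    which_to_keep="hard",
    random_seed=42,
    verbose=False,
)

# As a CompositeStage, decompose() returns the low-level ProcessingStage
# objects that actually execute (here, the pairwise similarity computation).
low_level_stages = stage.decompose()
```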
- stages.deduplication.semantic.pairwise.pairwise_cosine_similarity_batched(
- cluster_reps: torch.Tensor,
- batch_size: int = 1024,
- )#
Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory. A batched approach computes the similarity matrix block by block, so memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size, instead of O(N^2) for the full matrix.
TODO: In the future, we can estimate the memory requirement and calculate the batch size dynamically.
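The batched approach described above can be sketched as follows. This is an illustrative reimplementation of the technique, not the library's actual function: it reduces each row to its maximum off-diagonal similarity so that only one O(N*B) block is held in memory at a time, and the real function's return values may differ.

```python
# Sketch only, not the library implementation: batched pairwise cosine
# similarity with the self-similarity diagonal zeroed out.
import torch
import torch.nn.functional as F


def batched_max_cosine_similarity(cluster_reps: torch.Tensor, batch_size: int = 1024):
    # Normalize rows once so a matrix product yields cosine similarities.
    reps = F.normalize(cluster_reps, dim=1)
    n = reps.shape[0]
    max_sim = torch.empty(n, dtype=reps.dtype, device=reps.device)
    max_idx = torch.empty(n, dtype=torch.long, device=reps.device)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # (B, N) similarity block: only O(N * B) memory at a time.
        sims = reps[start:end] @ reps.T
        # Zero the self-similarity entries (the diagonal of the full matrix).
        rows = torch.arange(start, end, device=reps.device)
        sims[rows - start, rows] = 0.0
        # Keep only each item's largest similarity and its index.
        max_sim[start:end], max_idx[start:end] = sims.max(dim=1)
    return max_sim, max_idx
```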