stages.deduplication.semantic.pairwise#

Module Contents#

Classes#

PairwiseCosineSimilarityStage

Pairwise cosine similarity stage that computes similarity within clusters.

PairwiseStage

Pairwise similarity stage for semantic deduplication.

Functions#

pairwise_cosine_similarity_batched

Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory. The matrix is computed in batches, so memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size, instead of O(N^2) for the full matrix.

API#

class stages.deduplication.semantic.pairwise.PairwiseCosineSimilarityStage(
id_field: str,
embedding_field: str,
output_path: str,
ranking_strategy: stages.deduplication.semantic.ranking.RankingStrategy,
pairwise_batch_size: int = 1024,
verbose: bool = False,
embedding_dim: int | None = None,
read_kwargs: dict[str, Any] | None = None,
write_kwargs: dict[str, Any] | None = None,
)#

Bases: nemo_curator.stages.base.ProcessingStage[nemo_curator.tasks.FileGroupTask, nemo_curator.tasks.FileGroupTask], nemo_curator.stages.deduplication.io_utils.DeduplicationIO

Pairwise cosine similarity stage that computes similarity within clusters.

Initialization

Initialize the pairwise cosine similarity stage.

Args:
id_field: The column name of the id column.
embedding_field: The column name of the embedding column.
output_path: The path to the output directory.
ranking_strategy: Strategy for ranking/sorting clusters before similarity computation.
pairwise_batch_size: Batch size for pairwise similarity computation.
verbose: Whether to print verbose output.
embedding_dim: Embedding dimension for memory estimation.
read_kwargs: Kwargs for reading parquet files.
write_kwargs: Kwargs for writing parquet files.

process(
task: nemo_curator.tasks.FileGroupTask,
) nemo_curator.tasks.FileGroupTask#

Process a FileGroupTask to compute pairwise similarities.

class stages.deduplication.semantic.pairwise.PairwiseStage#

Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.FileGroupTask]

Pairwise similarity stage for semantic deduplication.

Initialization

decompose() list[nemo_curator.stages.base.ProcessingStage]#

Decompose into execution stages.

This method must be implemented by composite stages to define what low-level stages they represent.

Returns (list[ProcessingStage]): List of execution stages that will actually run

embedding_dim: int | None#

None

embedding_field: str#

None

id_field: str#

None

input_path: str#

None

output_path: str#

None

pairwise_batch_size: int#

1024

random_seed: int#

42

ranking_strategy: stages.deduplication.semantic.ranking.RankingStrategy | None#

None

read_kwargs: dict[str, Any] | None#

None

sim_metric: Literal['cosine', 'l2']#

'cosine'

verbose: bool#

False

which_to_keep: Literal['hard', 'easy', 'random']#

'hard'

write_kwargs: dict[str, Any] | None#

None

stages.deduplication.semantic.pairwise.pairwise_cosine_similarity_batched(
cluster_reps: torch.Tensor,
batch_size: int = 1024,
) tuple[cupy.ndarray, cupy.ndarray] | tuple[numpy.ndarray, numpy.ndarray]#

Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory. The matrix is computed in batches, so memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size, instead of O(N^2) for the full matrix.
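
The batched approach described above can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the function name `batched_max_cosine_similarity` is hypothetical, and the sketch assumes that the two returned arrays hold, for each item, its highest similarity to any other item and the index of that item. Only an N x B block of the similarity matrix exists at any time.

```python
import numpy as np


def batched_max_cosine_similarity(embeddings, batch_size=1024):
    """Illustrative sketch (hypothetical helper, not the library function).

    Returns, for every row of `embeddings`, the largest cosine similarity to
    any other row and the index of that row, holding only an N x B block of
    the similarity matrix in memory at a time.
    """
    n = embeddings.shape[0]
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    max_sim = np.zeros(n, dtype=np.float32)
    max_idx = np.zeros(n, dtype=np.int64)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # Block of shape (N, B): similarity of every item to the batch items.
        sim = unit @ unit[start:end].T
        # Zero the entries where an item is compared with itself.
        cols = np.arange(end - start)
        sim[start + cols, cols] = 0.0
        # Reduce over all items to get the best match for each batch item.
        max_idx[start:end] = sim.argmax(axis=0)
        max_sim[start:end] = sim.max(axis=0)

    return max_sim, max_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(10, 4))
    sims, idxs = batched_max_cosine_similarity(demo, batch_size=3)
    print(sims, idxs)
```

Because the self-similarity entries are set to zero before the reduction, an item whose similarities to all other items are negative reports a maximum of zero in this sketch.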

TODO: In the future, estimate the memory requirement and calculate the batch size dynamically.
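
To make the O(N*B) versus O(N^2) comparison concrete, the following sketch counts the bytes needed for the similarity block alone. The helper `similarity_bytes` is hypothetical, and the 4 bytes per value assumes float32 similarities; embeddings and other buffers are ignored.

```python
def similarity_bytes(n_items, batch_size=None, bytes_per_value=4):
    """Bytes held by the similarity block (hypothetical helper).

    With batch_size=None the full N x N matrix is assumed; otherwise an
    N x B block, where B is capped at N. Assumes float32 (4 bytes) values.
    """
    cols = n_items if batch_size is None else min(batch_size, n_items)
    return n_items * cols * bytes_per_value


if __name__ == "__main__":
    n = 1_000_000
    full = similarity_bytes(n)               # full N x N matrix
    batched = similarity_bytes(n, 1024)      # one N x B block
    print(f"full: {full / 1e12:.1f} TB, batched: {batched / 1e9:.1f} GB")
```

Under these assumptions, a cluster of one million items needs about 4 TB for the full matrix but about 4.1 GB per block at the default batch size of 1024.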