nemo_curator.stages.deduplication.semantic.pairwise
Module Contents
Classes
Functions
API
Bases: ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO
Pairwise cosine similarity stage that computes similarity within clusters.
Process a PairwiseFileGroupTask to compute pairwise similarities.
Bases: CompositeStage[_EmptyTask, FileGroupTask]
Pairwise similarity stage for semantic deduplication.
Initialize parent class after dataclass initialization.
Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory, because it computes the matrix in batches. Memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size, instead of O(N^2) for the full matrix.
TODO: In the future, we can estimate the memory requirement and calculate the batch size dynamically.
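The batched approach described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the function name `batched_max_cosine_similarity`, its signature, and the choice to keep only each item's maximum similarity and the index of that neighbor are assumptions made for illustration. The point it shows is that only a B x N block of the similarity matrix exists at any time.

```python
import numpy as np


def batched_max_cosine_similarity(embeddings, batch_size=1024):
    """Hypothetical helper: per-item max cosine similarity, computed in batches.

    Returns (max_sim, max_idx), where max_sim[i] is the largest cosine
    similarity between item i and any other item (self-similarity is zeroed),
    and max_idx[i] is the index of that item.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    n = x.shape[0]

    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)

    max_sim = np.zeros(n)
    max_idx = np.zeros(n, dtype=np.int64)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)

        # One (B, N) block of the similarity matrix: O(N*B) memory.
        block = x[start:end] @ x.T

        # Zero the entries that lie on the diagonal of the full matrix,
        # so an item is never matched with itself.
        rows = np.arange(end - start)
        block[rows, start + rows] = 0.0

        max_idx[start:end] = block.argmax(axis=1)
        max_sim[start:end] = block.max(axis=1)

    return max_sim, max_idx


if __name__ == "__main__":
    # Items 0 and 1 are identical; item 2 is orthogonal to both.
    sims, idx = batched_max_cosine_similarity([[1, 0], [1, 0], [0, 1]], batch_size=2)
    print(sims, idx)
```

Because each block is reduced before the next one is computed, the result does not depend on the batch size. Choosing B therefore only trades memory for the number of matrix multiplications.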