nemo_curator.stages.deduplication.semantic.pairwise


Module Contents

Classes

Name | Description
PairwiseCosineSimilarityStage | Pairwise cosine similarity stage that computes similarity within clusters.
PairwiseStage | Pairwise similarity stage for semantic deduplication.

Functions

Name | Description
pairwise_cosine_similarity_batched | Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity.

API

class nemo_curator.stages.deduplication.semantic.pairwise.PairwiseCosineSimilarityStage(
id_field: str,
embedding_field: str,
output_path: str,
ranking_strategy: nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy,
pairwise_batch_size: int = 1024,
verbose: bool = False,
embedding_dim: int | None = None,
read_kwargs: dict[str, typing.Any] | None = None,
write_kwargs: dict[str, typing.Any] | None = None
)

Bases: ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO

Pairwise cosine similarity stage that computes similarity within clusters.

input_storage_options
name = 'PairwiseCosineSimilarityStage'
output_storage_options
read_kwargs
resources = Resources(cpus=1.0, gpus=1.0)
write_kwargs
nemo_curator.stages.deduplication.semantic.pairwise.PairwiseCosineSimilarityStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.FileGroupTask

Process a PairwiseFileGroupTask to compute pairwise similarities.

class nemo_curator.stages.deduplication.semantic.pairwise.PairwiseStage(
id_field: str,
embedding_field: str,
input_path: str,
output_path: str,
ranking_strategy: nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy | None = None,
embedding_dim: int | None = None,
pairwise_batch_size: int = 1024,
verbose: bool = False,
read_kwargs: dict[str, typing.Any] | None = None,
write_kwargs: dict[str, typing.Any] | None = None,
which_to_keep: typing.Literal['hard', 'easy', 'random'] = 'hard',
sim_metric: typing.Literal['cosine', 'l2'] = 'cosine',
random_seed: int = 42
)
Dataclass

Bases: CompositeStage[_EmptyTask, FileGroupTask]

Pairwise similarity stage for semantic deduplication.

embedding_dim: int | None = None
embedding_field: str
id_field: str
input_path: str
output_path: str
pairwise_batch_size: int = 1024
random_seed: int = 42
ranking_strategy: RankingStrategy | None = None
read_kwargs: dict[str, Any] | None = None
sim_metric: Literal['cosine', 'l2'] = 'cosine'
verbose: bool = False
which_to_keep: Literal['hard', 'easy', 'random'] = 'hard'
write_kwargs: dict[str, Any] | None = None
nemo_curator.stages.deduplication.semantic.pairwise.PairwiseStage.__post_init__()

Initialize parent class after dataclass initialization.

nemo_curator.stages.deduplication.semantic.pairwise.PairwiseStage.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
nemo_curator.stages.deduplication.semantic.pairwise.pairwise_cosine_similarity_batched(
cluster_reps: torch.Tensor,
batch_size: int = 1024
) -> tuple[cupy.ndarray, cupy.ndarray] | tuple[numpy.ndarray, numpy.ndarray]

Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory: the matrix is computed in batches, so memory requirements are O(N*B) instead of O(N^2) for the full matrix, where N is the number of items in the cluster and B is the batch size.
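The batching idea can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the real function takes a torch.Tensor and may return CuPy arrays, and the helper name batched_max_cosine_similarity is hypothetical. The sketch also assumes that the two returned arrays hold each item's highest similarity to any other item and the index of that item; the signature above only states that a pair of arrays is returned.

```python
import numpy as np


def batched_max_cosine_similarity(embeddings, batch_size=1024):
    """Hypothetical helper: per-item max cosine similarity, computed in blocks.

    Assumption: returns (max_similarity, index_of_most_similar_item).
    """
    x = np.asarray(embeddings, dtype=np.float32)
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)

    n = x.shape[0]
    max_sim = np.zeros(n, dtype=np.float32)
    max_idx = np.zeros(n, dtype=np.int64)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # Only a (B, N) slice of the N x N matrix is materialized at a time.
        block = x[start:end] @ x.T
        # Zero the entries that lie on the diagonal of the full matrix,
        # so an item is never matched with itself.
        rows = np.arange(end - start)
        block[rows, rows + start] = 0.0
        max_sim[start:end] = block.max(axis=1)
        max_idx[start:end] = block.argmax(axis=1)

    return max_sim, max_idx


# Small demo: items 0 and 1 point in the same direction, item 2 is orthogonal.
demo_embeddings = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
demo_sim, demo_idx = batched_max_cosine_similarity(demo_embeddings, batch_size=2)
```

To put the memory bound in numbers: for N = 100,000 items stored as float32, the full N x N matrix would need 100,000 * 100,000 * 4 bytes = 40 GB, while a single block with B = 1024 needs 1024 * 100,000 * 4 bytes, roughly 410 MB.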

TODO: In the future, the memory requirement could be estimated and the batch size calculated dynamically.