
Copyright © 2026, NVIDIA Corporation.


nemo_curator.stages.deduplication.semantic.pairwise

Module Contents

Classes

| Name | Description |
| --- | --- |
| PairwiseCosineSimilarityStage | Pairwise cosine similarity stage that computes similarity within clusters. |
| PairwiseStage | Pairwise similarity stage for semantic deduplication. |

Functions

| Name | Description |
| --- | --- |
| pairwise_cosine_similarity_batched | Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. |

API

class nemo_curator.stages.deduplication.semantic.pairwise.PairwiseCosineSimilarityStage(
id_field: str,
embedding_field: str,
output_path: str,
ranking_strategy: nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy,
pairwise_batch_size: int = 1024,
verbose: bool = False,
embedding_dim: int | None = None,
read_kwargs: dict[str, typing.Any] | None = None,
write_kwargs: dict[str, typing.Any] | None = None
)

Bases: ProcessingStage[FileGroupTask, FileGroupTask], DeduplicationIO

Pairwise cosine similarity stage that computes similarity within clusters.

input_storage_options
name = 'PairwiseCosineSimilarityStage'
output_storage_options
read_kwargs
resources = Resources(cpus=1.0, gpus=1.0)
write_kwargs

nemo_curator.stages.deduplication.semantic.pairwise.PairwiseCosineSimilarityStage.process(
task: nemo_curator.tasks.FileGroupTask
) -> nemo_curator.tasks.FileGroupTask

Process a FileGroupTask to compute pairwise similarities.

class nemo_curator.stages.deduplication.semantic.pairwise.PairwiseStage(
id_field: str,
embedding_field: str,
input_path: str,
output_path: str,
ranking_strategy: nemo_curator.stages.deduplication.semantic.ranking.RankingStrategy | None = None,
embedding_dim: int | None = None,
pairwise_batch_size: int = 1024,
verbose: bool = False,
read_kwargs: dict[str, typing.Any] | None = None,
write_kwargs: dict[str, typing.Any] | None = None,
which_to_keep: typing.Literal['hard', 'easy', 'random'] = 'hard',
sim_metric: typing.Literal['cosine', 'l2'] = 'cosine',
random_seed: int = 42
)
Dataclass

Bases: CompositeStage[_EmptyTask, FileGroupTask]

Pairwise similarity stage for semantic deduplication.

embedding_dim: int | None = None
embedding_field: str
id_field: str
input_path: str
output_path: str
pairwise_batch_size: int = 1024
random_seed: int = 42
ranking_strategy: RankingStrategy | None = None
read_kwargs: dict[str, Any] | None = None
sim_metric: Literal['cosine', 'l2'] = 'cosine'
verbose: bool = False
which_to_keep: Literal['hard', 'easy', 'random'] = 'hard'
write_kwargs: dict[str, Any] | None = None

nemo_curator.stages.deduplication.semantic.pairwise.PairwiseStage.__post_init__()

Initialize parent class after dataclass initialization.

nemo_curator.stages.deduplication.semantic.pairwise.PairwiseStage.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
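
The `__post_init__` and `decompose()` entries above follow a common pattern: a dataclass that inherits from a non-dataclass base must call the base initializer itself, and a composite stage expands into a list of concrete stages. The sketch below illustrates that pattern only. `Stage`, `Composite`, `ToyPairwise`, and the stage names are hypothetical stand-ins and are not NeMo Curator classes, and the actual stages returned by `PairwiseStage.decompose()` are not listed on this page.

```python
from dataclasses import dataclass


class Stage:
    """Hypothetical stand-in for a single processing stage."""

    def __init__(self, name: str) -> None:
        self.name = name


class Composite:
    """Hypothetical stand-in for a composite-stage base class (not a dataclass)."""

    def __init__(self) -> None:
        self.initialized = True

    def decompose(self) -> list[Stage]:
        raise NotImplementedError


@dataclass
class ToyPairwise(Composite):
    # Field names mirror a few PairwiseStage parameters for readability only.
    input_path: str
    output_path: str
    pairwise_batch_size: int = 1024

    def __post_init__(self) -> None:
        # The dataclass-generated __init__ does not call the base __init__,
        # so the parent is initialized here, after the fields are set.
        super().__init__()

    def decompose(self) -> list[Stage]:
        # Assumption: a composite expands into an ordered list of stages.
        # These two stage names are invented for illustration.
        return [
            Stage(f"list_files:{self.input_path}"),
            Stage(f"pairwise_similarity:batch={self.pairwise_batch_size}"),
        ]


stage = ToyPairwise(input_path="clusters/", output_path="pairwise/")
stage_names = [s.name for s in stage.decompose()]
```

Here `stage.initialized` is set by the base initializer via `__post_init__`, and `stage_names` holds the names of the two toy stages in execution order.
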
nemo_curator.stages.deduplication.semantic.pairwise.pairwise_cosine_similarity_batched(
cluster_reps: torch.Tensor,
batch_size: int = 1024
) -> tuple[cupy.ndarray, cupy.ndarray] | tuple[numpy.ndarray, numpy.ndarray]

Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory: the matrix is computed in batches, so memory requirements are O(N*B) instead of O(N^2) for the full matrix, where N is the number of items in the cluster and B is the batch size.

TODO: In the future, the memory requirement could be estimated and the batch size calculated dynamically.
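
The batched approach described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the real function takes a `torch.Tensor` and returns CuPy or NumPy arrays, and this page does not say what the two returned arrays contain. The sketch assumes they are the maximum similarity per item and the index of the item that attains it. The name `batched_max_cosine_similarity` is hypothetical.

```python
import numpy as np


def batched_max_cosine_similarity(
    embeddings: np.ndarray, batch_size: int = 1024
) -> tuple[np.ndarray, np.ndarray]:
    """Return (max similarity, index of most similar item) for each row.

    Only a (batch_size x N) block of the similarity matrix exists at any
    time, so peak memory is O(N*B) rather than O(N^2).
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    n = unit.shape[0]
    max_sim = np.zeros(n, dtype=unit.dtype)
    max_idx = np.zeros(n, dtype=np.int64)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # Similarity of this batch against every item: shape (B, N).
        sim = unit[start:end] @ unit.T
        # Zero the diagonal entries that fall in this block (self-similarity).
        rows = np.arange(end - start)
        sim[rows, start + rows] = 0.0
        max_idx[start:end] = sim.argmax(axis=1)
        max_sim[start:end] = sim.max(axis=1)

    return max_sim, max_idx


# Items 0 and 1 are identical; item 2 is orthogonal to both.
points = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sims_small_batch, idx_small_batch = batched_max_cosine_similarity(points, batch_size=2)
sims_one_batch, idx_one_batch = batched_max_cosine_similarity(points, batch_size=1024)
```

The result does not depend on the batch size; only peak memory does. As an illustrative calculation, with N = 1,000,000 float32 embeddings and B = 1024, one block holds N*B*4 bytes, about 4.1 GB, whereas the full N^2 matrix would need about 4 TB.
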