`utils.semdedup_utils`#

Module Contents#

Functions#

`add_l2_cosine_dist_to_centroid`	Computes the L2 distance from each embedding in the DataFrame to its nearest centroid. Embeddings are normalized; for cosine distance, the centroids must be normalized as well.
`get_array_from_df`
`get_semantic_matches_per_cluster`	Get the semantic matches for a single cluster. Reads the cluster embeddings and then computes pairwise cosine similarity between them.
`normalize_embeddings_col_in_df`
`pairwise_cosine_similarity`	Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity.
`pairwise_cosine_similarity_batched`	Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory: the matrix is computed in batches, so memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size, instead of O(N^2) for the full matrix.
`prune_single_cluster`	Processes data for a single cluster, applying pruning based on specified epsilon.
`write_pruned_summary_file`	Writes a summary file for the pruned data.

Data#

`COSINE_DIST_TO_CENT_COL`
`L2_DIST_TO_CENT_COL`

API#

utils.semdedup_utils.COSINE_DIST_TO_CENT_COL#: 'cosine_dist_to_cent'

utils.semdedup_utils.L2_DIST_TO_CENT_COL#: 'l2_dist_to_cent'

utils.semdedup_utils.add_l2_cosine_dist_to_centroid( df: cudf.DataFrame, embedding_col: str, centroids: cupy.ndarray, ) → cudf.DataFrame#: Computes the L2 distance from each embedding in the DataFrame to its nearest centroid. Embeddings are normalized; for cosine distance, the centroids must be normalized as well.
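The distance computation described above can be illustrated with a minimal CPU sketch. It uses NumPy as a stand-in for CuPy and plain arrays instead of a cudf.DataFrame; the function name `dist_to_nearest_centroid` and the choice to pick the nearest centroid by L2 distance are assumptions for illustration, not the library's implementation.

```python
import numpy as np


def dist_to_nearest_centroid(embeddings, centroids):
    """Hypothetical helper: L2 and cosine distance to the nearest centroid."""
    emb = np.asarray(embeddings, dtype=float)
    cent = np.asarray(centroids, dtype=float)

    # Normalize each embedding to unit length.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    # L2 distance from every embedding to every centroid, shape (N, K).
    l2_all = np.linalg.norm(emb[:, None, :] - cent[None, :, :], axis=2)

    # Assumption: "nearest" means smallest L2 distance.
    nearest = l2_all.argmin(axis=1)
    l2_dist = l2_all[np.arange(emb.shape[0]), nearest]

    # For cosine distance the centroids must be unit length as well.
    cent_unit = cent / np.linalg.norm(cent, axis=1, keepdims=True)
    cosine_dist = 1.0 - np.sum(emb * cent_unit[nearest], axis=1)
    return l2_dist, cosine_dist
```

In the real function these two values would presumably be stored in the columns named by `L2_DIST_TO_CENT_COL` and `COSINE_DIST_TO_CENT_COL`; the sketch simply returns them as arrays.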

utils.semdedup_utils.get_array_from_df( df: cudf.DataFrame, embedding_col: str, ) → cupy.ndarray#

utils.semdedup_utils.get_semantic_matches_per_cluster( cluster_id: int, emb_by_clust_dir: str, id_col: str, output_dir: str, embedding_col: str, which_to_keep: Literal[hard, easy, random], sim_metric: Literal[cosine, l2], batched_cosine_similarity: int = 1024, ) → None#: Get the semantic matches for a single cluster. Reads the cluster embeddings and then computes pairwise cosine similarity between them.

utils.semdedup_utils.normalize_embeddings_col_in_df( df: cudf.DataFrame, embedding_col: str, ) → cudf.DataFrame#

utils.semdedup_utils.pairwise_cosine_similarity( cluster_reps: torch.Tensor, device: Literal[cuda, cpu], ) → tuple[cupy.ndarray, cupy.ndarray]#: Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity.
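As a rough illustration of the technique, the sketch below computes the full similarity matrix with NumPy and zeroes its diagonal. The documentation does not say what the two returned arrays contain; the sketch assumes they are each item's maximum similarity to another item and the index of that item. The name `pairwise_cosine_similarity_sketch` is hypothetical.

```python
import numpy as np


def pairwise_cosine_similarity_sketch(cluster_reps):
    """Hypothetical CPU sketch of pairwise cosine similarity in one cluster."""
    reps = np.asarray(cluster_reps, dtype=float)

    # Unit-normalize rows so a dot product equals cosine similarity.
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)

    # Full (N, N) similarity matrix: this is the O(N^2) memory cost.
    sim = unit @ unit.T

    # Zero the diagonal so an item never matches itself.
    np.fill_diagonal(sim, 0.0)

    # Assumed outputs: best similarity per item and the matching item's index.
    return sim.max(axis=1), sim.argmax(axis=1)
```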

utils.semdedup_utils.pairwise_cosine_similarity_batched( cluster_reps: torch.Tensor, device: Literal[cuda, cpu], batch_size: int = 1024, ) → tuple[cupy.ndarray, cupy.ndarray]#: Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory: the matrix is computed in batches, so memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size, instead of O(N^2) for the full matrix.
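The batching idea can be sketched as follows: only a (B, N) slice of the similarity matrix exists at any time, which is where the O(N*B) memory bound comes from. This is a NumPy illustration under the same assumption about the returned arrays (per-item maximum similarity and the index of the best match); `pairwise_cosine_similarity_batched_sketch` is a hypothetical name.

```python
import numpy as np


def pairwise_cosine_similarity_batched_sketch(cluster_reps, batch_size=1024):
    """Hypothetical CPU sketch: batched pairwise cosine similarity."""
    reps = np.asarray(cluster_reps, dtype=float)
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    n = unit.shape[0]

    max_sim = np.empty(n)
    max_idx = np.empty(n, dtype=np.int64)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)

        # Similarities of this batch against all items: shape (B, N).
        block = unit[start:end] @ unit.T

        # Zero the entries that lie on the diagonal of the full matrix.
        rows = np.arange(end - start)
        block[rows, start + rows] = 0.0

        max_sim[start:end] = block.max(axis=1)
        max_idx[start:end] = block.argmax(axis=1)

    return max_sim, max_idx
```

A smaller `batch_size` lowers peak memory at the cost of more matrix multiplications; the results do not depend on the batch size.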

utils.semdedup_utils.prune_single_cluster( cluster_id: int, id_col: str, emb_by_clust_dir: str, semdedup_pruning_tables_dir: str, eps: float, ) → cudf.DataFrame#

Processes data for a single cluster, applying pruning based on specified epsilon.

Args:
    cluster_id (int): The specific cluster ID to process.
    id_col (str): The name of the ID column.
    emb_by_clust_dir (str): Path to where clustered embeddings are stored.
    semdedup_pruning_tables_dir (str): Path to the pruning tables directory.
    eps (float): Epsilon value for pruning.

Returns:
    cudf.DataFrame: A DataFrame of the pruned cluster data.
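The documentation above states only that pruning is based on epsilon. One common convention in semantic deduplication, assumed here and not confirmed by this page, is to drop an item when its maximum cosine similarity to another item in the cluster exceeds 1 - eps. A minimal NumPy sketch of that rule, with the hypothetical helper `prune_by_eps`:

```python
import numpy as np


def prune_by_eps(ids, max_cosine_sim, eps):
    """Hypothetical helper: keep items that are not near-duplicates."""
    ids = np.asarray(ids)
    max_cosine_sim = np.asarray(max_cosine_sim, dtype=float)

    # Assumed rule: an item is a duplicate if it is more similar than
    # 1 - eps to some other item in the same cluster.
    keep = max_cosine_sim <= 1.0 - eps
    return ids[keep]
```

Under this assumed rule, a larger `eps` lowers the similarity threshold and therefore prunes more aggressively.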

utils.semdedup_utils.write_pruned_summary_file( eps: float, emb_by_clust_dir: str, filtered_unique_ids_path: str, output_summary_file: str, logger: logging.Logger, ) → None#: Writes a summary file for the pruned data.
