utils.semdedup_utils

Module Contents

Functions

add_l2_cosine_dist_to_centroid

Computes the L2 distance from each embedding in the DataFrame to its nearest centroid. The embeddings are normalized; for the cosine distance, the centroids must be normalized as well.

get_array_from_df

get_semantic_matches_per_cluster

Get the semantic matches for a single cluster. Reads the cluster embeddings and then computes pairwise cosine similarity between them.

normalize_embeddings_col_in_df

pairwise_cosine_similarity

Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity.

pairwise_cosine_similarity_batched

Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory. The similarity matrix is computed in batches, so memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size, instead of O(N^2) for the full matrix.

prune_single_cluster

Processes data for a single cluster, applying pruning based on specified epsilon.

write_pruned_summary_file

Writes a summary file for the pruned data.

Data

API

utils.semdedup_utils.COSINE_DIST_TO_CENT_COL

'cosine_dist_to_cent'

utils.semdedup_utils.L2_DIST_TO_CENT_COL

'l2_dist_to_cent'

utils.semdedup_utils.add_l2_cosine_dist_to_centroid(
df: cudf.DataFrame,
embedding_col: str,
centroids: cupy.ndarray,
) -> cudf.DataFrame

Computes the L2 distance from each embedding in the DataFrame to its nearest centroid. The embeddings are normalized; for the cosine distance, the centroids must be normalized as well.
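The distance computation described above can be illustrated with a minimal sketch. It uses NumPy arrays in place of `cudf`/`cupy` objects, and the helper name `dist_to_nearest_centroid` is hypothetical; choosing the nearest centroid by L2 distance is an assumption made for illustration, not a statement about the library's implementation.

```python
import numpy as np


def dist_to_nearest_centroid(embeddings, centroids):
    """Hypothetical helper: L2 and cosine distance to the nearest centroid.

    Assumption: the nearest centroid is the one with the smallest L2 distance.
    """
    # Normalize the embeddings to unit length (row-wise).
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # L2 distance from every embedding to every centroid, shape (n, k).
    l2_all = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
    nearest = l2_all.argmin(axis=1)
    l2_dist = l2_all[np.arange(len(emb)), nearest]

    # Cosine distance needs unit-length centroids as well.
    cent_unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cosine_dist = 1.0 - np.sum(emb * cent_unit[nearest], axis=1)
    return l2_dist, cosine_dist
```

With an unnormalized centroid such as `[2, 0]`, the embedding `[1, 0]` has L2 distance 1 but cosine distance 0, which is why the centroids are normalized only for the cosine case.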

utils.semdedup_utils.get_array_from_df(
df: cudf.DataFrame,
embedding_col: str,
) -> cupy.ndarray

utils.semdedup_utils.get_semantic_matches_per_cluster(
cluster_id: int,
emb_by_clust_dir: str,
id_col: str,
output_dir: str,
embedding_col: str,
which_to_keep: Literal["hard", "easy", "random"],
sim_metric: Literal["cosine", "l2"],
batched_cosine_similarity: int = 1024,
) -> None

Get the semantic matches for a single cluster. Reads the cluster embeddings and then computes pairwise cosine similarity between them.

utils.semdedup_utils.normalize_embeddings_col_in_df(
df: cudf.DataFrame,
embedding_col: str,
) -> cudf.DataFrame

utils.semdedup_utils.pairwise_cosine_similarity(
cluster_reps: torch.Tensor,
device: Literal["cuda", "cpu"],
) -> tuple[cupy.ndarray, cupy.ndarray]

Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity.
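A minimal NumPy sketch of this idea follows. It assumes the rows of `cluster_reps` are already unit-normalized, so a matrix product yields cosine similarities. The function name is hypothetical, and interpreting the returned pair as "maximum similarity per item" and "index of that most similar item" is an assumption; the signature above only states that two arrays are returned.

```python
import numpy as np


def pairwise_cosine_similarity_sketch(cluster_reps):
    """Hypothetical NumPy stand-in; rows are assumed to be unit-normalized."""
    # For unit vectors, the dot product equals the cosine similarity.
    sim = cluster_reps @ cluster_reps.T

    # Zero the diagonal so an item never matches itself.
    np.fill_diagonal(sim, 0.0)

    # Assumed outputs: best similarity per item and the index of that neighbor.
    max_sim = sim.max(axis=1).astype(np.float32)
    max_idx = sim.argmax(axis=1)
    return max_sim, max_idx
```

The full matrix `sim` has N x N entries, which is what makes this variant impractical for very large clusters.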

utils.semdedup_utils.pairwise_cosine_similarity_batched(
cluster_reps: torch.Tensor,
device: Literal["cuda", "cpu"],
batch_size: int = 1024,
) -> tuple[cupy.ndarray, cupy.ndarray]

Computes pairwise cosine similarity between cluster items, then replaces the diagonal with zeros to ignore self-similarity. This function is useful for large clusters where the full pairwise similarity matrix does not fit into memory. The similarity matrix is computed in batches, so memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size, instead of O(N^2) for the full matrix.
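The batched approach can be sketched as follows, again with NumPy in place of torch/CuPy and with unit-normalized rows assumed. The function name and the meaning of the two returned arrays (maximum similarity and the index of the most similar item) are assumptions for illustration. Only a B x N block of similarities exists at any time.

```python
import numpy as np


def pairwise_cosine_similarity_batched_sketch(cluster_reps, batch_size=1024):
    """Hypothetical NumPy stand-in; rows are assumed to be unit-normalized."""
    n = cluster_reps.shape[0]
    max_sim = np.empty(n, dtype=np.float32)
    max_idx = np.empty(n, dtype=np.int64)

    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)

        # Similarities of this batch against all items: shape (B, N).
        block = cluster_reps[start:end] @ cluster_reps.T

        # Zero the self-similarity entries, which sit at column start + row.
        rows = np.arange(end - start)
        block[rows, rows + start] = 0.0

        # Reduce the block immediately so it can be discarded.
        max_sim[start:end] = block.max(axis=1)
        max_idx[start:end] = block.argmax(axis=1)

    return max_sim, max_idx
```

Because each block is reduced to per-row maxima before the next one is computed, peak memory scales with N*B rather than N^2.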

utils.semdedup_utils.prune_single_cluster(
cluster_id: int,
id_col: str,
emb_by_clust_dir: str,
semdedup_pruning_tables_dir: str,
eps: float,
) -> cudf.DataFrame

Processes data for a single cluster, applying pruning based on specified epsilon.

Args:
    cluster_id (int): The specific cluster ID to process.
    id_col (str): The name of the ID column.
    emb_by_clust_dir (str): Path to where clustered embeddings are stored.
    semdedup_pruning_tables_dir (str): Path to the pruning tables directory.
    eps (float): Epsilon value for pruning.

Returns:
    cudf.DataFrame: A DataFrame of the pruned cluster data.
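As a rough illustration of epsilon-based pruning, the sketch below keeps an item only when its highest similarity to another item in the cluster is below `1 - eps`. This threshold rule, the helper name `prune_by_eps`, and its inputs are assumptions made for illustration; the documentation above does not spell out the exact criterion.

```python
import numpy as np


def prune_by_eps(ids, max_sim, eps):
    """Hypothetical helper: drop items that are too similar to a neighbor.

    Assumption: an item is a semantic duplicate when its maximum cosine
    similarity to another cluster item is at least 1 - eps.
    """
    keep = np.asarray(max_sim) < 1.0 - eps
    return [item_id for item_id, flag in zip(ids, keep) if flag]
```

Under this assumed rule, a larger `eps` lowers the similarity threshold and prunes more aggressively.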

utils.semdedup_utils.write_pruned_summary_file(
eps: float,
emb_by_clust_dir: str,
filtered_unique_ids_path: str,
output_summary_file: str,
logger: logging.Logger,
) -> None

Writes a summary file for the pruned data.