modules.semantic_dedup.semanticclusterleveldedup
Module Contents
Classes
API
- class modules.semantic_dedup.semanticclusterleveldedup.SemanticClusterLevelDedup(
    n_clusters: int = 1000,
    emb_by_clust_dir: str = './clustering_results/embs_by_nearest_center',
    id_column: str = 'id',
    which_to_keep: str = 'hard',
    sim_metric: Literal['cosine', 'l2'] = 'cosine',
    output_dir: str = './clustering_results',
    embedding_column: str = 'embeddings',
    batched_cosine_similarity: int = 1024,
    logger: logging.Logger | str = './',
    profile_dir: str | None = None,
  )
Initialization
Initialize the SemanticClusterLevelDedup class.
Args:
    n_clusters (int): Number of clusters. Default is 1000.
    emb_by_clust_dir (str): Directory containing the embeddings grouped by cluster. Default is “./clustering_results/embs_by_nearest_center”.
    id_column (str): Name of the column used as the identifier in the dataset. Default is “id”.
    which_to_keep (str): Method for choosing which duplicates to keep. Default is “hard”.
        - hard retains edge-case or outlier items farthest from the centroid by sorting points by decreasing distance from the centroid.
        - easy retains representative items closest to the centroid by sorting points by increasing distance from the centroid.
        - random retains items randomly.
    sim_metric (“cosine” or “l2”): Similarity metric used to rank items within a cluster. Default is “cosine”. which_to_keep determines how points within each cluster are ranked, based on the similarity to the centroid defined by sim_metric.
    output_dir (str): Directory in which to save output files. Default is “./clustering_results”.
    embedding_column (str): Name of the column that stores the embeddings. Default is “embeddings”.
    batched_cosine_similarity (int): Batch size for the batched cosine similarity computation, which uses less memory. Default is 1024. When greater than 0, batching is used and memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size. When less than or equal to 0, no batching is used and memory requirements are O(N^2).
    logger (Union[logging.Logger, str]): Existing logger to log to, or a path to a log directory. Default is “./”.
    profile_dir (Optional[str]): If specified, directory in which to write the Dask profile. Default is None.
- compute_semantic_match_dfs() -> None
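The ranking controlled by which_to_keep and the batching controlled by batched_cosine_similarity can be illustrated with a small pure-Python sketch. This is not the library's implementation. The helpers `cosine`, `rank_cluster`, and `max_sim_to_earlier` are hypothetical names, and the sketch assumes that distance from the centroid is 1 minus cosine similarity and that each item is compared against the items ranked ahead of it.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_cluster(embeddings, centroid, which_to_keep="hard", seed=0):
    """Return item indices in the order implied by which_to_keep.

    Assumption: distance to the centroid is 1 - cosine similarity.
    """
    dist = [1.0 - cosine(e, centroid) for e in embeddings]
    order = list(range(len(embeddings)))
    if which_to_keep == "hard":
        order.sort(key=lambda i: dist[i], reverse=True)  # farthest first
    elif which_to_keep == "easy":
        order.sort(key=lambda i: dist[i])  # closest first
    elif which_to_keep == "random":
        random.Random(seed).shuffle(order)
    else:
        raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")
    return order


def max_sim_to_earlier(ranked, batch_size=1024):
    """For each ranked item, the highest similarity to any earlier-ranked item.

    Rows are processed in blocks, so only a (batch_size x N) block of
    similarities exists at a time. A batch_size <= 0 builds the full
    N x N block in one pass.
    """
    n = len(ranked)
    step = batch_size if batch_size > 0 else max(n, 1)
    result = []
    for start in range(0, n, step):
        rows = ranked[start:start + step]
        block = [[cosine(r, c) for c in ranked] for r in rows]
        for offset, sims in enumerate(block):
            earlier = sims[: start + offset]  # items ranked ahead of this one
            result.append(max(earlier) if earlier else 0.0)
    return result
```

With "hard", the item farthest from the centroid comes first in the ranking; with "easy", the closest item comes first. The batch size changes only how much is held in memory at once, not the resulting scores.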
- extract_dedup_data(
    eps_to_extract: float,
  ) -> DocumentDataset
Extract similar records that are within the epsilon threshold. These records can be removed from the dataset.
Args:
    eps_to_extract (float): Epsilon threshold for extracting deduplicated data.
Returns:
    DocumentDataset: Dataset containing the list of ids that can be removed.
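As a rough illustration of how an epsilon threshold can select removable records, the sketch below flags an id when its highest cosine similarity to another item in the cluster exceeds 1 - eps. The rule, the helper name `extract_ids_to_remove`, and the sample scores are assumptions made for this example. They are not taken from the library.

```python
def extract_ids_to_remove(max_sim_by_id, eps_to_extract):
    """Return the ids whose similarity score exceeds 1 - eps_to_extract.

    Hypothetical helper. max_sim_by_id maps each id to the highest cosine
    similarity between that item and any higher-ranked item in its cluster.
    """
    threshold = 1.0 - eps_to_extract
    return sorted(i for i, sim in max_sim_by_id.items() if sim > threshold)


# Made-up scores for four records in one cluster.
scores = {"a": 0.0, "b": 0.62, "c": 0.995, "d": 0.93}

tight = extract_ids_to_remove(scores, eps_to_extract=0.01)  # threshold 0.99
loose = extract_ids_to_remove(scores, eps_to_extract=0.1)   # threshold 0.9
```

Under this assumption, a larger epsilon lowers the similarity threshold, so more records are flagged for removal.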