`modules.semantic_dedup.semanticclusterleveldedup`#

Module Contents#

Classes#

SemanticClusterLevelDedup

API#

class modules.semantic_dedup.semanticclusterleveldedup.SemanticClusterLevelDedup(n_clusters: int = 1000, emb_by_clust_dir: str = './clustering_results/embs_by_nearest_center', id_column: str = 'id', which_to_keep: str = 'hard', sim_metric: Literal['cosine', 'l2'] = 'cosine', output_dir: str = './clustering_results', embedding_column: str = 'embeddings', batched_cosine_similarity: int = 1024, logger: logging.Logger | str = './', profile_dir: str | None = None)#

Initialization

Initialize the SemanticClusterLevelDedup class.

Args:
    n_clusters (int): Number of clusters. Default is 1000.
    emb_by_clust_dir (str): Directory containing embeddings by cluster. Default is “./clustering_results/embs_by_nearest_center”.
    id_column (str): Column name used as the identifier in the dataset. Default is “id”.
    which_to_keep (str): Method to determine which duplicates to keep. Default is “hard”.
        - hard retains edge-case or outlier items farthest from the centroid by sorting points by decreasing distance from the centroid.
        - easy retains representative items closest to the centroid by sorting points by increasing distance from the centroid.
        - random retains items randomly.
    sim_metric (“cosine” or “l2”): Similarity metric used to rank points within a cluster. Default is “cosine”. which_to_keep determines how points within each cluster are ranked, based on their similarity to the centroid as defined by sim_metric.
    output_dir (str): Directory to save output files. Default is “./clustering_results”.
    embedding_column (str): The column name that stores the embeddings. Default is “embeddings”.
    batched_cosine_similarity (int): Batch size for the batched cosine similarity computation, which uses less memory. Default is 1024. When greater than 0, batching is used and memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size. When less than or equal to 0, no batching is used and memory requirements are O(N^2).
    logger (Union[logging.Logger, str]): Existing logger to log to, or a path to a log directory. Default is “./”.
    profile_dir (Optional[str]): If specified, directory to write the Dask profile to. Default is None.
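The ranking that which_to_keep and sim_metric describe can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the function name `rank_within_cluster`, its `seed` parameter, and the use of `1 - cosine similarity` as the cosine distance are assumptions made for illustration only.

```python
import numpy as np


def rank_within_cluster(embeddings, centroid, which_to_keep="hard",
                        sim_metric="cosine", seed=0):
    """Hypothetical helper: return row indices in keep-priority order."""
    embeddings = np.asarray(embeddings, dtype=float)
    centroid = np.asarray(centroid, dtype=float)

    if sim_metric == "cosine":
        # Assumed convention: cosine distance = 1 - cosine similarity.
        emb_unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        cen_unit = centroid / np.linalg.norm(centroid)
        dist = 1.0 - emb_unit @ cen_unit
    elif sim_metric == "l2":
        # Euclidean distance from each point to the centroid.
        dist = np.linalg.norm(embeddings - centroid, axis=1)
    else:
        raise ValueError(f"unknown sim_metric: {sim_metric!r}")

    if which_to_keep == "hard":
        # Farthest from the centroid first (decreasing distance).
        return np.argsort(-dist, kind="stable")
    if which_to_keep == "easy":
        # Closest to the centroid first (increasing distance).
        return np.argsort(dist, kind="stable")
    if which_to_keep == "random":
        # Random order; the seed only makes the sketch reproducible.
        return np.random.default_rng(seed).permutation(len(embeddings))
    raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")


points = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
center = [1.0, 0.0]
hard_order = rank_within_cluster(points, center, "hard")
easy_order = rank_within_cluster(points, center, "easy")
```

With these three points, the "hard" order starts with the point farthest from the centroid and the "easy" order is its reverse, for both metrics.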

compute_semantic_match_dfs() → None#

extract_dedup_data(eps_to_extract: float) → nemo_curator.datasets.DocumentDataset#

Extract similar records that are within the epsilon threshold. These records can be removed from the dataset.

Args:
    eps_to_extract (float): Epsilon threshold for extracting deduplicated data.

Returns:
    DocumentDataset: Dataset containing the list of ids that can be removed.
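As a rough illustration of epsilon-based extraction combined with the batching controlled by batched_cosine_similarity, the sketch below flags an item for removal when its cosine similarity to any earlier-ranked item in the same cluster exceeds `1 - eps`. Both helper names and the `1 - eps` threshold convention are assumptions made for this example; the page above does not specify how the library applies the threshold.

```python
import numpy as np


def max_sim_to_earlier(embeddings, batch_size=1024):
    """Hypothetical helper: for each row i, the maximum cosine similarity
    to any row j < i, with rows assumed to be in keep-priority order."""
    embeddings = np.asarray(embeddings, dtype=float)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = unit.shape[0]
    max_sim = np.zeros(n)
    # batch_size <= 0 means no batching: one (N, N) block, O(N^2) memory.
    step = batch_size if batch_size > 0 else max(n, 1)
    cols = np.arange(n)[None, :]
    for start in range(0, n, step):
        end = min(start + step, n)
        # Only a (B, N) block is held in memory at a time: O(N*B).
        sims = unit[start:end] @ unit.T
        rows = np.arange(start, end)[:, None]
        # Keep only similarities to earlier-ranked rows.
        sims = np.where(cols < rows, sims, -np.inf)
        max_sim[start:end] = sims.max(axis=1)
    # The first row has no earlier rows; treat its score as 0.
    max_sim[np.isneginf(max_sim)] = 0.0
    return max_sim


def ids_to_remove(ids, embeddings, eps, batch_size=1024):
    """Hypothetical helper: ids whose best match exceeds 1 - eps."""
    scores = max_sim_to_earlier(embeddings, batch_size)
    return [i for i, s in zip(ids, scores) if s > 1.0 - eps]


ranked = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
removed = ids_to_remove(["a", "b", "c"], ranked, eps=0.01, batch_size=2)
```

Here "b" is nearly identical to "a" and is flagged, while "c" is orthogonal to both and is kept. The batched and unbatched paths give the same result; only peak memory differs.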
