modules.semantic_dedup.semanticclusterleveldedup#

Module Contents#

Classes#

API#

class modules.semantic_dedup.semanticclusterleveldedup.SemanticClusterLevelDedup(
n_clusters: int = 1000,
emb_by_clust_dir: str = './clustering_results/embs_by_nearest_center',
id_column: str = 'id',
which_to_keep: str = 'hard',
sim_metric: Literal['cosine', 'l2'] = 'cosine',
output_dir: str = './clustering_results',
embedding_column: str = 'embeddings',
batched_cosine_similarity: int = 1024,
logger: logging.Logger | str = './',
profile_dir: str | None = None,
)#

Initialization

Initialize the SemanticClusterLevelDedup class.

Args:

n_clusters (int): Number of clusters. Default is 1000.

emb_by_clust_dir (str): Directory containing embeddings by cluster. Default is “./clustering_results/embs_by_nearest_center”.

id_column (str): Column name used as the identifier in the dataset. Default is “id”.

which_to_keep (str): Method to determine which duplicates to keep. Default is “hard”.
- hard retains edge-case or outlier items farthest from the centroid by sorting points by decreasing distance from the centroid.
- easy retains representative items closest to the centroid by sorting points by increasing distance from the centroid.
- random retains items randomly.

sim_metric (“cosine” or “l2”): Similarity metric used to rank items within a cluster. Default is “cosine”. which_to_keep determines how points within each cluster are ranked, based on the similarity to the centroid defined by sim_metric.

output_dir (str): Directory to save output files. Default is “./clustering_results”.

embedding_column (str): The column name that stores the embeddings. Default is “embeddings”.

batched_cosine_similarity (int): Batch size used when computing cosine similarity; batching reduces memory usage. Default is 1024. When greater than 0, batching is used and memory requirements are O(N*B), where N is the number of items in the cluster and B is the batch size. When less than or equal to 0, no batching is used and memory requirements are O(N^2).

logger (Union[logging.Logger, str]): Existing logger to log to, or a path to a log directory. Default is “./”.

profile_dir (Optional[str]): If specified, directory to write the Dask profile to. Default is None.
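
The interaction between which_to_keep and sim_metric can be illustrated with a small pure-Python sketch. This is not the library's implementation: the function name rank_cluster, its seed parameter, and the choice of 1 minus cosine similarity as the cosine "distance" are assumptions made for illustration only.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_cluster(ids, embeddings, centroid,
                 which_to_keep="hard", sim_metric="cosine", seed=0):
    """Hypothetical helper: order the ids of one cluster for deduplication.

    Items earlier in the returned list are the ones preferred for retention.
    """
    if sim_metric == "cosine":
        # Assumption: treat (1 - cosine similarity) as distance to the centroid.
        dist = [1.0 - cosine_similarity(e, centroid) for e in embeddings]
    elif sim_metric == "l2":
        dist = [l2_distance(e, centroid) for e in embeddings]
    else:
        raise ValueError(f"unknown sim_metric: {sim_metric!r}")

    order = list(range(len(ids)))
    if which_to_keep == "hard":
        # Farthest from the centroid first (outliers are kept).
        order.sort(key=lambda i: dist[i], reverse=True)
    elif which_to_keep == "easy":
        # Closest to the centroid first (representative items are kept).
        order.sort(key=lambda i: dist[i])
    elif which_to_keep == "random":
        random.Random(seed).shuffle(order)
    else:
        raise ValueError(f"unknown which_to_keep: {which_to_keep!r}")
    return [ids[i] for i in order]


centroid = [1.0, 0.0]
ids = ["a", "b", "c"]
embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

print(rank_cluster(ids, embeddings, centroid, which_to_keep="hard"))
print(rank_cluster(ids, embeddings, centroid, which_to_keep="easy"))
```

In this toy cluster, “a” coincides with the centroid and “c” is orthogonal to it, so hard places “c” first and easy places “a” first under either metric.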

compute_semantic_match_dfs() → None#

extract_dedup_data(
eps_to_extract: float,
) → nemo_curator.datasets.DocumentDataset#

Extract similar records that are within the epsilon threshold. These records can be removed from the dataset.

Args: eps_to_extract (float): Epsilon threshold for extracting deduplicated data.

Returns: DocumentDataset: Dataset containing the list of ids that can be removed.
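
To make the epsilon threshold and the batching trade-off more concrete, here is a minimal pure-Python sketch. It assumes the SemDeDup-style rule that an item is removable when its maximum cosine similarity to any earlier-ranked item in the same cluster exceeds 1 - eps; the page above does not state the exact rule, so treat that as an assumption. The helper names max_similarity_to_earlier and ids_to_remove are hypothetical and are not part of the documented API.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_similarity_to_earlier(embeddings, batch_size=1024):
    """For each item i, the max cosine similarity to items ranked before it.

    Rows are processed in batches of `batch_size`, so only a
    batch_size x N block of similarities is held at once (O(N*B)).
    A batch_size <= 0 processes all rows in one N x N block (O(N^2)).
    The first item has no earlier items and keeps the floor value -1.0.
    """
    unit = [normalize(e) for e in embeddings]
    n = len(unit)
    max_sim = [-1.0] * n
    step = batch_size if batch_size > 0 else max(n, 1)
    for start in range(0, n, step):
        rows = range(start, min(start + step, n))
        # One block of similarities: len(rows) x n values.
        block = [[sum(x * y for x, y in zip(unit[i], unit[j])) for j in range(n)]
                 for i in rows]
        for offset, i in enumerate(rows):
            for j in range(i):  # only items ranked earlier than i
                max_sim[i] = max(max_sim[i], block[offset][j])
    return max_sim


def ids_to_remove(ids, embeddings, eps, batch_size=1024):
    """Assumed rule: flag an item when its max similarity exceeds 1 - eps."""
    max_sim = max_similarity_to_earlier(embeddings, batch_size)
    return [item_id for item_id, s in zip(ids, max_sim) if s > 1.0 - eps]


# Items are assumed to be already ranked within their cluster.
ids = ["a", "b", "c", "d"]
embeddings = [[1.0, 0.0], [0.999, 0.04], [0.0, 1.0], [0.02, 1.0]]

print(ids_to_remove(ids, embeddings, eps=0.01, batch_size=2))
```

Under this assumed rule, a larger eps flags more records as removable, and the batch size changes only peak memory, not the result.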