modules.semantic_dedup.clusteringmodel

Module Contents

Classes

API

class modules.semantic_dedup.clusteringmodel.ClusteringModel(
id_column: str = 'id',
max_iter: int = 100,
n_clusters: int = 1000,
clustering_output_dir: str = './clustering_results',
embedding_column: str = 'embeddings',
random_state: int = 1234,
clustering_input_partition_size: str | None = '2gb',
logger: logging.Logger | str = './',
profile_dir: str | None = None,
keep_all_columns: bool = False,
)

Initialization

Initializes the ClusteringModel with the settings used to cluster embeddings, a preparatory step for semantic deduplication.

Args:
    id_column (str): Name of the column used as the record identifier in the dataset. Default is "id".
    max_iter (int): Maximum number of KMeans iterations. A higher limit gives the algorithm more room to converge, at the cost of runtime. Default is 100.
    n_clusters (int): Number of clusters. Default is 1000.
    clustering_output_dir (str): Location to save clustering results. Default is "./clustering_results".
    embedding_column (str): Name of the column that stores the embeddings. Default is "embeddings".
    random_state (int): KMeans random state, used for reproducibility. Default is 1234.
    clustering_input_partition_size (Optional[str]): Size of the data partitions on which KMeans is run. Default is "2gb". If None, the dataset is not repartitioned.
    logger (Union[logging.Logger, str]): Existing logger to log to, or a path to a log directory. Default is "./".
    profile_dir (Optional[str]): If specified, directory to which the Dask profile is written. Default is None.
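To make the roles of n_clusters, max_iter, and random_state concrete, here is a minimal, standard-library-only sketch of KMeans over a small list of embeddings. It is not the implementation used by ClusteringModel; the function names kmeans_sketch and squared_distance are hypothetical and exist only for illustration. The sketch assumes plain Python lists as embeddings and initializes centroids by sampling input points.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sketch(
    embeddings: List[Vector],
    n_clusters: int,
    max_iter: int = 100,
    random_state: int = 1234,
) -> Tuple[List[int], List[List[float]]]:
    """Illustrative KMeans (hypothetical helper, not the library's code)."""
    # random_state seeds the centroid initialization, so repeated runs
    # with the same seed and data produce the same assignment.
    rng = random.Random(random_state)
    centroids = [list(v) for v in rng.sample(embeddings, n_clusters)]
    labels = [-1] * len(embeddings)

    # max_iter is an upper bound; the loop exits early once the
    # assignment stops changing.
    for _ in range(max_iter):
        new_labels = [
            min(range(n_clusters), key=lambda k: squared_distance(v, centroids[k]))
            for v in embeddings
        ]
        if new_labels == labels:
            break
        labels = new_labels

        # Move each centroid to the mean of its current members.
        for k in range(n_clusters):
            members = [v for v, lab in zip(embeddings, labels) if lab == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]

    return labels, centroids


# Four toy "embeddings" forming two well-separated groups.
points = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
labels, centroids = kmeans_sketch(points, n_clusters=2, max_iter=100, random_state=1234)
```

In this sketch, max_iter only caps the number of refinement passes: once the assignment is stable, extra iterations change nothing. Which integer label each group receives depends on the seed, so downstream code should rely on group membership rather than on specific label values.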