modules.semantic_dedup.clusteringmodel
Module Contents
Classes
API
class modules.semantic_dedup.clusteringmodel.ClusteringModel(
    id_column: str = 'id',
    max_iter: int = 100,
    n_clusters: int = 1000,
    clustering_output_dir: str = './clustering_results',
    embedding_column: str = 'embeddings',
    random_state: int = 1234,
    clustering_input_partition_size: str | None = '2gb',
    logger: logging.Logger | str = './',
    profile_dir: str | None = None,
    keep_all_columns: bool = False,
)
Initialization
Initializes the ClusteringModel with the provided settings. The model performs semantic clustering as a step in semantic deduplication.
Args:
- id_column (str): Column name used as the identifier in the dataset. Default is "id".
- max_iter (int): Maximum number of iterations for clustering. A higher limit gives KMeans more room to converge, at the cost of longer runtime. Default is 100.
- n_clusters (int): Number of clusters. Default is 1000.
- clustering_output_dir (str): Location to save clustering results. Default is "./clustering_results".
- embedding_column (str): Name of the column that stores the embeddings. Default is "embeddings".
- random_state (int): KMeans random state, used for reproducibility. Default is 1234.
- clustering_input_partition_size (Optional[str]): Size of the data partitions on which to run KMeans. Default is "2gb". If None, the dataset is not repartitioned.
- logger (Union[logging.Logger, str]): Existing logger to log to, or a path to a log directory. Default is "./".
- profile_dir (Optional[str]): If specified, directory to write the Dask profile to. Default is None.
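To make the roles of n_clusters, max_iter, and random_state concrete, the following is a minimal, standard-library-only sketch of KMeans (Lloyd's algorithm). It is not the ClusteringModel implementation, which this page does not show; the `kmeans` function and the sample `embeddings` are hypothetical and exist only for illustration.

```python
import random


def kmeans(points, n_clusters, max_iter=100, random_state=1234):
    """Toy Lloyd's-algorithm KMeans over lists of floats (illustrative only)."""
    rng = random.Random(random_state)  # seeded RNG makes runs reproducible
    # Pick n_clusters distinct input points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    labels = [None] * len(points)
    for _ in range(max_iter):  # max_iter is a hard cap on iterations
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(n_clusters),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:  # converged before reaching max_iter
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D "embeddings"; real embeddings are much higher-dimensional.
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
labels, centroids = kmeans(embeddings, n_clusters=2, max_iter=100, random_state=1234)
```

Calling `kmeans` again with the same random_state yields the same labels, which is the reproducibility the random_state argument is meant to provide. A different seed may pick different starting centroids and therefore converge to a different clustering.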