modules.config
Module Contents#
Classes#
FuzzyDuplicatesConfig: Configuration for MinHash-based fuzzy duplicates detection.

SemDedupConfig: Configuration for Semantic Deduplication.
API#
- class modules.config.BaseConfig#
- classmethod from_yaml(file_path: str) → modules.config.BaseConfig#
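The implementation of `from_yaml` is not shown here, but a plausible sketch of what it does (parse a YAML mapping and pass it as keyword arguments to the config dataclass) looks like the following. A plain dict stands in for the parsed YAML file so the sketch stays dependency-free, and `DemoConfig` is a hypothetical subclass used only for illustration:

```python
from dataclasses import dataclass, fields


@dataclass
class BaseConfig:
    @classmethod
    def from_mapping(cls, data: dict) -> "BaseConfig":
        # Keep only keys that correspond to declared dataclass fields;
        # the real from_yaml would first parse this mapping from the
        # file at file_path before constructing the config.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in names})


@dataclass
class DemoConfig(BaseConfig):  # hypothetical subclass for illustration
    seed: int = 42
    text_field: str = "text"


cfg = DemoConfig.from_mapping({"seed": 7, "ignored_key": 1})
# cfg.seed == 7, cfg.text_field == "text" (unspecified fields keep defaults)
```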
- class modules.config.FuzzyDuplicatesConfig#
Bases:
modules.config.BaseConfig
Configuration for MinHash-based fuzzy duplicates detection.

Parameters:

seed: Seed for MinHash permutations.
char_ngrams: Size of the character n-gram shingles used in MinHash computation.
num_buckets: Number of bands (buckets) to use during locality-sensitive hashing.
hashes_per_bucket: Number of hashes per bucket/band.
use_64_bit_hash: Whether to use a 32-bit or 64-bit hash function for minhashing.
buckets_per_shuffle: Number of bands/buckets to shuffle concurrently. Larger values process bigger batches by shuffling multiple bands at once, but may lead to memory pressure and related errors.
id_field: Column in the dataset denoting the document ID.
text_field: Column in the dataset denoting the document content.
perform_removal: Whether calling the module should remove the duplicates from the original dataset, or instead return the list of IDs denoting duplicates.
profile_dir: str, default None. If specified, the directory to write the Dask profile to.
cache_dir: str, default None. Location to store deduplication intermediates such as minhashes, buckets, etc.
false_positive_check: bool. Whether to run a check for false positives within buckets. Note: this is a computationally expensive step.
num_anchors: int. Number of documents per bucket to use as references when computing Jaccard pairs within that bucket to identify false positives.
jaccard_threshold: float. The Jaccard similarity threshold at or above which a document is considered a near duplicate during false-positive evaluation.
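To make the relationship between `num_buckets` and `hashes_per_bucket` concrete, here is a minimal sketch of LSH banding (an illustration of the general technique, not this library's implementation): the MinHash signature is split into `num_buckets` bands of `hashes_per_bucket` values each, and documents whose signatures agree on any whole band produce the same key for that band and land in the same candidate bucket:

```python
import hashlib


def band_keys(signature, num_buckets, hashes_per_bucket):
    """Split a MinHash signature into bands. Two documents whose
    signatures agree on an entire band get the same key for that band
    and are grouped as near-duplicate candidates."""
    assert len(signature) == num_buckets * hashes_per_bucket
    keys = []
    for b in range(num_buckets):
        band = tuple(signature[b * hashes_per_bucket:(b + 1) * hashes_per_bucket])
        keys.append(hashlib.md5(repr((b, band)).encode()).hexdigest())
    return keys


# Signatures that differ only in the last value still collide in every
# band except the last one.
a = band_keys([1, 2, 3, 4, 5, 6], num_buckets=3, hashes_per_bucket=2)
b = band_keys([1, 2, 3, 4, 5, 7], num_buckets=3, hashes_per_bucket=2)
```

With the defaults documented below (`num_buckets=20`, `hashes_per_bucket=13`), the signature would contain 260 MinHash values.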
- bucket_mapping_blocksize: int | None#
None
- bucket_parts_per_worker: int | None#
None
- buckets_per_shuffle: int#
1
- cache_dir: str#
None
- char_ngrams: int#
24
- false_positive_check: bool#
False
- hashes_per_bucket: int#
13
- id_field: str#
'id'
- jaccard_threshold: float | None#
None
- num_anchors: int | None#
None
- num_buckets: int#
20
- parts_per_worker: int | None#
None
- perform_removal: bool#
False
- profile_dir: str | None#
None
- seed: int#
42
- text_field: str#
'text'
- use_64_bit_hash: bool#
False
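Putting the attributes above together, a YAML file passed to `FuzzyDuplicatesConfig.from_yaml` might look like the following. The path is a placeholder, and apart from `cache_dir` the values simply restate the defaults listed above:

```yaml
cache_dir: /path/to/dedup_cache   # placeholder; stores minhash/bucket intermediates
id_field: id
text_field: text
seed: 42
char_ngrams: 24
num_buckets: 20
hashes_per_bucket: 13
use_64_bit_hash: false
buckets_per_shuffle: 1
false_positive_check: false
perform_removal: false
```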
- class modules.config.SemDedupConfig#
Bases:
modules.config.BaseConfig
Configuration for Semantic Deduplication.
Attributes:

- cache_dir (str): Directory to store the cache.
- profile_dir (Optional[str]): If specified, directory to write the Dask profile. Default is None.
- num_files (int): Number of files. Default is -1, meaning all files.
- embedding_model_name_or_path (str): Model name or path for embeddings. Default is "sentence-transformers/all-MiniLM-L6-v2".
- embedding_batch_size (int): Initial batch size for processing embeddings. Default is 128.
- embeddings_save_loc (str): Location to save embeddings. Default is "embeddings".
- embedding_max_mem_gb (int): Maximum memory usage in GB for the embedding process. If None, it defaults to the available GPU memory minus 4 GB.
- embedding_pooling_strategy (str): Strategy for pooling embeddings, either "mean_pooling" or "last_token". Default is "mean_pooling".
- embedding_column (str): The column name that stores the embeddings. Default is "embeddings".
- write_embeddings_to_disk (bool): If True, saves the embeddings to disk. We recommend setting this to False when you have a delayed pipeline; setting it to False can lead to more memory overhead. Default is True.
- write_to_filename (bool): If True, saves the embeddings to the same filename as the input files. Default is False.
- max_iter (int): Maximum iterations for clustering. More iterations yield better clustering. Default is 100.
- n_clusters (int): Number of clusters. Default is 1000.
- clustering_save_loc (str): Location to save clustering results. Default is "clustering_results".
- random_state (int): KMeans random state, used for reproducibility. Default is 1234.
- sim_metric ("cosine" or "l2"): Similarity metric used to rank items within a cluster. Default is "cosine". Together with `which_to_keep`, this determines how points within each cluster are ranked relative to the centroid.
- which_to_keep (str): Method to determine which duplicates to keep. Default is "hard".
  - "hard" retains edge-case or outlier items farthest from the centroid, sorting points by decreasing distance from the centroid.
  - "easy" retains representative items closest to the centroid, sorting points by increasing distance from the centroid.
  - "random" retains items randomly.
- batched_cosine_similarity (Union[bool, int]): Whether to use batched cosine similarity (lower memory usage). Default is 1024. When False or 0, no batching is used and memory requirements are O(N^2), where N is the number of items in the cluster. When True, the batch size is set to 1024 and memory requirements are O(N*B), where B is the batch size.
- clustering_input_partition_size (Optional[str]): The size of the data partitions with which to run KMeans. Default is "2gb". If None, the dataset is not repartitioned.
- eps_to_extract (float): Epsilon value used to extract deduplicated data. Default is 0.01.
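As a concrete illustration of the batching trade-off behind `batched_cosine_similarity`, the sketch below (pure Python, not the library's implementation) computes each item's maximum cosine similarity to the earlier items in its cluster while only materializing one batch of comparisons at a time, rather than the full O(N^2) similarity matrix:

```python
import math


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def max_sim_to_earlier(embeddings, batch_size=1024):
    """For each embedding, the max cosine similarity to any *earlier*
    embedding, scanning earlier items batch-by-batch so only one batch
    of pairwise comparisons exists at a time."""
    result = [None]  # the first item has no earlier neighbor
    for i in range(1, len(embeddings)):
        best = float("-inf")
        for start in range(0, i, batch_size):
            batch = embeddings[start:start + batch_size]
            best = max(best, max(_cosine(embeddings[i], e) for e in batch))
        result.append(best)
    return result


sims = max_sim_to_earlier([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], batch_size=1)
# sims[1] is 1.0 (duplicate of item 0); sims[2] is 0.0 (orthogonal)
```

Items whose maximum similarity exceeds a threshold derived from `eps_to_extract` would then be treated as semantic duplicates.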
- batched_cosine_similarity: bool | int#
1024
- cache_dir: str#
None
- clustering_input_partition_size: str#
'2gb'
- clustering_save_loc: str#
'clustering_results'
- embedding_batch_size: int#
128
- embedding_column: str#
'embeddings'
- embedding_max_mem_gb: int | None#
None
- embedding_model_name_or_path: str#
'sentence-transformers/all-MiniLM-L6-v2'
- embedding_pooling_strategy: str#
'mean_pooling'
- embeddings_save_loc: str#
'embeddings'
- eps_to_extract: float#
0.01
- max_iter: int#
100
- n_clusters: int#
1000
- num_files: int#
None
- profile_dir: str | None#
None
- random_state: int#
1234
- sim_metric: Literal[cosine, l2]#
'cosine'
- which_to_keep: Literal[hard, easy, random]#
'hard'
- write_embeddings_to_disk: bool#
True
- write_to_filename: bool#
False
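For reference, an illustrative YAML file for `SemDedupConfig.from_yaml` might look like this. The path is a placeholder, and the remaining values restate the defaults documented above:

```yaml
cache_dir: /path/to/semdedup_cache   # placeholder
embedding_model_name_or_path: sentence-transformers/all-MiniLM-L6-v2
embedding_batch_size: 128
embedding_pooling_strategy: mean_pooling
write_embeddings_to_disk: true
n_clusters: 1000
max_iter: 100
random_state: 1234
sim_metric: cosine
which_to_keep: hard
eps_to_extract: 0.01
```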