modules.config#

Module Contents#

Classes#

BaseConfig

FuzzyDuplicatesConfig

Configuration for MinHash-based fuzzy duplicate detection.

SemDedupConfig

Configuration for Semantic Deduplication.

API#

class modules.config.BaseConfig#
classmethod from_yaml(file_path: str) -> modules.config.BaseConfig#
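`from_yaml` populates a config subclass from a YAML file whose keys match the config's fields. A minimal sketch of the pattern, using a hypothetical `ExampleConfig` and a simplified flat `key: value` parser standing in for a YAML library (a real implementation would use `yaml.safe_load`):

```python
from dataclasses import dataclass, fields

@dataclass
class ExampleConfig:
    # Hypothetical fields standing in for a real config subclass.
    id_field: str = "id"
    num_buckets: int = 20

    @classmethod
    def from_flat_file(cls, file_path: str) -> "ExampleConfig":
        # Simplified stand-in for from_yaml: parse flat "key: value" lines.
        kwargs = {}
        field_types = {f.name: f.type for f in fields(cls)}
        with open(file_path) as f:
            for line in f:
                key, _, value = line.partition(":")
                key, value = key.strip(), value.strip()
                if key in field_types:
                    # Cast to the declared field type (str or int here).
                    kwargs[key] = int(value) if field_types[key] in (int, "int") else value
        return cls(**kwargs)
```

Unknown keys are ignored in this sketch; fields absent from the file keep their dataclass defaults.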
class modules.config.FuzzyDuplicatesConfig#

Bases: modules.config.BaseConfig

Configuration for MinHash-based fuzzy duplicate detection.

Parameters:

seed: Seed for minhash permutations.
char_ngrams: Size of character n-gram shingles used in minhash computation.
num_buckets: Number of bands or buckets to use during locality-sensitive hashing.
hashes_per_bucket: Number of hashes per bucket/band.
use_64_bit_hash: Whether to use a 32-bit or 64-bit hash function for minhashing.
buckets_per_shuffle: Number of bands/buckets to shuffle concurrently. Larger values
    process multiple bands per batch but may lead to memory pressure and related errors.
id_field: Column in the dataset denoting the document ID.
text_field: Column in the dataset denoting the document content.
perform_removal: Whether calling the module should remove the duplicates from the
    original dataset or return the list of IDs denoting duplicates.
profile_dir (str): If specified, directory to write the Dask profile. Default is None.
cache_dir (str): Location to store deduplication intermediates such as minhashes,
    buckets, etc. Default is None.
false_positive_check (bool): Whether to run a check for false positives within buckets.
    Note: this is a computationally expensive step.
num_anchors (int): Number of documents per bucket to use as references when computing
    Jaccard pairs within that bucket to identify false positives.
jaccard_threshold (float): The Jaccard similarity threshold above which a document is
    considered a near-duplicate during false-positive evaluation.
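With the defaults shown below (`num_buckets=20` bands of `hashes_per_bucket=13` hashes each), two documents whose MinHash Jaccard similarity is `s` share at least one LSH bucket with probability `1 - (1 - s^13)^20`, the standard LSH banding estimate. A quick sketch of how these two knobs shape the matching threshold:

```python
def collision_probability(s: float, num_buckets: int = 20, hashes_per_bucket: int = 13) -> float:
    """Probability that two documents with Jaccard similarity s
    land in at least one common LSH bucket (standard banding estimate)."""
    return 1.0 - (1.0 - s ** hashes_per_bucket) ** num_buckets

# The S-curve: low-similarity pairs rarely collide, high-similarity pairs almost always do.
for s in (0.5, 0.7, 0.8, 0.9):
    print(f"s={s}: p={collision_probability(s):.4f}")
```

Increasing `num_buckets` shifts the curve left (more candidate pairs, more recall); increasing `hashes_per_bucket` shifts it right (stricter matching).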

bucket_mapping_blocksize: int | None#

None

bucket_parts_per_worker: int | None#

None

buckets_per_shuffle: int#

1

cache_dir: str#

None

char_ngrams: int#

24

false_positive_check: bool#

False

hashes_per_bucket: int#

13

id_field: str#

'id'

jaccard_threshold: float | None#

None

num_anchors: int | None#

None

num_buckets: int#

20

parts_per_worker: int | None#

None

perform_removal: bool#

False

profile_dir: str | None#

None

seed: int#

42

text_field: str#

'text'

use_64_bit_hash: bool#

False

class modules.config.SemDedupConfig#

Bases: modules.config.BaseConfig

Configuration for Semantic Deduplication.

Attributes:

cache_dir (str): Directory to store cache.
profile_dir (Optional[str]): If specified, directory to write the Dask profile.
    Default is None.
num_files (int): Number of files. Default is -1, meaning all files.

embedding_model_name_or_path (str): Model name or path for embeddings.
    Default is "sentence-transformers/all-MiniLM-L6-v2".
embedding_batch_size (int): Initial batch size for processing embeddings.
    Default is 128.
embeddings_save_loc (str): Location to save embeddings.
    Default is "embeddings".
embedding_max_mem_gb (int): Maximum memory usage in GB for the embedding process.
    If None, it defaults to the available GPU memory minus 4 GB.
embedding_pooling_strategy (str): Strategy for pooling embeddings, either
    "mean_pooling" or "last_token". Default is "mean_pooling".
embedding_column (str): The column name that stores the embeddings.
    Default is "embeddings".
write_embeddings_to_disk (bool): If True, saves the embeddings to disk.
    We recommend setting this to False when you have a delayed pipeline,
    but note that setting it to False can increase memory overhead. Default is True.
write_to_filename (bool): If True, saves the embeddings to the same filename as input files.
    Default False.

max_iter (int): Maximum number of clustering iterations. More iterations can improve
    the clustering at the cost of runtime. Default is 100.
n_clusters (int): Number of clusters. Default is 1000.
clustering_save_loc (str): Location to save clustering results.
    Default is "clustering_results".
random_state (int): KMeans random state used for reproducibility. Default is 1234.
sim_metric ("cosine" or "l2"): Similarity metric used to rank points within each cluster.
    Default is "cosine". `which_to_keep` determines how points within each cluster are
    ranked, based on their similarity to the centroid as defined by `sim_metric`.
which_to_keep (str): Method to determine which duplicates to keep. Default is "hard".
    - hard retains edge-case or outlier items farthest from the centroid by sorting points by decreasing distance from the centroid.
    - easy retains representative items closest to the centroid by sorting points by increasing distance from the centroid.
    - random retains items randomly.
batched_cosine_similarity (Union[bool, int]): Batch size for batched cosine similarity,
    which reduces memory usage. Default is 1024. When False or 0, no batching is used and
    memory requirements are O(N^2), where N is the number of items in the cluster.
    When True, the batch size is set to 1024; memory requirements are then O(N*B),
    where B is the batch size.
clustering_input_partition_size (Optional[str]): The size of data partition with which to run KMeans.
    Default is "2gb". If None, then the dataset is not repartitioned.

eps_to_extract (float): Epsilon value to extract deduplicated data.
    Default is 0.01.
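The interaction between `sim_metric` and `which_to_keep` described above can be sketched as follows; the point names, 2-D vectors, and the `rank` helper are illustrative, not part of the API:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

# Illustrative cluster: 2-D embeddings and their centroid.
points = {"p1": (1.0, 0.1), "p2": (0.9, 0.9), "p3": (0.1, 1.0)}
centroid = (0.66, 0.66)

def rank(which_to_keep: str):
    dist = {k: cosine_distance(v, centroid) for k, v in points.items()}
    # "hard" keeps outliers: sort by decreasing distance from the centroid.
    # "easy" keeps representatives: sort by increasing distance.
    return sorted(dist, key=dist.get, reverse=(which_to_keep == "hard"))

print(rank("hard"))  # outliers first
print(rank("easy"))  # points nearest the centroid first
```

Here `p2` points in the same direction as the centroid, so "easy" ranks it first and "hard" ranks it last.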
batched_cosine_similarity: bool | int#

1024

cache_dir: str#

None

clustering_input_partition_size: str | None#

'2gb'

clustering_save_loc: str#

'clustering_results'

embedding_batch_size: int#

128

embedding_column: str#

'embeddings'

embedding_max_mem_gb: int | None#

None

embedding_model_name_or_path: str#

'sentence-transformers/all-MiniLM-L6-v2'

embedding_pooling_strategy: str#

'mean_pooling'

embeddings_save_loc: str#

'embeddings'

eps_to_extract: float#

0.01

max_iter: int#

100

n_clusters: int#

1000

num_files: int#

None

profile_dir: str | None#

None

random_state: int#

1234

sim_metric: Literal[cosine, l2]#

'cosine'

which_to_keep: Literal[hard, easy, random]#

'hard'

write_embeddings_to_disk: bool#

True

write_to_filename: bool#

False