modules.fuzzy_dedup.bucketstoedges

Module Contents

Classes

BucketsToEdges

Maps buckets generated by LSH into an edgelist that Connected Components can process further to find duplicate documents.

API

class modules.fuzzy_dedup.bucketstoedges.BucketsToEdges(
    cache_dir: str | None = None,
    id_fields: list | str = 'id',
    str_id_name: str = 'id',
    bucket_field: str = '_bucket_id',
    logger: logging.LoggerAdapter | str = './',
    profile_dir: str | None = None,
)

Maps buckets generated by LSH into an edgelist that Connected Components can process further to find duplicate documents.

Initialization

Parameters

cache_dir: str or None
    If specified, the edgelist is computed and written to a file.
id_fields: list or str
    ID field(s) of the documents in buckets_df.
str_id_name: str
    Ignored if there is a single ID field. Multiple ID fields are combined into a single ID field with the given name.
bucket_field: str
    Column denoting the bucket ID.
num_buckets:
    Number of bands/buckets to create from the minhash signature. Hashes_per_signature = num_hashes / num_buckets. This entry appears in the docstring, but the constructor signature above has no num_buckets argument.
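
The str_id_name behaviour described above can be illustrated with a small sketch. It uses pandas as a stand-in for cudf, and the helper name combine_id_fields, the "-" separator, and the example column names are assumptions made for illustration; the library may combine fields differently.

```python
import pandas as pd


def combine_id_fields(df, id_fields, str_id_name="id", sep="-"):
    """Hypothetical helper: merge several id columns into one string id column."""
    if isinstance(id_fields, str):
        id_fields = [id_fields]
    if len(id_fields) == 1:
        # A single id field is used as-is, so str_id_name is ignored.
        return df
    out = df.copy()
    # Cast every id column to string and join the values row by row.
    out[str_id_name] = out[id_fields].astype(str).agg(sep.join, axis=1)
    return out


docs = pd.DataFrame(
    {"dataset_id": [0, 0, 1], "doc_id": [7, 8, 7], "_bucket_id": [3, 3, 5]}
)
combined = combine_id_fields(docs, ["dataset_id", "doc_id"])
print(combined["id"].tolist())
```

With a single id field the frame is returned unchanged, which mirrors the note that str_id_name is ignored in that case.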

buckets_to_edges(buckets_df: cudf.DataFrame) → cudf.DataFrame
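
To make the bucket-to-edgelist mapping concrete, here is a minimal sketch of one way it could be done. It uses pandas instead of cudf so it runs without a GPU, and it is not the library's implementation: the function name buckets_to_edges_sketch, the output column names id_x and id_y, and the choice to link only consecutive documents within a bucket are all assumptions.

```python
import pandas as pd


def buckets_to_edges_sketch(buckets_df, id_field="id", bucket_field="_bucket_id"):
    """Illustrative only: turn (id, bucket) rows into a deduplicated edge list."""
    edges = set()
    for _, ids in buckets_df.groupby(bucket_field)[id_field]:
        docs = sorted(set(ids))
        # Linking consecutive documents gives n - 1 edges per bucket, which is
        # enough for a connected-components pass to recover the whole group.
        edges.update(zip(docs, docs[1:]))
    return pd.DataFrame(sorted(edges), columns=[f"{id_field}_x", f"{id_field}_y"])


buckets = pd.DataFrame(
    {
        "id": ["a", "b", "c", "b", "d", "e"],
        "_bucket_id": [0, 0, 0, 1, 1, 2],
    }
)
edges = buckets_to_edges_sketch(buckets)
print(edges)
```

In this example, documents a, b, c and d end up in one connected component because b appears in both bucket 0 and bucket 1, while e sits alone in its bucket and produces no edge.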