modules.fuzzy_dedup.bucketstoedges
Module Contents
Classes
BucketsToEdges | Maps buckets generated from LSH into an edge list that can be processed further by Connected Components to find duplicate documents.
API
- class modules.fuzzy_dedup.bucketstoedges.BucketsToEdges(
    cache_dir: str | None = None,
    id_fields: list | str = 'id',
    str_id_name: str = 'id',
    bucket_field: str = '_bucket_id',
    logger: logging.LoggerAdapter | str = './',
    profile_dir: str | None = None,
  )
Maps buckets generated from LSH into an edge list that can be processed further by Connected Components to find duplicate documents.
Initialization
Parameters
cache_dir: str or None
    If specified, the edge list is computed and written to a file.
id_fields: list or str
    ID fields of the documents in buckets_df.
str_id_name: str
    Ignored if there is a single ID field. Multiple ID fields are combined into a single ID field with the given name.
bucket_field: str
    Column denoting the bucket ID.
num_buckets:
    Number of bands/buckets to create from the minhash signature. Hashes_per_signature = num_hashes / num_buckets
- buckets_to_edges(buckets_df: cudf.DataFrame) -> cudf.DataFrame
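The signature does not show what the bucket-to-edge conversion looks like in practice. As a rough illustration only, the pure-Python sketch below links consecutive documents within each bucket, which is one common way to turn LSH buckets into an edge list while keeping every bucket connected. It is not the library's implementation: the real method operates on a cudf.DataFrame, and the function name `buckets_to_edges_sketch` and the toy records are hypothetical.

```python
from collections import defaultdict


def buckets_to_edges_sketch(records, id_field="id", bucket_field="_bucket_id"):
    """Illustrative stand-in, NOT the library's implementation.

    records: iterable of dicts, each holding a document id and a bucket id.
    Returns a sorted list of unique (id_a, id_b) edges with id_a < id_b.
    """
    # Group document ids by the bucket they fall into.
    buckets = defaultdict(set)
    for row in records:
        buckets[row[bucket_field]].add(row[id_field])

    edges = set()
    for doc_ids in buckets.values():
        ordered = sorted(doc_ids)
        # Chain neighbours: n documents yield n - 1 edges, which is enough
        # for a connected-components pass to recover the whole bucket.
        for left, right in zip(ordered, ordered[1:]):
            edges.add((left, right))
    return sorted(edges)


# Hypothetical toy input: documents 1, 2, 3 share bucket 0; 3 and 4 share bucket 1.
toy_records = [
    {"id": 1, "_bucket_id": 0},
    {"id": 2, "_bucket_id": 0},
    {"id": 3, "_bucket_id": 0},
    {"id": 3, "_bucket_id": 1},
    {"id": 4, "_bucket_id": 1},
    {"id": 5, "_bucket_id": 2},  # alone in its bucket, so it produces no edge
]
edges = buckets_to_edges_sketch(toy_records)
print(edges)  # [(1, 2), (2, 3), (3, 4)]
```

Chaining neighbours keeps the number of edges linear in the bucket size, whereas emitting every pair would grow quadratically. A connected-components pass yields the same duplicate groups either way; here documents 1 through 4 end up in one component.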