modules.fuzzy_dedup.lsh#

Module Contents#

Classes#

LSH

Performs LSH on a set of MinHash signatures

API#

class modules.fuzzy_dedup.lsh.LSH(
cache_dir: str,
num_hashes: int,
num_buckets: int,
buckets_per_shuffle: int = 1,
false_positive_check: bool = False,
logger: logging.LoggerAdapter | str = './',
id_fields: str | list = 'id',
minhash_field: str = '_minhash_signature',
profile_dir: str | None = None,
)#

Performs LSH on a set of MinHash signatures

Initialization

Parameters

- cache_dir (str): Must be specified. Duplicate (id, bucket) pairs are computed and written to this cache directory.
- num_hashes: Length of the minhash signature.
- num_buckets: Number of bands/buckets to create from the minhash signature. hashes_per_signature = num_hashes / num_buckets.
- buckets_per_shuffle: Number of bands/buckets to shuffle concurrently. Larger values process more bands per batch but might lead to memory pressure and related errors.
- false_positive_check (bool): If True, writes out buckets in a format compatible with the downstream false positive check.
- logger: Existing logger to log to, or a path to a log directory.
- id_fields: Column(s) in the dataset denoting the document ID.
- minhash_field: Column in the dataset denoting the minhash signature.
- profile_dir (str, default None): If specified, the directory to write the Dask profile to.
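
A minimal construction sketch; the cache path and the hash/band counts below are hypothetical values chosen for illustration, not library defaults beyond those shown in the signature:

```python
from modules.fuzzy_dedup.lsh import LSH

# Hypothetical configuration: 260 hashes split into 20 bands gives
# hashes_per_signature = 260 / 20 = 13 hashes per band.
lsh = LSH(
    cache_dir="./lsh_cache",            # (id, bucket) pairs are written here
    num_hashes=260,
    num_buckets=20,
    buckets_per_shuffle=1,              # shuffle one band at a time to limit memory use
    id_fields="id",
    minhash_field="_minhash_signature",
)
```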

bucket_id_to_int(
bucket_ddf: dask_cudf.DataFrame,
bucket_col_name: str = 'bucket_id',
start_id: int = 0,
) → tuple[dask_cudf.DataFrame, int]#

Maps bucket IDs to a contiguous integer range starting from start_id.
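
A hedged usage sketch, reusing the lsh instance from the construction example above. The sample bucket values are arbitrary, and reading the returned integer as the end of the assigned range (usable as start_id in a follow-up call) is an assumption taken from the signature:

```python
import cudf
import dask_cudf

# Hypothetical input: non-contiguous bucket identifiers from an earlier step.
bucket_ddf = dask_cudf.from_cudf(
    cudf.DataFrame({"id": [0, 1, 2, 3], "bucket_id": [170, 420, 170, 990]}),
    npartitions=2,
)

# Remap bucket IDs to 0, 1, 2, ... beginning at start_id.
bucket_ddf, end_id = lsh.bucket_id_to_int(
    bucket_ddf, bucket_col_name="bucket_id", start_id=0
)
```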

lsh(write_path: str, df: dask_cudf.DataFrame) → bool#

Computes hash buckets for the DataFrame and writes them as parquet files to the specified path.

Parameters:

- write_path (str): The directory path to write parquet files.
- df (dask_cudf.DataFrame): The input DataFrame with minhashes to be bucketed.

Returns:

- are_buckets_empty: True if buckets were empty (no duplicates found), False otherwise.
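
An end-to-end toy sketch, assuming a GPU-backed Dask environment, signatures stored as list columns, and the default column names; the 6-hash signatures are shortened for readability only:

```python
import cudf
import dask_cudf

from modules.fuzzy_dedup.lsh import LSH

# Toy setup: 6-hash signatures split into 3 bands of 2 hashes each.
lsh = LSH(cache_dir="./lsh_cache", num_hashes=6, num_buckets=3)

ddf = dask_cudf.from_cudf(
    cudf.DataFrame({
        "id": [0, 1, 2],
        "_minhash_signature": [
            [1, 2, 3, 4, 5, 6],    # docs 0 and 1 agree on every band,
            [1, 2, 3, 4, 5, 6],    # so they share all bucket IDs
            [7, 8, 9, 10, 11, 12],
        ],
    }),
    npartitions=1,
)

are_buckets_empty = lsh.lsh(write_path="./lsh_buckets", df=ddf)
print(are_buckets_empty)  # expected False: docs 0 and 1 collide
```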

minhash_to_buckets(
df: cudf.DataFrame,
bucket_ranges: list[list[int]],
) → cudf.DataFrame#
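
No description accompanies this method. As a conceptual illustration only (not the library's implementation), the standalone sketch below shows the banding idea the signature suggests: each entry of bucket_ranges selects a slice of the signature, and each slice is hashed into one bucket ID per band.

```python
import hashlib

def minhash_to_buckets_sketch(signature: list[int],
                              bucket_ranges: list[list[int]]) -> list[int]:
    """Hash each band (a slice of the signature) into a single bucket ID."""
    buckets = []
    for indices in bucket_ranges:
        band = tuple(signature[i] for i in indices)
        digest = hashlib.md5(repr(band).encode()).hexdigest()
        buckets.append(int(digest, 16) % (1 << 32))
    return buckets

# Documents whose signatures agree on every position of some band receive the
# same bucket ID for that band and become duplicate candidates.
ranges = [[0, 1], [2, 3], [4, 5]]  # 3 bands of 2 hashes each
print(minhash_to_buckets_sketch([1, 2, 3, 4, 5, 6], ranges))
print(minhash_to_buckets_sketch([1, 2, 9, 9, 5, 6], ranges))  # matches bands 0 and 2
```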