modules.fuzzy_dedup.lsh
Module Contents
Classes
LSH | Performs LSH on MinHash signatures
API
- class modules.fuzzy_dedup.lsh.LSH(
- cache_dir: str,
- num_hashes: int,
- num_buckets: int,
- buckets_per_shuffle: int = 1,
- false_positive_check: bool = False,
- logger: logging.LoggerAdapter | str = './',
- id_fields: str | list = 'id',
- minhash_field: str = '_minhash_signature',
- profile_dir: str | None = None,
- )
Performs LSH on MinHash signatures
Initialization
Parameters
cache_dir: str
  Needs to be specified; duplicate (id, bucket) pairs are computed and written to this cache directory.
num_hashes: int
  Length of the minhash signature.
num_buckets: int
  Number of bands/buckets to create from the minhash signature. hashes_per_signature = num_hashes / num_buckets.
buckets_per_shuffle: int
  Number of bands/buckets to shuffle concurrently. Larger values process more buckets at once but might lead to memory pressure and related errors.
false_positive_check: bool
  If True, writes out buckets in a format compatible with the downstream false positive check.
logger: logging.LoggerAdapter | str
  Existing logger to log to, or a path to a log directory.
id_fields: str | list
  Column(s) in the dataset denoting the document ID.
minhash_field: str
  Column in the dataset denoting the minhash signature.
profile_dir: str | None
  Default None. If specified, the directory to write the Dask profile to.
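For illustration, a minimal construction sketch follows. The values and paths are made up (with num_hashes=260 and num_buckets=20, each band covers 260 / 20 = 13 hashes), and the import path simply mirrors the module path shown on this page; a packaged install may expose the class elsewhere.

```python
from modules.fuzzy_dedup.lsh import LSH

# Illustrative values only: 260 hashes split into 20 bands/buckets
# means each band covers 260 / 20 = 13 hashes of the signature.
lsh = LSH(
    cache_dir="/tmp/lsh_cache",  # (id, bucket) pairs are written under this directory
    num_hashes=260,
    num_buckets=20,
    buckets_per_shuffle=1,
    false_positive_check=False,
    id_fields="id",
    minhash_field="_minhash_signature",
)
```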
- bucket_id_to_int(
- bucket_ddf: dask_cudf.DataFrame,
- bucket_col_name: str = 'bucket_id',
- start_id: int = 0,
- )
Maps bucket ids to a contiguous integer range starting from start_id.
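As a rough sketch of what this mapping does (not the library's actual implementation, just a cudf illustration with made-up values): distinct bucket hashes are re-labelled with consecutive integers, offset by start_id.

```python
import cudf

# Hypothetical illustration: arbitrary bucket hashes are replaced by
# small consecutive integers, offset by start_id.
bucket_df = cudf.DataFrame({"bucket_id": [981_244, 17, 981_244, 55_301]})
codes, uniques = bucket_df["bucket_id"].factorize()
start_id = 0
bucket_df["bucket_id"] = codes.astype("int64") + start_id
print(bucket_df)  # bucket_id now holds contiguous integers instead of raw hashes
```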
- lsh(write_path: str, df: dask_cudf.DataFrame) -> bool
Computes hash buckets for the DataFrame and writes them as parquet files to the specified path.
Parameters
- write_path (str): The directory path to write parquet files.
- df (dask_cudf.DataFrame): The input DataFrame with minhashes to be bucketed.
Returns
- are_buckets_empty: True if the buckets were empty (no duplicates found), False otherwise.
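A hedged end-to-end sketch, assuming a local GPU with a Dask-CUDA cluster, made-up paths, and toy data; the import path again mirrors the module path on this page.

```python
import cudf
import dask_cudf
import numpy as np
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

from modules.fuzzy_dedup.lsh import LSH

client = Client(LocalCUDACluster())  # assumes at least one local GPU

# Toy input: 4 documents with random 260-value MinHash signatures.
rng = np.random.default_rng(0)
num_hashes = 260
minhashes = cudf.DataFrame(
    {
        "id": [0, 1, 2, 3],
        "_minhash_signature": [
            rng.integers(0, 2**32, num_hashes).tolist() for _ in range(4)
        ],
    }
)
ddf = dask_cudf.from_cudf(minhashes, npartitions=1)

lsh = LSH(cache_dir="/tmp/lsh_cache", num_hashes=num_hashes, num_buckets=20)
are_buckets_empty = lsh.lsh(write_path="/tmp/lsh_cache/buckets", df=ddf)
if are_buckets_empty:
    print("No documents shared a bucket; nothing to deduplicate.")
```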
- minhash_to_buckets(
- df: cudf.DataFrame,
- bucket_ranges: list[list[int]],