modules.fuzzy_dedup.lsh#

Module Contents#

Classes#

LSH

Performs LSH on a set of MinHash signatures

API#

class modules.fuzzy_dedup.lsh.LSH(
cache_dir: str,
num_hashes: int,
num_buckets: int,
buckets_per_shuffle: int = 1,
false_positive_check: bool = False,
logger: logging.LoggerAdapter | str = './',
id_fields: str | list = 'id',
minhash_field: str = '_minhash_signature',
profile_dir: str | None = None,
)#

Performs LSH on a set of MinHash signatures

Initialization

Parameters

- cache_dir (str): Must be specified. Duplicate (id, bucket) pairs are computed and written to this cache directory.
- num_hashes: Length of the minhash signature.
- num_buckets: Number of bands/buckets to create from the minhash signature. hashes_per_signature = num_hashes / num_buckets.
- buckets_per_shuffle: Number of bands/buckets to shuffle concurrently. Larger values process more bands per batch but might lead to memory pressure and related errors.
- false_positive_check (bool): If True, writes out buckets in a format compatible with the downstream false positive check.
- logger: Existing logger to log to, or a path to a log directory.
- id_fields: Column(s) in the dataset denoting the document ID.
- minhash_field: Column in the dataset denoting the minhash signature.
- profile_dir (str, default None): If specified, the directory to write the Dask profile to.
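
A minimal construction sketch; the cache path and the hash/band counts below are hypothetical values chosen for illustration, not library defaults beyond those shown in the signature:

```python
from modules.fuzzy_dedup.lsh import LSH

# Hypothetical configuration: 260 hashes split into 20 bands gives
# hashes_per_signature = 260 / 20 = 13 hashes per band.
lsh = LSH(
    cache_dir="./lsh_cache",            # (id, bucket) pairs are written here
    num_hashes=260,
    num_buckets=20,
    buckets_per_shuffle=1,              # shuffle one band at a time to limit memory use
    id_fields="id",
    minhash_field="_minhash_signature",
)
```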

bucket_id_to_int(
bucket_ddf: dask_cudf.DataFrame,
bucket_col_name: str = 'bucket_id',
start_id: int = 0,
) → tuple[dask_cudf.DataFrame, int]#

Maps bucket IDs to a contiguous integer range starting from start_id.
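
A hedged usage sketch, reusing the lsh instance from the construction example above. The sample bucket values are arbitrary, and reading the returned integer as the end of the assigned range (usable as start_id in a follow-up call) is an assumption taken from the signature:

```python
import cudf
import dask_cudf

# Hypothetical input: non-contiguous bucket identifiers from an earlier step.
bucket_ddf = dask_cudf.from_cudf(
    cudf.DataFrame({"id": [0, 1, 2, 3], "bucket_id": [170, 420, 170, 990]}),
    npartitions=2,
)

# Remap bucket IDs to 0, 1, 2, ... beginning at start_id.
bucket_ddf, end_id = lsh.bucket_id_to_int(
    bucket_ddf, bucket_col_name="bucket_id", start_id=0
)
```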

lsh(write_path: str, df: dask_cudf.DataFrame) → bool#

Computes hash buckets for the DataFrame and writes them as parquet files to the specified path.

Parameters:

- write_path (str): The directory path to write parquet files.
- df (dask_cudf.DataFrame): The input DataFrame with minhashes to be bucketed.

Returns:

- are_buckets_empty: True if buckets were empty (no duplicates found), False otherwise.
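
An end-to-end toy sketch, assuming a GPU-backed Dask environment, signatures stored as list columns, and the default column names; the 6-hash signatures are shortened for readability only:

```python
import cudf
import dask_cudf

from modules.fuzzy_dedup.lsh import LSH

# Toy setup: 6-hash signatures split into 3 bands of 2 hashes each.
lsh = LSH(cache_dir="./lsh_cache", num_hashes=6, num_buckets=3)

ddf = dask_cudf.from_cudf(
    cudf.DataFrame({
        "id": [0, 1, 2],
        "_minhash_signature": [
            [1, 2, 3, 4, 5, 6],    # docs 0 and 1 agree on every band,
            [1, 2, 3, 4, 5, 6],    # so they share all bucket IDs
            [7, 8, 9, 10, 11, 12],
        ],
    }),
    npartitions=1,
)

are_buckets_empty = lsh.lsh(write_path="./lsh_buckets", df=ddf)
print(are_buckets_empty)  # expected False: docs 0 and 1 collide
```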

minhash_to_buckets(
df: cudf.DataFrame,
bucket_ranges: list[list[int]],
) → cudf.DataFrame#
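
No description accompanies this method. As a conceptual illustration only (not the library's implementation), the standalone sketch below shows the banding idea the signature suggests: each entry of bucket_ranges selects a slice of the signature, and each slice is hashed into one bucket ID per band.

```python
import hashlib

def minhash_to_buckets_sketch(signature: list[int],
                              bucket_ranges: list[list[int]]) -> list[int]:
    """Hash each band (a slice of the signature) into a single bucket ID."""
    buckets = []
    for indices in bucket_ranges:
        band = tuple(signature[i] for i in indices)
        digest = hashlib.md5(repr(band).encode()).hexdigest()
        buckets.append(int(digest, 16) % (1 << 32))
    return buckets

# Documents whose signatures agree on every position of some band receive the
# same bucket ID for that band and become duplicate candidates.
ranges = [[0, 1], [2, 3], [4, 5]]  # 3 bands of 2 hashes each
print(minhash_to_buckets_sketch([1, 2, 3, 4, 5, 6], ranges))
print(minhash_to_buckets_sketch([1, 2, 9, 9, 5, 6], ranges))  # matches bands 0 and 2
```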