nemo_curator.stages.deduplication.fuzzy.lsh.stage
Module Contents
Classes
API
Bases: ProcessingStage[FileGroupTask, FileGroupTask]
Stage that performs LSH on a FileGroupTask containing minhash data.
The executor will process this stage in iterations based on bands_per_iteration.
Parameters
num_bands: Number of LSH bands.
minhashes_per_band: Number of minhashes per band.
id_field: Name of the ID field in input data.
minhash_field: Name of the minhash field in input data.
output_path: Base path to write output files.
read_kwargs: Keyword arguments for the read method.
write_kwargs: Keyword arguments for the write method.
rmm_pool_size: Size of the RMM GPU memory pool in bytes. If “auto”, the memory pool is set to 90% of the free GPU memory. If None, the memory pool is set to 50% of the free GPU memory and can expand if needed.
spill_memory_limit: Device memory limit in bytes for spilling to host. If “auto”, the limit is set to 80% of the RMM pool size. If None, spilling is disabled.
enable_statistics: Whether to collect statistics.
bands_per_iteration: Number of bands to process per shuffle iteration. Must be between 1 and num_bands. Higher values reduce the number of shuffle iterations but increase memory usage.
total_nparts: Total number of partitions to write during the shuffle. If None, the executor chooses the number of partitions automatically as the closest power of 2 <= the number of input tasks.
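To make the roles of num_bands and minhashes_per_band concrete, here is a minimal pure-Python sketch of LSH banding. It is not the stage's GPU implementation: the helper names band_keys and default_nparts are hypothetical, MD5 is an arbitrary choice of bucket hash, and default_nparts reflects one reading of the total_nparts default (the largest power of 2 that does not exceed the number of input tasks).

```python
import hashlib


def band_keys(minhashes, num_bands, minhashes_per_band):
    """Return one bucket key per band for a single minhash signature.

    Hypothetical helper for illustration; not part of nemo_curator.
    """
    if len(minhashes) < num_bands * minhashes_per_band:
        raise ValueError("signature is shorter than num_bands * minhashes_per_band")
    keys = []
    for band in range(num_bands):
        start = band * minhashes_per_band
        chunk = minhashes[start:start + minhashes_per_band]
        # Include the band index in the hashed payload so that identical
        # values appearing in different bands map to different buckets.
        payload = f"{band}:" + ",".join(str(v) for v in chunk)
        keys.append(hashlib.md5(payload.encode()).hexdigest())
    return keys


def default_nparts(num_input_tasks):
    """Largest power of 2 <= num_input_tasks (assumed reading of the default)."""
    if num_input_tasks < 1:
        raise ValueError("num_input_tasks must be at least 1")
    return 1 << (num_input_tasks.bit_length() - 1)


# Two toy signatures that differ only in the third minhash value.
doc_a = [11, 22, 33, 44, 55, 66]
doc_b = [11, 22, 99, 44, 55, 66]
keys_a = band_keys(doc_a, num_bands=3, minhashes_per_band=2)
keys_b = band_keys(doc_b, num_bands=3, minhashes_per_band=2)

# Bands in which both documents fall into the same bucket.
shared = [i for i, (a, b) in enumerate(zip(keys_a, keys_b)) if a == b]
```

In this toy case the two documents share buckets in bands 0 and 2, so they would be treated as candidate duplicates. More bands with fewer minhashes each make a shared bucket more likely for moderately similar documents.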
Get all band ranges for iteration.
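As a rough illustration of how bands_per_iteration could split the bands into shuffle iterations, the sketch below computes half-open (start, end) ranges. The function name band_ranges and the half-open convention are assumptions made for this example; the stage's actual method may return a different shape.

```python
def band_ranges(num_bands, bands_per_iteration):
    """Split range(num_bands) into consecutive half-open (start, end) chunks.

    Hypothetical helper for illustration; not part of nemo_curator.
    """
    if not 1 <= bands_per_iteration <= num_bands:
        raise ValueError("bands_per_iteration must be between 1 and num_bands")
    return [
        # The last chunk is clipped when num_bands is not an exact multiple.
        (start, min(start + bands_per_iteration, num_bands))
        for start in range(0, num_bands, bands_per_iteration)
    ]


even_split = band_ranges(num_bands=20, bands_per_iteration=5)
uneven_split = band_ranges(num_bands=10, bands_per_iteration=4)
```

With 20 bands and 5 bands per iteration this yields four iterations. With 10 bands and 4 per iteration it yields three, the last covering only bands 8 and 9.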
Ray stage specification for this stage.