nemo_curator.stages.deduplication.shuffle_utils.stage
Module Contents
Classes
API
Bases: ProcessingStage[FileGroupTask, FileGroupTask]
Stage that performs generic shuffling on specified columns from a FileGroupTask. This stage uses the BulkRapidsMPFShuffler with cuDF I/O for efficient GPU-based shuffling.
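As a conceptual illustration only, shuffling on a set of columns can be pictured as routing every row to an output partition chosen from the values in those columns, so that rows with equal keys end up together. The sketch below assumes hash partitioning and uses plain Python dictionaries instead of GPU dataframes; `partition_id` and `shuffle_rows` are hypothetical helper names, not part of this module, and the real stage delegates this work to the BulkRapidsMPFShuffler.

```python
import hashlib
from collections import defaultdict


def partition_id(row, shuffle_on, total_nparts):
    """Map a row to a partition from the values of its shuffle columns.

    Hypothetical helper for illustration; not the library's hashing scheme.
    """
    # Join the key values with a separator unlikely to appear in the data.
    key = "\x1f".join(str(row[col]) for col in shuffle_on)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo nparts.
    return int.from_bytes(digest[:8], "big") % total_nparts


def shuffle_rows(rows, shuffle_on, total_nparts):
    """Group rows by partition id; rows with equal keys share a partition."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_id(row, shuffle_on, total_nparts)].append(row)
    return dict(parts)


rows = [
    {"bucket": "a", "doc_id": 1},
    {"bucket": "b", "doc_id": 2},
    {"bucket": "a", "doc_id": 3},
]
partitions = shuffle_rows(rows, shuffle_on=["bucket"], total_nparts=4)
```

Here both rows with `bucket == "a"` land in the same partition, which is the property downstream deduplication steps typically rely on.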
Parameters
shuffle_on: List of column names to shuffle on.
total_nparts: Total number of output partitions. If None, will be set automatically by the executor.
output_path: Path to write output files.
read_kwargs: Keyword arguments for cudf.read_parquet method.
write_kwargs: Keyword arguments for cudf.to_parquet method.
rmm_pool_size: Size of the RMM GPU memory pool in bytes. If “auto”, the memory pool is set to 90% of the free GPU memory. If None, the memory pool is set to 50% of the free GPU memory and can expand if needed.
spill_memory_limit: Device memory limit in bytes for spilling to host. If “auto”, the limit is set to 80% of the RMM pool size. If None, spilling is disabled.
enable_statistics: Whether the underlying rapidsmpf shuffler should collect shuffle statistics.
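The rules for rmm_pool_size and spill_memory_limit can be worked through with a small calculation. The following is a minimal sketch of the documented percentages only; `resolve_memory_settings` is a hypothetical function written for this example, it does not query a GPU, and it is not the stage's actual implementation.

```python
def resolve_memory_settings(free_gpu_bytes, rmm_pool_size="auto", spill_memory_limit="auto"):
    """Return (pool_bytes, spill_limit_bytes) following the documented rules.

    Hypothetical helper: free_gpu_bytes is passed in by the caller rather
    than read from a device.
    """
    if rmm_pool_size == "auto":
        pool_bytes = free_gpu_bytes * 90 // 100   # 90% of free GPU memory
    elif rmm_pool_size is None:
        pool_bytes = free_gpu_bytes * 50 // 100   # 50% initial size, may expand
    else:
        pool_bytes = int(rmm_pool_size)           # explicit size in bytes

    if spill_memory_limit == "auto":
        spill_bytes = pool_bytes * 80 // 100      # 80% of the RMM pool size
    elif spill_memory_limit is None:
        spill_bytes = None                        # spilling disabled
    else:
        spill_bytes = int(spill_memory_limit)     # explicit limit in bytes

    return pool_bytes, spill_bytes


# With 40 GiB free and both settings left at "auto":
GIB = 1024 ** 3
pool, spill = resolve_memory_settings(40 * GIB)
```

With 40 GiB free, the "auto" settings give a 36 GiB pool and a spill limit of 28.8 GiB, so spilling to host would begin well before the pool is exhausted.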
Verify the actor object is properly initialized.
Not implemented for actor-based stages.
Ray stage specification for this stage.
Read files and insert into shuffler.