stages.client_partitioning#

Module Contents#

Classes#

ClientPartitioningStage

Stage that partitions input file paths from a client into FileGroupTasks.

API#

class stages.client_partitioning.ClientPartitioningStage#

Bases: nemo_curator.stages.file_partitioning.FilePartitioningStage

Stage that partitions input file paths from a client into FileGroupTasks.

This stage runs as a dedicated processing stage (not on the driver) and creates file groups based on the partitioning strategy.

input_list_json_path: str | None#

None

process(
_: nemo_curator.tasks._EmptyTask,
) → list[nemo_curator.tasks.FileGroupTask]#

Process the initial task to create file group tasks.

This stage expects a simple task carrying file path information and outputs multiple FileGroupTasks for parallel processing.
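The grouping idea can be illustrated with a minimal, library-independent sketch. It is not the nemo_curator implementation: the helper names `partition_paths` and `load_and_partition`, the `files_per_group` parameter, and the assumption that the JSON file holds a flat list of path strings are all hypothetical and chosen only for illustration.

```python
import json


def partition_paths(paths: list[str], files_per_group: int) -> list[list[str]]:
    """Split paths into consecutive groups of at most files_per_group items."""
    if files_per_group < 1:
        raise ValueError("files_per_group must be at least 1")
    return [
        paths[i : i + files_per_group]
        for i in range(0, len(paths), files_per_group)
    ]


def load_and_partition(input_list_json_path: str, files_per_group: int) -> list[list[str]]:
    """Read a list of file paths from a JSON file and group them."""
    # Assumption: the JSON document is a flat list of path strings.
    with open(input_list_json_path, encoding="utf-8") as f:
        paths = json.load(f)
    # Sort so that the grouping is deterministic across runs.
    return partition_paths(sorted(paths), files_per_group)
```

Each inner list stands in for the file paths that one FileGroupTask might carry, so downstream workers can process the groups independently.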

setup(
worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None,
) → None#

Setup method called once before processing begins. Override this method to perform any initialization that should happen once per worker.

Args:
    worker_metadata (WorkerMetadata, optional): Information about the worker (provided by some backends).
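The once-per-worker pattern described above can be sketched as follows. `WorkerMetadataSketch` and `StageSketch` are hypothetical stand-ins written for this example; they are not the nemo_curator classes, and the real `WorkerMetadata` fields may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WorkerMetadataSketch:
    """Hypothetical stand-in for WorkerMetadata; real fields may differ."""

    worker_id: str = "worker-0"


@dataclass
class StageSketch:
    """Hypothetical stand-in for a stage that overrides setup()."""

    worker_id: str | None = None
    paths: list[str] = field(default_factory=list)

    def setup(self, worker_metadata: WorkerMetadataSketch | None = None) -> None:
        # One-time initialization per worker: record worker details (if the
        # backend supplies them) and prepare state that process() will reuse.
        if worker_metadata is not None:
            self.worker_id = worker_metadata.worker_id
        self.paths = ["a.jsonl", "b.jsonl"]  # placeholder for real initialization

    def process(self, _: object) -> list[list[str]]:
        # Reuse the state prepared in setup(); here, one path per group.
        return [[p] for p in self.paths]
```

In this sketch, `setup` tolerates a missing `worker_metadata` argument, mirroring the `None` default in the signature above.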