backends.experimental.ray_data.executor

Module Contents

Classes

RayDataExecutor

Ray Data-based executor for pipeline execution.

API

class backends.experimental.ray_data.executor.RayDataExecutor(config: dict[str, Any] | None = None)

Bases: nemo_curator.backends.base.BaseExecutor

Ray Data-based executor for pipeline execution.

This executor:

  1. Executes setup on all nodes for all stages

  2. Converts the initial tasks to a Ray Data dataset

  3. Applies each stage as a Ray Data transformation (run as a task or an actor via map_batches)

  4. Returns the final results as a list of tasks
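
The four steps above can be illustrated with a minimal, Ray-free sketch. Everything in it is hypothetical: `ToyTask`, `ToyStage`, `run_pipeline`, and the `batch_size` parameter are illustrative names that are not part of this module, and a plain Python list with a batch loop stands in for the Ray Data dataset and its map_batches call. It shows only the order of operations, not how the executor is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyTask:
    """Hypothetical stand-in for a pipeline task; holds one integer payload."""
    data: int


class ToyStage:
    """Hypothetical stand-in for a processing stage."""

    def __init__(self, name: str, fn: Callable[[int], int]) -> None:
        self.name = name
        self.fn = fn
        self.is_set_up = False

    def setup(self) -> None:
        # One-time initialization, e.g. loading a model.
        self.is_set_up = True

    def process_batch(self, batch: list) -> list:
        # Transform every task in the batch and return new tasks.
        return [ToyTask(self.fn(task.data)) for task in batch]


def run_pipeline(stages: list, initial_tasks: Optional[list] = None,
                 batch_size: int = 2) -> list:
    # Step 1: run setup for every stage before any data flows.
    for stage in stages:
        stage.setup()

    # Step 2: a plain list stands in for the Ray Data dataset (assumption).
    dataset = list(initial_tasks or [])

    # Step 3: apply each stage batch by batch, loosely mimicking map_batches.
    for stage in stages:
        transformed = []
        for start in range(0, len(dataset), batch_size):
            batch = dataset[start:start + batch_size]
            transformed.extend(stage.process_batch(batch))
        dataset = transformed

    # Step 4: return the final tasks as a list.
    return dataset


stages = [
    ToyStage("double", lambda x: x * 2),
    ToyStage("increment", lambda x: x + 1),
]
results = run_pipeline(stages, [ToyTask(1), ToyTask(2), ToyTask(3)])
print([task.data for task in results])  # [3, 5, 7]
```

In the sketch, each stage consumes the full output of the previous one; a real Ray Data pipeline would typically stream batches between stages rather than materialize intermediate lists.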

Initialization

execute(
    stages: list[nemo_curator.stages.base.ProcessingStage],
    initial_tasks: list[nemo_curator.tasks.Task] | None = None,
) -> list[nemo_curator.tasks.Task]

Execute the pipeline stages using Ray Data.

Args:
  stages (list[ProcessingStage]): List of processing stages to execute.
  initial_tasks (list[Task], optional): Initial tasks to process. May be None to start with no tasks.

Returns:
  list[Task]: List of final processed tasks.
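
A call to `execute` might look like the sketch below. The helper `run_with_ray_data` is hypothetical, and the import path is an assumption: it presumes the module lives under the `nemo_curator` package, as the base-class path `nemo_curator.backends.base.BaseExecutor` suggests. The import is deferred to call time so the helper can be defined even where the package is not installed.

```python
from __future__ import annotations

from typing import Any


def run_with_ray_data(
    stages: list,
    initial_tasks: list | None = None,
    config: dict[str, Any] | None = None,
) -> list:
    """Hypothetical helper: build a RayDataExecutor and run the stages."""
    # Assumption: the module is importable under the nemo_curator package.
    from nemo_curator.backends.experimental.ray_data.executor import (
        RayDataExecutor,
    )

    # config is optional, matching the documented constructor signature.
    executor = RayDataExecutor(config=config)

    # execute returns the final processed tasks as a list.
    return executor.execute(stages, initial_tasks=initial_tasks)
```

The stages passed in are expected to be `ProcessingStage` instances, and `initial_tasks` can be left as None when the first stage produces its own tasks.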