nemo_curator.backends.ray_data.executor

View as Markdown

Module Contents

Classes

NameDescription
RayDataExecutorRay Data-based executor for pipeline execution.

API

class nemo_curator.backends.ray_data.executor.RayDataExecutor(
config: dict[str, typing.Any] | None = None,
ignore_head_node: bool = False
)

Bases: BaseExecutor

Ray Data-based executor for pipeline execution.

This executor:

  1. Executes setup on all nodes for all stages
  2. Converts initial tasks to Ray Data dataset
  3. Applies each stage as a Ray Data transformation (as a task or actor in map_batches)
  4. Returns final results as a list of tasks
nemo_curator.backends.ray_data.executor.RayDataExecutor._dataset_to_tasks(
dataset: ray.data.Dataset
) -> list[nemo_curator.tasks.Task]

Convert Ray Data dataset back to list of tasks.

Parameters:

dataset
Dataset

Ray Data dataset containing Task objects

Returns: list[Task]

List of Task objects

nemo_curator.backends.ray_data.executor.RayDataExecutor._tasks_to_dataset(
tasks: list[nemo_curator.tasks.Task]
) -> ray.data.Dataset

Convert list of tasks to Ray Data dataset.

Parameters:

tasks
list[Task]

List of Task objects

Returns: Dataset

Ray Data dataset containing Task objects directly

nemo_curator.backends.ray_data.executor.RayDataExecutor.execute(
stages: list[nemo_curator.stages.base.ProcessingStage],
initial_tasks: list[nemo_curator.tasks.Task] | None = None
) -> list[nemo_curator.tasks.Task]

Execute the pipeline stages using Ray Data.

Parameters:

stages
list[ProcessingStage]

List of processing stages to execute

initial_tasks
list[Task]Defaults to None

Initial tasks to process (can be None for empty start)

Returns: list[Task]

list[Task]: List of final processed tasks