Experimental Executors

NeMo Curator provides experimental executors for alternative execution backends. These are located in nemo_curator.backends.experimental.

Experimental executors are subject to change and may not have full feature parity with XennaExecutor.

RayDataExecutor was promoted from experimental in 26.04. Import it from nemo_curator.backends.ray_data. See Pipeline Execution Backends for details.
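
Code that must run on releases from both before and after the 26.04 promotion can resolve the class by trying each import path in turn. The following is a minimal sketch, not part of NeMo Curator: `resolve_symbol` is a hypothetical helper, and using nemo_curator.backends.experimental as the pre-26.04 fallback location is an assumption to verify against the release you target.

```python
import importlib
from typing import Any


def resolve_symbol(name: str, module_paths: list[str]) -> Any:
    """Return the first attribute called `name` found in `module_paths`, else None."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module is absent in this installation; try the next candidate.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    return None


# First path: the post-26.04 location named in the note above.
# Second path: an ASSUMED older location, kept only as a fallback.
RayDataExecutor = resolve_symbol(
    "RayDataExecutor",
    ["nemo_curator.backends.ray_data", "nemo_curator.backends.experimental"],
)
```

If neither path is importable, `RayDataExecutor` is `None`, so callers should check for that before constructing an executor.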

RayActorPoolExecutor

Uses Ray Actor Pool for distributed execution with built-in progress tracking.

Import

from nemo_curator.backends.experimental import RayActorPoolExecutor

Usage

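executor = RayActorPoolExecutor(
    config={
        "pool_size": 8,
    },
    ignore_head_node=True,   # keep work off the head node
    show_progress=True,      # display tqdm progress bars
    progress_interval=10.0,  # seconds between progress bar updates
)

# `pipeline` is a pipeline object defined elsewhere.
results = pipeline.run(executor=executor)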

Configuration

Option	Type	Default	Description
`config`	`dict | None`	`None`	Executor-specific configuration dictionary
`ignore_head_node`	`bool`	`False`	Exclude head node from execution
`show_progress`	`bool`	`True`	Display tqdm progress bars during stage execution and shuffle inserts
`progress_interval`	`float`	`10.0`	Minimum interval in seconds between progress bar updates
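
To illustrate what a minimum update interval means in practice, here is a small standalone sketch of time-based throttling. It is not the `RayActorPoolExecutor` implementation: `ThrottledProgress` and its `clock` parameter are hypothetical, and the sketch assumes that "minimum interval" means a refresh is skipped until at least that many seconds have passed since the previous one, with a final refresh on completion.

```python
import time
from typing import Callable


class ThrottledProgress:
    """Hypothetical progress reporter that refreshes at most once per interval."""

    def __init__(
        self,
        total: int,
        interval: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.total = total
        self.interval = interval
        self._clock = clock
        self._last_emit: float | None = None
        self.completed = 0
        self.emitted: list[int] = []  # completed counts at each refresh

    def update(self, n: int = 1) -> bool:
        """Record `n` finished items; return True if a refresh was emitted."""
        self.completed += n
        now = self._clock()
        finished = self.completed >= self.total
        due = self._last_emit is None or (now - self._last_emit) >= self.interval
        if due or finished:
            self._last_emit = now
            self.emitted.append(self.completed)
            return True
        return False


# Drive the reporter with a fake clock so the behavior is deterministic.
fake_now = [0.0]
progress = ThrottledProgress(total=5, interval=10.0, clock=lambda: fake_now[0])
for t in (0.0, 3.0, 12.0, 15.0, 16.0):
    fake_now[0] = t
    progress.update()
```

With a 10-second interval, the updates at 3 s and 15 s are suppressed, while the first update, the one at 12 s, and the final one are shown.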

BaseExecutor Interface

All executors inherit from BaseExecutor:

from abc import ABC, abstractmethod
from typing import Any

from nemo_curator.stages.base import ProcessingStage
from nemo_curator.tasks import Task

class BaseExecutor(ABC):
    """Base class for all executors."""

    def __init__(
        self,
        config: dict[str, Any] | None = None,
        ignore_head_node: bool = False,
    ) -> None:
        """Initialize executor.

        Args:
            config: Executor-specific configuration.
            ignore_head_node: Exclude head node from execution.
        """
        self.config = config or {}
        self.ignore_head_node = ignore_head_node

    @abstractmethod
    def execute(
        self,
        stages: list[ProcessingStage],
        initial_tasks: list[Task] | None = None,
    ) -> list[Task]:
        """Execute pipeline stages.

        Args:
            stages: Processing stages to execute.
            initial_tasks: Initial tasks (defaults to EmptyTask).

        Returns:
            Output tasks from final stage.
        """

Creating Custom Executors

from nemo_curator.backends.base import BaseExecutor
from nemo_curator.stages.base import ProcessingStage
# EmptyTask is assumed to be exported from nemo_curator.tasks alongside Task.
from nemo_curator.tasks import EmptyTask, Task

class MyCustomExecutor(BaseExecutor):
    """Custom executor that runs every stage sequentially in one process."""

    def execute(
        self,
        stages: list[ProcessingStage],
        initial_tasks: list[Task] | None = None,
    ) -> list[Task]:
        # Fall back to a single empty task when no initial tasks are given.
        tasks = initial_tasks or [EmptyTask()]

        for stage in stages:
            stage.setup({})
            new_tasks = []
            for task in tasks:
                result = stage.process(task)
                # A stage may return None (drop), one task, or a list of tasks.
                if result is not None:
                    if isinstance(result, list):
                        new_tasks.extend(result)
                    else:
                        new_tasks.append(result)
            stage.teardown()
            # The outputs of this stage become the inputs of the next one.
            tasks = new_tasks

        return tasks

Choosing an Executor

Executor	Best For	Considerations
`XennaExecutor`	Production workloads	Default choice, most stable
`RayDataExecutor`	Ray-native environments	Promoted from experimental in 26.04
`RayActorPoolExecutor`	Fine-grained actor control	Experimental
