Execution Backends | NeMo Curator

Configure and optimize execution backends to run NeMo Curator pipelines efficiently across single machines, multi-GPU systems, and distributed clusters.

Overview

Execution backends (executors) are the engines that run NeMo Curator Pipeline workflows across your compute resources. They handle:

Task Distribution: Distribute pipeline stages across available workers and GPUs
Resource Management: Allocate CPU, GPU, and memory resources to processing tasks
Scaling: Automatically or manually scale processing based on workload
Data Movement: Optimize data transfer between pipeline stages

Choosing the right executor impacts:

Pipeline performance and throughput
Resource utilization efficiency
Ease of deployment and monitoring

This guide covers all execution backends available in NeMo Curator and applies to all modalities: text, image, video, and audio curation.

Basic Usage Pattern

All pipelines follow this standard execution pattern:

from nemo_curator.pipeline import Pipeline

pipeline = Pipeline(name="example_pipeline", description="Curator pipeline")
pipeline.add_stage(...)

# Choose an executor below and run
results = pipeline.run(executor)

Key points:

The same pipeline definition works with any executor
Executor choice is independent of pipeline stages
Switch executors without changing pipeline code

Available Backends

`XennaExecutor` (recommended)

XennaExecutor uses Cosmos-Xenna, a Ray-based execution engine optimized for distributed data processing. Xenna provides native streaming support, automatic resource scaling, and built-in fault tolerance. This executor is the recommended choice for most workloads, especially for video and multimodal pipelines.

Key Features:

Streaming execution: Process data incrementally as it arrives, reducing memory requirements
Auto-scaling: Dynamically adjusts worker allocation based on stage throughput
Fault tolerance: Built-in error handling and recovery mechanisms
Resource optimization: Efficient CPU and GPU allocation for video/multimodal workloads

from nemo_curator.backends.xenna import XennaExecutor

executor = XennaExecutor(
    config={
        # Execution mode: 'streaming' (default) or 'batch'
        # Batch processes all data for a stage before moving to the next; streaming runs stages concurrently.
        "execution_mode": "streaming",

        # Logging interval: seconds between status logs (default: 60)
        # Controls how frequently progress updates are printed
        "logging_interval": 60,

        # Ignore failures: whether to continue on failures (default: False)
        # When True, the pipeline continues execution instead of failing fast when stages raise errors.
        "ignore_failures": False,

        # CPU allocation percentage: ratio of CPU to allocate (0-1, default: 0.95)
        # Fraction of available CPU resources to use for pipeline execution
        "cpu_allocation_percentage": 0.95,

        # Autoscale interval: seconds between auto-scaling checks (default: 180)
        # How often to run the stage auto-scaler.
        "autoscale_interval_s": 180,

        # Max workers per stage: maximum number of workers (optional)
        # Limits worker count per stage; None means no limit
        "max_workers_per_stage": None,
    }
)

results = pipeline.run(executor)
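
The difference between the two execution modes can be illustrated with a toy model in plain Python. This is only a sketch of the scheduling order, not how Xenna is implemented: the helper names (`run_batch`, `run_streaming`) and the integer "tasks" are hypothetical, and generators merely interleave stages on one thread, whereas a real streaming executor runs stages concurrently on separate workers.

```python
from typing import Callable, Iterable, List, Tuple

Stage = Callable[[int], int]
Log = List[Tuple[int, int]]  # (stage index, input item) in processing order


def run_batch(items: Iterable[int], stages: List[Stage], log: Log) -> List[int]:
    """Finish every item in a stage before the next stage starts."""
    data = list(items)
    for idx, stage in enumerate(stages):
        output = []
        for item in data:
            log.append((idx, item))
            output.append(stage(item))
        data = output  # the full intermediate result is held in memory
    return data


def run_streaming(items: Iterable[int], stages: List[Stage], log: Log) -> List[int]:
    """Pass each item through all stages as soon as it is available."""

    def apply(stream, idx, stage):
        for item in stream:
            log.append((idx, item))
            yield stage(item)

    stream = iter(items)
    for idx, stage in enumerate(stages):
        stream = apply(stream, idx, stage)
    return list(stream)


def double(x: int) -> int:
    return x * 2


def increment(x: int) -> int:
    return x + 1


batch_log: Log = []
streaming_log: Log = []
batch_result = run_batch([1, 2], [double, increment], batch_log)
streaming_result = run_streaming([1, 2], [double, increment], streaming_log)

print("batch order:    ", batch_log)
print("streaming order:", streaming_log)
```

Both modes produce the same results; only the order of work differs. In the batch log, stage 0 handles every item before stage 1 starts. In the streaming log, the first item reaches stage 1 before the second item enters stage 0, which is why streaming can keep less intermediate data in memory.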

Configuration Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `execution_mode` | `str` | `"streaming"` | Execution mode: `"streaming"` for incremental processing or `"batch"` for full dataset processing |
| `logging_interval` | `int` | `60` | Seconds between status log updates |
| `ignore_failures` | `bool` | `False` | If `True`, continue pipeline execution even when stages fail |
| `cpu_allocation_percentage` | `float` | `0.95` | Fraction (0-1) of available CPU resources to allocate |
| `autoscale_interval_s` | `int` | `180` | Seconds between auto-scaling evaluations |
| `max_workers_per_stage` | `int \| None` | `None` | Maximum workers per stage; `None` means no limit |
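
A small helper can catch typos in the configuration dictionary before a pipeline starts. The sketch below is not part of NeMo Curator: `XENNA_DEFAULTS` and `build_xenna_config` are hypothetical names, the keys and defaults are copied from the table above, and the validation rules (for example, rejecting a CPU fraction of zero) are assumptions rather than documented executor behavior.

```python
from typing import Any, Dict, Optional

# Keys and defaults as listed in the configuration table in this guide.
XENNA_DEFAULTS: Dict[str, Any] = {
    "execution_mode": "streaming",
    "logging_interval": 60,
    "ignore_failures": False,
    "cpu_allocation_percentage": 0.95,
    "autoscale_interval_s": 180,
    "max_workers_per_stage": None,
}


def build_xenna_config(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the documented defaults and sanity-check them."""
    unknown = set(overrides) - set(XENNA_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")

    config = {**XENNA_DEFAULTS, **overrides}

    if config["execution_mode"] not in ("streaming", "batch"):
        raise ValueError("execution_mode must be 'streaming' or 'batch'")

    # Assumption: a fraction of 0 is treated as invalid here.
    if not 0 < config["cpu_allocation_percentage"] <= 1:
        raise ValueError("cpu_allocation_percentage must be in (0, 1]")

    limit: Optional[int] = config["max_workers_per_stage"]
    if limit is not None and limit < 1:
        raise ValueError("max_workers_per_stage must be None or >= 1")

    return config


config = build_xenna_config(execution_mode="batch", max_workers_per_stage=8)
print(config)
```

The resulting dictionary could then be passed as `XennaExecutor(config=config)`.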

For more details, refer to the official NVIDIA Cosmos-Xenna project.

`RayActorPoolExecutor`

RayActorPoolExecutor uses Ray’s ActorPool for efficient distributed processing with fine-grained resource management. This executor creates pools of Ray actors per stage, enabling better load balancing and fault tolerance through Ray’s native mechanisms. Deduplication workflows automatically use this executor for GPU-accelerated stages.

Key Features:

ActorPool-based execution: Creates dedicated actor pools per stage for optimal resource utilization
Load balancing: Uses `map_unordered` for efficient work distribution across actors
Progress tracking: Built-in tqdm progress bars for real-time visibility into task completion
RAFT support: Native integration with RAFT (RAPIDS Analytics Framework Toolbox) for GPU-accelerated clustering and nearest-neighbor operations
Head node exclusion: Optional `ignore_head_node` parameter to reserve the Ray cluster’s head node for coordination tasks only
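
The load-balancing idea behind unordered mapping can be sketched without Ray. The example below swaps in Python's standard-library `ThreadPoolExecutor` for a pool of Ray actors, so it is an analogy rather than the executor's implementation; the local `map_unordered` function and the `process` task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Iterator


def process(task: int) -> int:
    """Stand-in for a unit of stage work."""
    return task * task


def map_unordered(
    fn: Callable[[int], int], tasks: Iterable[int], num_workers: int = 4
) -> Iterator[int]:
    """Yield results in completion order rather than submission order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(fn, task) for task in tasks]
        for future in as_completed(futures):
            yield future.result()


# Completion order is not deterministic, so sort before comparing or printing.
results = sorted(map_unordered(process, range(5)))
print(results)
```

Because results are consumed as soon as any worker finishes, a slow task does not hold up results from faster ones, which keeps all workers busy.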

from nemo_curator.backends.experimental import RayActorPoolExecutor

executor = RayActorPoolExecutor(
    show_progress=True,       # Display tqdm progress bars (default: True)
    progress_interval=10.0,   # Minimum seconds between progress bar updates (default: 10.0)
    ignore_head_node=True,    # Exclude the head node from task scheduling (default: False)
)

results = pipeline.run(executor)

Configuration Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `config` | `dict \| None` | `None` | Executor-specific configuration dictionary |
| `ignore_head_node` | `bool` | `False` | Exclude head node from task scheduling |
| `show_progress` | `bool` | `True` | Display tqdm progress bars during stage execution and shuffle inserts |
| `progress_interval` | `float` | `10.0` | Minimum interval in seconds between progress bar updates |

Example: Fuzzy Deduplication

from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

workflow = FuzzyDeduplicationWorkflow(
    input_path="/data/documents",
    cache_path="/data/cache",
    output_path="/data/output",
    text_field="text",
    perform_removal=True,
    num_bands=20,
    minhashes_per_band=13,
)

# The workflow automatically uses RayActorPoolExecutor for GPU-accelerated stages
results = workflow.run()
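
To get a feel for what `num_bands` and `minhashes_per_band` control, the sketch below applies the standard MinHash LSH banding formula. It assumes the workflow follows the textbook scheme, in which two documents become a candidate pair if all hashes in at least one band match; the helper functions are hypothetical and not part of NeMo Curator.

```python
def candidate_probability(
    similarity: float, num_bands: int, minhashes_per_band: int
) -> float:
    """Probability that two documents with the given Jaccard similarity
    share at least one identical band: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity**minhashes_per_band) ** num_bands


def approximate_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Similarity at which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


NUM_BANDS = 20
MINHASHES_PER_BAND = 13

threshold = approximate_threshold(NUM_BANDS, MINHASHES_PER_BAND)
print(f"total minhashes: {NUM_BANDS * MINHASHES_PER_BAND}")
print(f"approximate similarity threshold: {threshold:.3f}")

for s in (0.6, 0.7, 0.8, 0.9):
    p = candidate_probability(s, NUM_BANDS, MINHASHES_PER_BAND)
    print(f"similarity {s:.1f} -> candidate probability {p:.3f}")
```

Under this assumption, the values in the example place the threshold at a Jaccard similarity of roughly 0.79. More bands lower the threshold (more candidate pairs), while more hashes per band raise it.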

For more details, refer to Text Deduplication.

`RayDataExecutor`

RayDataExecutor uses Ray Data, a scalable data processing library built on Ray Core. Ray Data provides a familiar DataFrame-like API for distributed data transformations. This executor is best suited for large-scale text processing tasks that benefit from Ray Data’s optimized data loading and transformation pipelines.

Key Features:

Ray Data API: Leverages Ray Data’s optimized data processing primitives
Scalable transformations: Efficient map-batch operations across distributed workers
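
The map-batch pattern itself is simple: group records into fixed-size batches and apply one function per batch instead of per record. The following plain-Python sketch illustrates the pattern only; it does not use Ray Data, and `iter_batches`, `map_batches`, and `normalize_batch` are hypothetical names chosen for this example.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(items: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def map_batches(
    fn: Callable[[List[str]], List[str]], items: Iterable[str], batch_size: int
) -> List[str]:
    """Apply fn to each batch and concatenate the per-batch outputs."""
    output: List[str] = []
    for batch in iter_batches(items, batch_size):
        output.extend(fn(batch))
    return output


def normalize_batch(batch: List[str]) -> List[str]:
    """Example batch transform: trim whitespace and lowercase."""
    return [text.strip().lower() for text in batch]


cleaned = map_batches(normalize_batch, [" Hello", "WORLD ", "Ray"], batch_size=2)
print(cleaned)
```

Processing in batches amortizes per-call overhead, which matters most when the transform is vectorized or runs on a GPU.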

from nemo_curator.backends.ray_data import RayDataExecutor

executor = RayDataExecutor(
    config={"ignore_failures": False},
    ignore_head_node=True,  # Exclude head node from computation
)
results = pipeline.run(executor)

Constructor Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `config` | `dict` | `{}` | Configuration dictionary for Ray Data execution (see config keys below) |
| `ignore_head_node` | `bool` | `False` | Exclude the Ray cluster’s head node from execution |

Config Dictionary Keys (passed via config={...}):

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `ignore_failures` | `bool` | `False` | If `True`, continue pipeline execution even when tasks fail |

Per-Stage Runtime Environments

All three backends support per-stage runtime environments, which allow individual stages to declare isolated Python dependencies. When a stage sets a `runtime_env`, the backend forwards it to Ray so that each stage’s workers run in a dedicated virtualenv. This enables pipelines where stages require incompatible library versions.

See the Per-Stage Runtime Environments reference for configuration details and examples.
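
As a rough illustration of why isolation matters, the sketch below describes two hypothetical stages whose dependencies conflict. It assumes the `runtime_env` value follows Ray's runtime environment dictionary format (`pip` and `env_vars` keys); the package pins, variable names, and the `conflicting_pins` helper are made up for this example, and how a stage attaches its `runtime_env` is covered in the reference above rather than here.

```python
from typing import Dict, Tuple

# Hypothetical per-stage environments in Ray's runtime_env dictionary format.
tokenizer_env = {"pip": ["transformers==4.40.0"]}
legacy_model_env = {
    "pip": ["transformers==4.30.0"],
    "env_vars": {"TOKENIZERS_PARALLELISM": "false"},
}


def pinned_versions(runtime_env: dict) -> Dict[str, str]:
    """Extract 'name==version' pins from a runtime_env's pip list."""
    pins: Dict[str, str] = {}
    for requirement in runtime_env.get("pip", []):
        name, separator, version = requirement.partition("==")
        if separator:
            pins[name.strip().lower()] = version.strip()
    return pins


def conflicting_pins(env_a: dict, env_b: dict) -> Dict[str, Tuple[str, str]]:
    """Return packages that the two environments pin to different versions."""
    pins_a, pins_b = pinned_versions(env_a), pinned_versions(env_b)
    return {
        name: (pins_a[name], pins_b[name])
        for name in pins_a.keys() & pins_b.keys()
        if pins_a[name] != pins_b[name]
    }


conflicts = conflicting_pins(tokenizer_env, legacy_model_env)
print(conflicts)
```

A non-empty result means the two stages cannot share one Python environment, which is the situation per-stage runtime environments are designed to handle.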

Ray Executors in Practice

Ray-based executors scale large data processing workloads across multi-GPU systems and distributed clusters. They are particularly useful for:

Large-scale classification tasks: Distributed inference across multi-GPU setups
Deduplication workflows: Parallel processing of document similarity computations
Resource-intensive stages: Automatic scaling based on computational demands

Choosing a Backend

All executors can deliver strong performance; choose based on your workload requirements:

XennaExecutor: Default for most workloads due to maturity and extensive real-world usage (including video pipelines); supports streaming and batch execution with auto-scaling.
RayActorPoolExecutor: Automatically used for deduplication workflows; provides GPU-accelerated processing with RAFT integration.
RayDataExecutor: Best for batch data transformations using Ray Data’s DataFrame-like API.
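
Because the pipeline definition is independent of the executor, the backend can be selected by name at run time, for example from a command-line flag. The following is a minimal sketch, not a NeMo Curator API: `EXECUTOR_PATHS` and `make_executor` are hypothetical, the short names are arbitrary, and the import paths are the ones shown in this guide. Imports are resolved lazily so that only the chosen backend is loaded.

```python
import importlib
from typing import Any, Dict, Tuple

# Short name -> (module path, class name), using the import paths from this guide.
EXECUTOR_PATHS: Dict[str, Tuple[str, str]] = {
    "xenna": ("nemo_curator.backends.xenna", "XennaExecutor"),
    "ray_actor_pool": ("nemo_curator.backends.experimental", "RayActorPoolExecutor"),
    "ray_data": ("nemo_curator.backends.ray_data", "RayDataExecutor"),
}


def make_executor(name: str, **kwargs: Any) -> Any:
    """Import the requested executor class on demand and instantiate it."""
    try:
        module_path, class_name = EXECUTOR_PATHS[name]
    except KeyError:
        raise ValueError(
            f"Unknown executor {name!r}; expected one of {sorted(EXECUTOR_PATHS)}"
        ) from None
    module = importlib.import_module(module_path)
    executor_class = getattr(module, class_name)
    return executor_class(**kwargs)


# Example usage (requires NeMo Curator to be installed):
# executor = make_executor("xenna", config={"execution_mode": "streaming"})
# results = pipeline.run(executor)
```

Keeping the mapping in one place means switching backends is a one-word change, and an unsupported name fails early with a clear message.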

Minimal End-to-End Example

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor

# Build your pipeline
pipeline = Pipeline(name="curator_pipeline")
# pipeline.add_stage(stage1)
# pipeline.add_stage(stage2)

# Run with Xenna (recommended)
executor = XennaExecutor(config={"execution_mode": "streaming"})
results = pipeline.run(executor)

print(f"Completed with {len(results) if results else 0} output tasks")
