Pipeline Execution Backends
Configure and optimize execution backends to run NeMo Curator pipelines efficiently across single machines, multi-GPU systems, and distributed clusters.
Overview
Execution backends (executors) are the engines that run NeMo Curator Pipeline workflows across your compute resources. They handle:
- Task Distribution: Distribute pipeline stages across available workers and GPUs
- Resource Management: Allocate CPU, GPU, and memory resources to processing tasks
- Scaling: Automatically or manually scale processing based on workload
- Data Movement: Optimize data transfer between pipeline stages
Choosing the right executor impacts:
- Pipeline performance and throughput
- Resource utilization efficiency
- Ease of deployment and monitoring
This guide covers all execution backends available in NeMo Curator and applies to all modalities: text, image, video, and audio curation.
Basic Usage Pattern
All pipelines follow this standard execution pattern:
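The sketch below illustrates the pattern in plain Python with minimal stand-in classes. The real `Pipeline`, stage, and executor APIs live in NeMo Curator; the class names and signatures here are assumptions for illustration only.

```python
# Illustrative sketch only — Pipeline, stages, and executors below are
# minimal stand-ins for the NeMo Curator APIs, not the real classes.

class Pipeline:
    """Holds an ordered list of stages; knows nothing about execution."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, executor, data):
        # The executor decides *how* the stages run (locally, on Ray, etc.).
        return executor.execute(self.stages, data)

class LocalExecutor:
    """Runs stages sequentially in-process."""
    def execute(self, stages, data):
        for stage in stages:
            data = [stage(item) for item in data]
        return data

class MockDistributedExecutor:
    """Stand-in for a distributed executor: same interface, different engine."""
    def execute(self, stages, data):
        # A real executor would distribute this work across workers/GPUs.
        for stage in stages:
            data = [stage(item) for item in data]
        return data

# The same pipeline definition works with any executor:
pipeline = Pipeline(stages=[str.strip, str.lower])
print(pipeline.run(LocalExecutor(), ["  Hello ", " WORLD"]))        # ['hello', 'world']
print(pipeline.run(MockDistributedExecutor(), ["  Hello ", " WORLD"]))
```

Because the pipeline only hands its stages to `executor.execute`, swapping executors is a one-line change at the call site.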
Key points:
- The same pipeline definition works with any executor
- Executor choice is independent of pipeline stages
- Switch executors without changing pipeline code
Available Backends
XennaExecutor (recommended)
XennaExecutor is the production-ready executor that uses Cosmos-Xenna, a Ray-based execution engine optimized for distributed data processing. Xenna provides native streaming support, automatic resource scaling, and built-in fault tolerance. It’s the recommended choice for most production workloads, especially for video and multimodal pipelines.
Key Features:
- Streaming execution: Process data incrementally as it arrives, reducing memory requirements
- Auto-scaling: Dynamically adjusts worker allocation based on stage throughput
- Fault tolerance: Built-in error handling and recovery mechanisms
- Resource optimization: Efficient CPU and GPU allocation for video/multimodal workloads
Configuration Parameters:
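As a rough illustration of the kinds of knobs a streaming executor exposes, the snippet below builds a configuration dictionary. The parameter names (`execution_mode`, `logging_interval`, `ignore_failures`) are assumptions for illustration, not the actual XennaExecutor signature; consult the NeMo Curator API reference for the real parameters.

```python
# Hypothetical configuration sketch — these parameter names are assumptions
# for illustration; check the NeMo Curator API reference for the real ones.
xenna_config = {
    "execution_mode": "streaming",   # process tasks as they arrive, or "batch"
    "logging_interval": 60,          # seconds between progress log lines
    "ignore_failures": False,        # whether to continue past failed tasks
}

# executor = XennaExecutor(**xenna_config)  # illustrative only
print(xenna_config)
```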
For more details, refer to the official NVIDIA Cosmos-Xenna project.
RayDataExecutor
RayDataExecutor uses Ray Data, a scalable data processing library built on Ray Core. Ray Data provides a familiar DataFrame-like API for distributed data transformations. This executor is experimental and best suited for large-scale batch processing tasks that benefit from Ray Data’s optimized data loading and transformation pipelines.
Key Features:
- Ray Data API: Leverages Ray Data’s optimized data processing primitives
- Scalable transformations: Efficient map-batch operations across distributed workers
- Experimental status: API and performance characteristics may change
RayDataExecutor currently has limited configuration options. For more control over execution, consider using XennaExecutor or RayActorPoolExecutor.
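Ray Data's core primitive is the batch-wise transform (`ray.data.Dataset.map_batches`). The pure-Python imitation below shows the shape of that computation without requiring a Ray installation; in the real API, each batch would be dispatched to a distributed worker rather than processed in a local loop.

```python
# Pure-Python imitation of the map-batches pattern that Ray Data
# (ray.data.Dataset.map_batches) applies across distributed workers.

def map_batches(items, fn, batch_size):
    """Apply fn to fixed-size batches; a distributed executor would
    run each batch on a separate worker instead of this local loop."""
    out = []
    for i in range(0, len(items), batch_size):
        out.extend(fn(items[i:i + batch_size]))
    return out

def normalize(batch):
    # Example per-batch transform: strip and lowercase each document.
    return [doc.strip().lower() for doc in batch]

docs = ["  Alpha ", "BETA", " Gamma  ", "delta"]
print(map_batches(docs, normalize, batch_size=2))
# ['alpha', 'beta', 'gamma', 'delta']
```

Batching amortizes per-task overhead (serialization, scheduling) across many records, which is why batch size is a key tuning parameter for this style of executor.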
RayActorPoolExecutor
RayActorPoolExecutor uses pools of Ray actors (long-lived, stateful workers) to support custom distributed processing patterns, such as deduplication, where stages need explicit control over worker state and task assignment.
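To make the actor-pool pattern concrete, the sketch below applies it to exact deduplication: a pool of workers computes document hashes in parallel, then duplicates are dropped by keeping the first document per hash. A `ThreadPoolExecutor` stands in for a Ray actor pool here; this is an illustrative sketch, not NeMo Curator's deduplication implementation.

```python
# Sketch of the actor-pool pattern applied to exact deduplication.
# A Ray actor pool holds long-lived stateful workers; a thread pool
# stands in for illustration.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def doc_hash(doc):
    # Normalize before hashing so trivially different copies collapse.
    return hashlib.md5(doc.strip().lower().encode()).hexdigest()

def deduplicate(docs, workers=4):
    # Fan the hashing work out across the pool...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hashes = list(pool.map(doc_hash, docs))
    # ...then keep the first document seen for each hash.
    seen, unique = set(), []
    for doc, h in zip(docs, hashes):
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = ["Hello world", "hello world  ", "Something else"]
print(deduplicate(docs))  # ['Hello world', 'Something else']
```

The hashing step is embarrassingly parallel, which is what makes it a good fit for a worker pool; only the final first-occurrence filter runs serially.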
Ray Executors in Practice
Ray-based executors provide enhanced scalability and performance for large-scale data processing tasks. They’re beneficial for:
- Large-scale classification tasks: Distributed inference across multi-GPU setups
- Deduplication workflows: Parallel processing of document similarity computations
- Resource-intensive stages: Automatic scaling based on computational demands
When to Use Ray Executors
Consider Ray executors when:
- Processing datasets that exceed single-machine capacity
- Running GPU-intensive stages (classifiers, embedding models, etc.)
- Needing automatic fault tolerance and recovery
- Scaling across multi-node clusters
Ray vs. Xenna Executors
Recommendation: Use XennaExecutor for production workloads and Ray executors for experimental large-scale processing.
Ray executors emit an experimental warning as the API and performance characteristics may change.
Choosing a Backend
Both options can deliver strong performance; choose based on API fit and maturity:
- XennaExecutor: Default for most workloads due to maturity and extensive real-world usage (including video pipelines); supports streaming and batch execution with auto-scaling.
- Ray Executors (experimental): Use the Ray Data API for scalable data processing; the interface is still experimental and may change.