Auto-Balancing Heterogeneous Models
NeMo Curator auto-balances resources at the application level across pipeline stages to maximize throughput. This means you can focus on defining your curation logic rather than manually tuning parallelism.
The Problem: Unbalanced Pipelines
In a typical curation pipeline, stages have very different processing speeds. Consider a video pipeline with a fast stage, a slow stage, and a medium stage — all sharing a fixed GPU budget:
Without auto-balancing (3 GPUs, 1 worker per stage):
- The fast stage emits 4 tasks/s, but the slow stage can only handle 1 task/s. Jobs back up in the queue — 3 tasks accumulate per second, eventually causing memory pressure.
- The medium stage produces 2 tasks/s but is limited by the slow stage upstream, so it’s starved for work. In practice only ~1 task/s is realized.
- End-to-end throughput collapses to 1 task/s: the slow stage caps the whole pipeline, the fast stage's excess output piles up in its queue, and the downstream stages sit idle waiting for work.
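The arithmetic above can be sketched as a toy steady-state model. This is purely illustrative (not NeMo Curator code): end-to-end throughput is the minimum effective stage rate, and each queue grows at the gap between what arrives and what its stage can drain.

```python
# Toy model of a linear pipeline. Rates are tasks/sec per worker; this is an
# illustrative sketch of the arithmetic above, not NeMo Curator's API.

def pipeline_stats(rates_per_worker, workers):
    """Return (end-to-end throughput, queue growth in front of each downstream stage)."""
    effective = [r * w for r, w in zip(rates_per_worker, workers)]
    throughput = min(effective)  # the slowest stage caps the pipeline
    growth = []
    upstream = effective[0]  # steady-state input rate into the next stage
    for rate in effective[1:]:
        growth.append(max(0, upstream - rate))  # tasks/sec piling up in the queue
        upstream = min(upstream, rate)          # a slow stage throttles everything after it
    return throughput, growth

# 3 GPUs, 1 worker per stage: fast=4, slow=1, medium=2 tasks/s.
tput, growth = pipeline_stats([4, 1, 2], [1, 1, 1])
print(tput)    # 1 -> only 1 task/s end to end
print(growth)  # [3, 0] -> 3 tasks/s accumulate in front of the slow stage
```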
With auto-balancing (7 GPUs, scaled workers):
- The executor detects the bottleneck and scales the slow stage to 4× workers and the medium stage to 2× workers.
- Now every stage sustains 4 tasks/s throughput. Queues stay relatively clear — new jobs are picked up promptly.
- Result: a 4× throughput improvement, because the additional GPUs go to the bottleneck stages instead of being spread evenly across the pipeline.
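The worker counts in this example follow from a simple sizing rule: give each stage enough workers that its aggregate rate meets the target throughput. The sketch below shows that rule only; it is not NeMo Curator's actual allocation algorithm, which also accounts for queue depth and the available resource budget at runtime.

```python
import math

# Illustrative sizing rule (not NeMo Curator's actual algorithm): choose a
# target throughput, then give each stage ceil(target / per-worker rate)
# workers so every stage can sustain the target.

def balance_workers(rates_per_worker, target):
    return [math.ceil(target / r) for r in rates_per_worker]

# fast=4, slow=1, medium=2 tasks/s per worker; target the fast stage's rate.
workers = balance_workers([4, 1, 2], target=4)
print(workers)       # [1, 4, 2] -> 4x workers on the slow stage, 2x on the medium
print(sum(workers))  # 7 GPUs total, matching the example above
```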
How Auto-Balancing Works
The executor monitors the throughput and queue depth of each stage at runtime and uses this information to:
- Rebalance resources at regular intervals, shifting GPU/CPU allocations toward bottleneck stages.
- Apply backpressure. When a downstream stage can’t keep up, upstream stages slow their output rate rather than buffering unbounded data in memory. This reduces memory pressure and prevents spilling to disk.
- Scale workers dynamically. If a stage is falling behind, the executor allocates additional workers to that stage (within the available resource budget).
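The backpressure mechanism can be illustrated with a bounded queue between two stages: once the buffer fills, the producer's `put()` blocks, so a fast upstream stage is throttled to the downstream rate instead of buffering unbounded data. This is a minimal standalone sketch of the concept; NeMo Curator's executor handles this for you.

```python
import queue
import threading
import time

# Backpressure sketch: a fast producer feeding a slow consumer through a
# bounded queue. put() blocks when the queue is full, so memory use stays
# bounded at maxsize items no matter how fast the producer is.

buf = queue.Queue(maxsize=4)  # bounded buffer between the two stages
consumed = []

def producer():
    for i in range(10):
        buf.put(i)  # blocks once 4 items are queued -> backpressure

def consumer():
    for _ in range(10):
        item = buf.get()
        time.sleep(0.01)  # simulate a slow downstream stage
        consumed.append(item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(consumed)     # [0, 1, ..., 9] -- every task processed, in order
print(buf.qsize())  # 0 -- nothing left buffered
```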
What This Means for You
- No manual parallelism tuning. You don’t need to calculate the optimal number of workers per stage — the executor adapts at runtime.
- Predictable memory usage. Backpressure prevents unbounded buffering, so memory usage stays stable even with unbalanced stages.
- Efficient hardware utilization. Resources shift toward the current bottleneck instead of being statically allocated.
Monitoring Stage Balance
Use the Ray Dashboard to monitor how the executor is balancing your pipeline. If you notice a persistent bottleneck that auto-balancing can’t resolve (for example, a stage that needs more GPU memory than is available), consider splitting the pipeline or scaling your cluster.