Different inference stages have different compute requirements. NeMo Curator uses Ray streaming to increase GPU utilization and processing speed compared to traditional batch-all-at-once approaches.
In batch mode, each stage processes the entire dataset before the next stage begins. Stages with different compute requirements (CPU-only tokenization, single-GPU classifiers, multi-GPU encoders) run one after another, so the hardware needed by every other stage sits idle while the current stage works through the dataset.
In streaming mode, data flows through the pipeline as discrete batches. Each stage processes its current batch and immediately passes it downstream, so all stages run concurrently on different batches.
Streaming with heterogeneous compute lets NeMo Curator overlap stages that use different resources. For example, while a GPU inference stage processes batch N, a CPU tokenization stage can process batch N+1 at the same time; neither blocks the other.
This overlap improves throughput in pipelines that mix CPU and GPU work, because both happen in parallel rather than taking turns.
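To make the overlap concrete, the following back-of-the-envelope model compares the two modes. It is an illustration only, not Curator code: the function names, the per-batch stage times, and the assumption of one worker per stage with unbounded buffers between stages are all hypothetical.

```python
def batch_mode_time(stage_times, num_batches):
    """Each stage finishes every batch before the next stage starts."""
    return sum(t * num_batches for t in stage_times)


def streaming_time(stage_times, num_batches):
    """Stages overlap: a batch enters a stage as soon as it has left the
    previous stage and the stage has finished the preceding batch."""
    finish = [0] * len(stage_times)  # finish time of the latest batch per stage
    for _ in range(num_batches):
        upstream_done = 0  # when this batch left the previous stage
        for s, t in enumerate(stage_times):
            start = max(upstream_done, finish[s])
            finish[s] = start + t
            upstream_done = finish[s]
    return finish[-1]


# Hypothetical per-batch costs in seconds:
# CPU tokenization, GPU inference, CPU post-processing.
stage_times = [1, 4, 2]

print(batch_mode_time(stage_times, num_batches=10))  # 70
print(streaming_time(stage_times, num_batches=10))   # 43
```

In this model the streaming run is bounded by the slowest stage (4 s per batch) rather than by the sum of all stages (7 s per batch), which is why the gap grows with the number of batches.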
Combined with auto-balancing, streaming lets Curator reallocate resources across stages so that GPU stage workers stay busy more than 99% of the time after an initial warm-up period.
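The idea behind balancing can be sketched with a simple greedy rule: keep giving the next free worker to whichever stage is currently the bottleneck. This is a toy model under stated assumptions, not Curator's actual algorithm; `balance_workers`, the stage names, and the timings are hypothetical, and workers are treated as interchangeable even though real CPU and GPU workers are not.

```python
def balance_workers(stage_times, total_workers):
    """Greedily assign workers so that per-stage throughput evens out.

    stage_times maps a stage name to its per-batch cost with one worker.
    """
    if total_workers < len(stage_times):
        raise ValueError("need at least one worker per stage")

    workers = {name: 1 for name in stage_times}  # every stage gets one worker
    for _ in range(total_workers - len(stage_times)):
        # The bottleneck is the stage with the highest effective time per batch.
        bottleneck = max(workers, key=lambda name: stage_times[name] / workers[name])
        workers[bottleneck] += 1
    return workers


allocation = balance_workers({"tokenize": 1, "classify": 4, "write": 2}, total_workers=7)
print(allocation)  # {'tokenize': 1, 'classify': 4, 'write': 2}
```

With this allocation every stage has an effective cost of one second per batch, so no stage waits on another once the pipeline has filled.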
NeMo Curator supports streaming with multiple executors (Cosmos Xenna, Ray Data, and others), each optimized for different workload patterns. The executor handles scheduling, backpressure, and resource allocation so that streaming “just works” regardless of how many stages your pipeline has.
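As a rough mental model of what an executor does, the sketch below runs each stage in its own thread and connects the stages with bounded queues. A full queue blocks the upstream stage, which is the simplest form of backpressure. It uses only the Python standard library and does not reflect how Cosmos Xenna or Ray Data are implemented; `stream_pipeline`, `run_stage`, and `max_pending` are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def run_stage(fn, inbox, outbox):
    """Pull batches from inbox, apply fn, and push the results downstream."""
    while True:
        batch = inbox.get()
        if batch is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(fn(batch))  # blocks while outbox is full (backpressure)


def stream_pipeline(batches, stage_fns, max_pending=2):
    """Run stage_fns concurrently, one thread per stage, over the batches."""
    queues = [queue.Queue(maxsize=max_pending) for _ in range(len(stage_fns) + 1)]
    for i, fn in enumerate(stage_fns):
        threading.Thread(
            target=run_stage, args=(fn, queues[i], queues[i + 1]), daemon=True
        ).start()

    def feed():
        for batch in batches:
            queues[0].put(batch)
        queues[0].put(_DONE)

    # Feed from a separate thread so the caller can drain results meanwhile.
    threading.Thread(target=feed, daemon=True).start()

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            return results
        results.append(out)


def tokenize(batch):
    return [text.split() for text in batch]


def count_tokens(batch):
    return [len(tokens) for tokens in batch]


print(stream_pipeline([["a b", "c"], ["d e f"]], [tokenize, count_tokens]))
# [[2, 1], [3]]
```

Because each stage here has a single worker and the queues are first-in, first-out, batches come out in the order they went in. A real executor additionally decides how many workers each stage gets and where they run.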
Batch size controls the trade-off between memory usage and throughput: