Streaming | NeMo Curator

Different inference stages have different compute requirements. NeMo Curator uses Ray streaming to increase GPU utilization and processing speed compared to traditional batch-all-at-once approaches.

Batch Mode vs. Streaming Mode

Batch Mode

In batch mode, each stage processes the entire dataset before the next stage begins. Stages with different compute requirements (CPU-only tokenization, single-GPU classifiers, multi-GPU encoders) all run sequentially:

Aspect	Batch Mode
Execution	One stage at a time across the full dataset
Memory usage	Proportional to dataset size
GPU utilization	Low — GPUs idle while CPU stages run, and vice versa
Time to first output	After the entire pipeline finishes

Streaming Mode

In streaming mode, data flows through the pipeline as discrete batches. Each stage processes its current batch and immediately passes it downstream, so all stages run concurrently on different batches:

Aspect	Streaming Mode
Execution	All stages active simultaneously on different batches
Memory usage	Constant (proportional to batch size, not dataset size)
GPU utilization	High — stages with different hardware needs overlap
Time to first output	After the first batch completes the pipeline

Why Streaming Is Faster

Streaming with heterogeneous compute allows NeMo Curator to overlap stages that use different resources. For example, while a GPU inference stage processes batch N, a CPU tokenization stage can process batch N+1 simultaneously — neither blocks the other.

This overlap improves throughput in pipelines that mix CPU and GPU work, because both happen in parallel rather than taking turns.

Combined with auto-balancing, streaming enables Curator to rearrange resources so that GPU stage workers are kept busy over 99% of the time after an initial warm-up period.

Heterogeneous Executors

NeMo Curator supports streaming with multiple executors — Cosmos Xenna, Ray Data, and others — each optimized for different workload patterns. The executor handles scheduling, backpressure, and resource allocation so that streaming “just works” regardless of how many stages your pipeline has.

Configuring Batch Size

Batch size controls the trade-off between memory usage and throughput:

1 from nemo_curator.stages.resources import Resources
2 
3 # Configure batch size on a stage
4 word_count_stage = WordCountStage().with_(batch_size=128)

Smaller batches: Lower memory usage per batch. Ray may handle smaller batches more efficiently in some workloads.
Larger batches: More in-memory data per batch, which can reduce I/O overhead but uses more memory.

When Streaming Matters Most

Datasets exceed memory. A 10 TB Common Crawl snapshot cannot fit in RAM, but it can be streamed in manageable chunks.
Pipeline stages have different hardware needs. CPU-only and GPU stages overlap instead of taking turns.
You need early results. Inspect output from the first batch while the rest of the dataset is still processing.