
Streaming vs. Batch Processing


Different pipeline stages have different compute requirements. NeMo Curator uses Ray-based streaming execution to increase GPU utilization and processing speed compared to a traditional all-at-once batch approach.

Batch Mode vs. Streaming Mode

Batch Mode

In batch mode, each stage processes the entire dataset before the next stage begins. Stages with different compute requirements (CPU-only tokenization, single-GPU classifiers, multi-GPU encoders) all run sequentially:

| Aspect | Batch Mode |
| --- | --- |
| Execution | One stage at a time across the full dataset |
| Memory usage | Proportional to dataset size |
| GPU utilization | Low — GPUs idle while CPU stages run, and vice versa |
| Time to first output | After the entire pipeline finishes |

Streaming Mode

In streaming mode, data flows through the pipeline as discrete batches. Each stage processes its current batch and immediately passes it downstream, so all stages run concurrently on different batches:

| Aspect | Streaming Mode |
| --- | --- |
| Execution | All stages active simultaneously on different batches |
| Memory usage | Constant (proportional to batch size, not dataset size) |
| GPU utilization | High — stages with different hardware needs overlap |
| Time to first output | After the first batch completes the pipeline |
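The contrast between the two tables can be made concrete with a back-of-the-envelope calculation. The per-batch stage times below are illustrative assumptions, not benchmarks:

```python
# Hypothetical per-batch stage times (seconds) for a three-stage pipeline,
# e.g. tokenize (CPU), classify (GPU), write (I/O)
stage_times = [1.0, 2.0, 0.5]
n_batches = 100

# Batch mode: every stage scans the full dataset before the next one starts,
# and nothing is emitted until the final stage finishes.
batch_total = sum(t * n_batches for t in stage_times)                  # 350.0 s
batch_first_output = batch_total                                       # 350.0 s

# Streaming mode: the first batch must traverse every stage, after which
# throughput is limited only by the slowest (bottleneck) stage.
stream_first_output = sum(stage_times)                                 # 3.5 s
stream_total = sum(stage_times) + (n_batches - 1) * max(stage_times)   # 201.5 s
```

The gap widens as stage times become more balanced: if every stage took the same time, streaming a three-stage pipeline would approach a 3x speedup over batch mode.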

Why Streaming Is Faster

Streaming with heterogeneous compute allows NeMo Curator to overlap stages that use different resources. For example, while a GPU inference stage processes batch N, a CPU tokenization stage can process batch N+1 simultaneously — neither blocks the other.

This overlap improves throughput in pipelines that mix CPU and GPU work, because both happen in parallel rather than taking turns.
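The overlap can be sketched with a toy pipeline built from plain threads and queues. This uses no NeMo Curator APIs; the stage functions and names are illustrative stand-ins:

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream

def run_pipeline(batches, stage_fns):
    """Chain stages with queues so each stage works on a different batch concurrently."""
    qs = [queue.Queue() for _ in range(len(stage_fns) + 1)]

    def worker(fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is SENTINEL:     # propagate shutdown downstream
                q_out.put(SENTINEL)
                return
            q_out.put(fn(item))      # process one batch, pass it on immediately

    threads = [
        threading.Thread(target=worker, args=(fn, qs[i], qs[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for t in threads:
        t.start()
    for batch in batches:            # feed batches into the first stage
        qs[0].put(batch)
    qs[0].put(SENTINEL)

    results = []
    while (item := qs[-1].get()) is not SENTINEL:
        results.append(item)
    for t in threads:
        t.join()
    return results

# Stand-ins for a CPU tokenization stage and a GPU classification stage
tokenize = lambda batch: [len(doc.split()) for doc in batch]
classify = lambda counts: [c > 2 for c in counts]

out = run_pipeline([["a b c", "d"], ["e f g h"]], [tokenize, classify])
```

While `classify` works on batch N, `tokenize` is already free to pull batch N+1 from its input queue, which is the overlap described above in miniature.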

Combined with auto-balancing, streaming lets NeMo Curator shift workers between stages at runtime so that GPU stage workers stay busy more than 99% of the time after an initial warm-up period.

Heterogeneous Executors

NeMo Curator supports streaming with multiple executors — Cosmos Xenna, Ray Data, and others — each optimized for different workload patterns. The executor handles scheduling, backpressure, and resource allocation so that streaming “just works” regardless of how many stages your pipeline has.

Configuring Batch Size

Batch size controls the trade-off between memory usage and throughput:

```python
from nemo_curator.stages.resources import Resources

# Configure batch size on a stage
word_count_stage = WordCountStage().with_(batch_size=128)
```

- Smaller batches: Lower memory usage per batch. Ray may handle smaller batches more efficiently in some workloads.
- Larger batches: More in-memory data per batch, which can reduce I/O overhead but uses more memory.
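As a rule of thumb, memory scales with batch size rather than dataset size. A minimal sketch of fixed-size batching (the `batches` helper here is illustrative, not a NeMo Curator API):

```python
def batches(docs, batch_size):
    """Yield fixed-size batches lazily, so only one batch is sliced out at a time."""
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]

dataset = [f"doc-{i}" for i in range(1000)]
sizes = [len(b) for b in batches(dataset, batch_size=128)]
# 1000 documents at batch_size=128 -> 7 full batches plus a final batch of 104
```

Doubling `batch_size` halves the number of batches (and per-batch scheduling overhead) at the cost of doubling the memory each stage holds at once.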

When Streaming Matters Most

- Datasets exceed memory. A 10 TB Common Crawl snapshot cannot fit in RAM, but it can be streamed in manageable chunks.
- Pipeline stages have different hardware needs. CPU-only and GPU stages overlap instead of taking turns.
- You need early results. Inspect output from the first batch while the rest of the dataset is still processing.
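For the first point, a streaming reader never materializes the whole file. A minimal sketch with a generator (the one-document-per-line layout and chunk size are assumptions for illustration):

```python
import os
import tempfile

def stream_doc_chunks(path, chunk_docs=1000):
    """Yield documents in bounded-size chunks; RAM usage is independent of file size."""
    chunk = []
    with open(path) as f:
        for line in f:               # the file is read lazily, line by line
            chunk.append(line.rstrip("\n"))
            if len(chunk) == chunk_docs:
                yield chunk
                chunk = []
    if chunk:                        # flush the final partial chunk
        yield chunk

# Demo: write 2500 one-line "documents", then stream them back in chunks of 1000
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(f"doc-{i}" for i in range(2500)) + "\n")
chunk_sizes = [len(c) for c in stream_doc_chunks(tmp.name, chunk_docs=1000)]
os.remove(tmp.name)
# chunk_sizes == [1000, 1000, 500]
```

The same pattern scales from this toy file to a multi-terabyte snapshot: peak memory is bounded by `chunk_docs`, not by the size of the input.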