
> How NeMo Curator's streaming architecture processes data in constant memory

# Streaming vs. Batch Processing

Different inference stages have different compute requirements. NeMo Curator uses Ray streaming to increase GPU utilization and processing speed compared to traditional batch-all-at-once approaches.

## Batch Mode vs. Streaming Mode

### Batch Mode

In batch mode, each stage processes the **entire dataset** before the next stage begins. Stages with different compute requirements (CPU-only tokenization, single-GPU classifiers, multi-GPU encoders) all run sequentially:

| Aspect                   | Batch Mode                                           |
| ------------------------ | ---------------------------------------------------- |
| **Execution**            | One stage at a time across the full dataset          |
| **Memory usage**         | Proportional to dataset size                         |
| **GPU utilization**      | Low — GPUs idle while CPU stages run, and vice versa |
| **Time to first output** | After the entire pipeline finishes                   |

### Streaming Mode

In streaming mode, data flows through the pipeline as discrete batches. Each stage processes its current batch and immediately passes it downstream, so **all stages run concurrently** on different batches:

| Aspect                   | Streaming Mode                                          |
| ------------------------ | ------------------------------------------------------- |
| **Execution**            | All stages active simultaneously on different batches   |
| **Memory usage**         | Constant (proportional to batch size, not dataset size) |
| **GPU utilization**      | High — stages with different hardware needs overlap     |
| **Time to first output** | After the first batch completes the pipeline            |
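The difference between the two modes can be sketched in plain Python: batch mode materializes every intermediate result for the whole dataset, while streaming chains stages so only one batch is in flight at a time. The stage functions below are illustrative stand-ins, not NeMo Curator APIs:

```python
def tokenize(batch):     # stands in for a CPU-only stage
    return [text.split() for text in batch]

def classify(batch):     # stands in for a GPU inference stage
    return [len(tokens) for tokens in batch]

# Dataset pre-split into batches.
data = [["a b", "c d e"], ["f", "g h"]]

# Batch mode: each stage finishes over the WHOLE dataset before the next
# starts, so every intermediate result is held in memory at once.
tokenized_all = [tokenize(b) for b in data]
batch_results = [classify(b) for b in tokenized_all]

# Streaming mode: each batch flows through all stages immediately, so
# memory stays proportional to one batch, not the dataset.
stream_results = [classify(tokenize(b)) for b in data]

assert batch_results == stream_results  # same output, different memory profile
```

Both modes produce identical results; they differ only in how much intermediate data is alive at once and when the first output becomes available.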

## Why Streaming Is Faster

Streaming with heterogeneous compute allows NeMo Curator to overlap stages that use different resources. For example, while a GPU inference stage processes batch N, a CPU tokenization stage can process batch N+1 simultaneously — neither blocks the other.

This overlap improves throughput in pipelines that mix CPU and GPU work, because both happen in parallel rather than taking turns.
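A minimal sketch of this overlap, using Python threads and a bounded queue as stand-ins for Curator's Ray-based scheduling and backpressure (the stage functions are hypothetical):

```python
import queue
import threading

def cpu_stage(batch):   # e.g., tokenization (illustrative)
    return [x * 2 for x in batch]

def gpu_stage(batch):   # e.g., model inference (illustrative)
    return [x + 1 for x in batch]

batches = [[1, 2], [3, 4], [5, 6]]
handoff = queue.Queue(maxsize=1)  # bounded queue provides backpressure
results = []

def producer():
    # The CPU stage starts on batch N+1 as soon as batch N is handed off.
    for batch in batches:
        handoff.put(cpu_stage(batch))
    handoff.put(None)  # sentinel: no more batches

def consumer():
    # The GPU stage runs concurrently, pulling whichever batch is ready.
    while (batch := handoff.get()) is not None:
        results.append(gpu_stage(batch))

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
# results → [[3, 5], [7, 9], [11, 13]]
```

While the consumer is busy with one batch, the producer is already preparing the next — the same overlap that lets a CPU stage and a GPU stage run simultaneously in a Curator pipeline.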

Combined with auto-balancing, streaming lets Curator reallocate workers across stages so that GPU-stage workers are **kept busy over 99% of the time** after an initial warm-up period.

## Heterogeneous Executors

NeMo Curator supports streaming with multiple executors — Cosmos Xenna, Ray Data, and others — each optimized for different workload patterns. The executor handles scheduling, backpressure, and resource allocation so that streaming "just works" regardless of how many stages your pipeline has.

## Configuring Batch Size

Batch size controls the trade-off between memory usage and throughput:

```python
# Configure batch size on a stage.
# WordCountStage is a user-defined stage used here for illustration.
word_count_stage = WordCountStage().with_(batch_size=128)
```

* **Smaller batches**: Lower memory usage per batch, and finer-grained units of work for the scheduler to balance across stages.
* **Larger batches**: More in-memory data per batch, which can reduce I/O overhead but uses more memory.
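To get a rough feel for the trade-off, you can estimate the in-flight memory footprint from an average record size. The numbers below are illustrative assumptions, not Curator measurements:

```python
AVG_RECORD_BYTES = 2_048  # assumed average document size
NUM_STAGES = 4            # batches in flight ≈ one per concurrent stage

def batch_memory_mib(batch_size: int) -> float:
    """Approximate in-flight memory for a streaming pipeline, in MiB."""
    return batch_size * AVG_RECORD_BYTES * NUM_STAGES / (1024 ** 2)

print(batch_memory_mib(128))    # → 1.0 MiB in flight
print(batch_memory_mib(4096))   # → 32.0 MiB in flight
```

Because streaming keeps only a few batches alive at once, even a large jump in batch size changes memory by megabytes rather than scaling with the full dataset.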

## When Streaming Matters Most

* **Datasets exceed memory.** A 10 TB Common Crawl snapshot cannot fit in RAM, but it can be streamed in manageable chunks.
* **Pipeline stages have different hardware needs.** CPU-only and GPU stages overlap instead of taking turns.
* **You need early results.** Inspect output from the first batch while the rest of the dataset is still processing.