NeMo Curator is designed to scale from a single GPU to multi-node clusters with near-linear performance gains. This guide covers why data curation is a throughput problem, how Curator solves it, and the key levers for maximizing performance.
Data curation pipelines process large numbers of samples. The goal is to minimize total runtime across all samples — not just the latency of a single sample. This means throughput (samples processed per unit time) matters more than per-sample latency.
Consider a pipeline that processes 1,000 questions through three stages on a single GPU (102 GB memory):
Naive sequential approach: Process each question through all three stages one at a time: 1,000 × (2 + 1 + 10) = 13,000 seconds.
This has three problems:
NeMo Curator’s approach: Stream batches through the pipeline, auto-scale replicas per stage based on throughput, and overlap CPU/GPU work:
By running 10 replicas of the bottleneck stage (answer model, using 10 × 10 GB = 100 GB) and 2 replicas of language identification (2 × 1 GB = 2 GB), the full 102 GB GPU memory is utilized and every stage achieves the same throughput of 1 task/second. Streaming enables this by passing batches between stages concurrently — while the answer model processes batch N, language identification processes batch N+1, and tokenization runs on CPU in parallel. After an initial warm-up period, Curator rearranges resources so GPU workers are kept busy over 99% of the time. Result: ~1,000 seconds — a 13× improvement on the same hardware.
This is an illustrative example to demonstrate the principles. Actual speedups depend on your specific pipeline, hardware, and data characteristics. The key insight is that Curator’s streaming and auto-balancing automatically solve the throughput optimization problem that would otherwise require manual tuning.
Benchmarks on an 8 TB RedPajama v2 dataset (1.78 trillion tokens) demonstrate near-linear scaling:
This near-linear scaling holds because NeMo Curator partitions work across nodes with minimal cross-node communication for most pipeline stages.
The most straightforward way to increase throughput. When you add nodes to your Ray cluster, the executor automatically distributes pipeline stages across the expanded cluster.
Use RayClient for single-node setups or RaySlurmClient for multi-node SLURM clusters:
Larger batches amortize fixed costs (model loading, scheduling overhead) but use more memory. Find the largest batch size that fits in your hardware:
Ensure GPU-heavy stages request enough GPU resources, and CPU-heavy stages don’t unnecessarily block GPU workers:
If a stage is consistently slower than others, Curator’s auto-balancing will automatically assign more workers to it. You can also proactively assign more GPU resources to a stage you know will be a bottleneck.
The streaming architecture means multiple stages run concurrently on different batches. Ensure your pipeline has enough stages to keep all hardware busy — a two-stage pipeline (read → write) won’t saturate a large cluster.
Use Ray Dashboard to identify bottlenecks. Common issues: