
Scaling Up with Ray Resource Allocation


NeMo Curator makes it straightforward to allocate both CPU and GPU resources across pipeline stages. Each stage declares its own resource requirements, and the executor schedules work accordingly.

This design improves the performance of mixed pipelines: CPU-only stages no longer hold GPU resources, so GPUs stay available for the stages that actually need them.

How It Works

Every ProcessingStage can specify a Resources object that declares its CPU and GPU needs:

```python
from nemo_curator.stages.core import ProcessingStage
from nemo_curator.stages.function_definitions import processing_stage
from nemo_curator.stages.resources import Resources
from nemo_curator.tasks import DocumentBatch


class TokenizerStage(ProcessingStage[DocumentBatch, DocumentBatch]):
    name: str = "TokenizerStage"
    resources: Resources = Resources(cpus=1.0)  # CPU-only: no GPU needed

    def __init__(self):
        super().__init__()
        # ... stage logic ...


class ModelStage(ProcessingStage[DocumentBatch, DocumentBatch]):
    name: str = "ModelStage"

    def __init__(self, model_path: str):
        super().__init__()
        # ... stage logic ...


@processing_stage(name="custom_filter", resources=Resources(cpus=1))
def custom_filter_stage(task: DocumentBatch) -> DocumentBatch:
    # ... filter logic ...
    return task


model_stage = ModelStage(model_path="path/to/model").with_(resources=Resources(gpus=1))
```

When a pipeline runs, the executor reads each stage’s resource declaration and schedules tasks to satisfy those constraints. Stages that need GPUs are placed on GPU-equipped nodes; CPU-only stages can run on any available worker.
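The placement logic described above can be illustrated with a small self-contained sketch. This is plain Python, not NeMo Curator's actual executor: the `Resources`, `Worker`, and `schedule` names here are illustrative stand-ins that mimic how a scheduler can read per-stage resource declarations and place GPU stages only on GPU-equipped workers.

```python
# Toy illustration (not NeMo Curator internals): place each stage on a worker
# that satisfies its declared resource constraint.
from dataclasses import dataclass


@dataclass
class Resources:
    cpus: float = 0.0
    gpus: float = 0.0


@dataclass
class Worker:
    name: str
    has_gpu: bool


def schedule(stages: dict[str, Resources], workers: list[Worker]) -> dict[str, str]:
    """Assign each stage to the first worker that satisfies its GPU constraint."""
    placement = {}
    for stage_name, res in stages.items():
        for w in workers:
            if res.gpus > 0 and not w.has_gpu:
                continue  # GPU stage: skip CPU-only workers
            placement[stage_name] = w.name
            break
    return placement


stages = {
    "tokenizer": Resources(cpus=1.0),        # CPU-only, runs anywhere
    "model": Resources(cpus=1.0, gpus=1.0),  # must land on a GPU node
}
workers = [Worker("cpu-node", has_gpu=False), Worker("gpu-node", has_gpu=True)]
print(schedule(stages, workers))  # tokenizer -> cpu-node, model -> gpu-node
```

The key point the sketch captures: because the tokenizer declares `gpus=0.0`, it is eligible for any worker, while the model stage is restricted to GPU nodes.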

Key Concepts

CPU-Only vs. GPU Stages

The most impactful optimization is correctly separating CPU and GPU work. In a mixed pipeline, CPU-only stages (tokenization, text parsing, filtering) should not request GPU resources — this frees GPUs for inference stages that actually need them:

```python
# CPU-only: tokenization, filtering, I/O
# Runs on any worker, doesn't block GPU resources
@processing_stage(name="tokenizer", resources=Resources(cpus=1))
def tokenizer_stage(task: DocumentBatch) -> DocumentBatch:
    return task


# GPU: model inference, embeddings
# Scheduled only on GPU-equipped nodes
model_stage = ModelStage(model_path="path/to/model").with_(resources=Resources(gpus=1))
```

Fractional GPU Allocation

Some GPU stages don’t need an entire GPU. You can use fractional allocation via Resources(gpus=0.25) or reserve a specific amount of GPU memory with Resources(gpu_memory_gb=10):

```python
# 4 workers share one GPU via fractional allocation
model_stage = ModelStage(model_path="path/to/model").with_(resources=Resources(gpus=0.25))

# Or reserve a specific amount of GPU memory
model_stage = ModelStage(model_path="path/to/model").with_(resources=Resources(gpu_memory_gb=10))
```

This is useful for inference stages where the model fits in a fraction of GPU memory, allowing you to increase parallelism without requiring more hardware. Note that gpu_memory_gb reserves memory on a single GPU; it does not aggregate memory across multiple GPUs.
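The parallelism gain is simple arithmetic, sketched below in plain Python (not a NeMo Curator API). The 40 GB capacity is an assumed example value, e.g. an A100-40GB.

```python
# How many model workers fit on one GPU under each allocation scheme?


def workers_per_gpu(gpus_per_worker: float) -> int:
    # Fractional allocation: gpus=0.25 means 4 workers per physical GPU.
    return int(1.0 / gpus_per_worker)


def workers_by_memory(gpu_capacity_gb: float, gpu_memory_gb: float) -> int:
    # Memory reservation: workers fit until the GPU's memory is exhausted.
    return int(gpu_capacity_gb // gpu_memory_gb)


print(workers_per_gpu(0.25))      # 4 workers share one GPU
print(workers_by_memory(40, 10))  # 4 x 10 GB reservations on a 40 GB GPU
```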

Best Practices

  • Start with defaults. Most stages have sensible default resource declarations. Override only when you observe resource contention or underutilization.
  • Separate CPU and GPU stages. This is the single highest-impact optimization — it allows the executor to parallelize across heterogeneous hardware.
  • Profile before tuning. Use Ray Dashboard or stage performance stats to identify bottlenecks before adjusting allocations.
  • Match hardware to workload. If your pipeline is mostly CPU-bound (text filtering), you may not need GPU nodes at all.