nemo_curator.backends.utils
Module Contents
Classes
Functions
API
Bases: enum.Enum
String enum of different flags that define keys inside ray_stage_spec.
Ray remote function to execute setup_on_node for a stage.
This runs as a Ray remote task (not an actor). vLLM’s auto-detection only forces the spawn multiprocessing method inside Ray actors, not in Ray tasks. Without this override, vLLM defaults to fork in tasks and hits RuntimeError: Cannot re-initialize CUDA in forked subprocess. We explicitly set the environment variable to spawn to prevent this.
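The override described above can be sketched as follows. This is a minimal illustration, not the module's code: the source does not name the environment variable, so the sketch assumes it is vLLM's `VLLM_WORKER_MULTIPROC_METHOD`, and the helper name `force_spawn_for_vllm` is hypothetical. In a real task the call would sit at the top of the remote function body, before vLLM is imported.

```python
import os

# Assumption: vLLM reads VLLM_WORKER_MULTIPROC_METHOD to choose its
# multiprocessing start method; the source does not name the variable.
VLLM_MP_ENV_VAR = "VLLM_WORKER_MULTIPROC_METHOD"


def force_spawn_for_vllm() -> str:
    """Hypothetical helper: select the "spawn" start method and return the value set."""
    os.environ[VLLM_MP_ENV_VAR] = "spawn"
    return os.environ[VLLM_MP_ENV_VAR]
```

Setting the variable inside the task, rather than on the driver, keeps the override local to the worker process that actually runs the setup.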
Raise if the cluster doesn’t have enough GPUs to satisfy aggregate demand.
Intended as a coarse pre-check before submitting placement groups: Ray’s
PG scheduler can hang indefinitely on pg.ready() when demand exceeds
capacity, so a fast, explicit error with the actual numbers is friendlier
than waiting on a timeout.
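A pre-check of this kind might look like the sketch below. The function name `check_gpu_capacity`, its parameters, and the error type are assumptions made for illustration; the source only states that an error is raised with the actual numbers.

```python
def check_gpu_capacity(required_gpus: float, available_gpus: float) -> None:
    """Hypothetical pre-check: fail fast when GPU demand exceeds capacity."""
    if required_gpus > available_gpus:
        msg = (
            f"Insufficient GPUs: stages require {required_gpus} in total, "
            f"but the cluster reports {available_gpus}."
        )
        raise RuntimeError(msg)


# With a live cluster, the capacity could be read from Ray, for example:
#   available = ray.cluster_resources().get("GPU", 0)
check_gpu_capacity(required_gpus=2, available_gpus=4)  # passes silently
```

Including both numbers in the message lets the caller see at a glance how far demand exceeds capacity, which a scheduler timeout would not reveal.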
Execute setup_on_node for every stage on every alive Ray node.
All (stage, node) setup tasks are submitted up front and awaited with a single
ray.get, so total wall-clock time is bounded by the slowest stage rather than
the sum of per-stage times — important when setup is heavy (model downloads, weight
loads) and stages don’t contend for the same resources.
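The submit-everything-then-wait pattern can be illustrated without a Ray cluster. The sketch below swaps in `concurrent.futures.ThreadPoolExecutor` for Ray tasks, so it is an analogy rather than the module's implementation; `run_setup` and `setup_all` are hypothetical names. With Ray, the equivalent shape would be building a list of object refs from `.remote(...)` calls and passing the whole list to one `ray.get`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_setup(stage: str, node: str, seconds: float) -> tuple[str, str]:
    """Stand-in for one (stage, node) setup task; sleeps to simulate work."""
    time.sleep(seconds)
    return (stage, node)


def setup_all(stages: dict[str, float], nodes: list[str]) -> list[tuple[str, str]]:
    """Submit every (stage, node) task first, then wait for all of them together."""
    pairs = [(stage, node, secs) for stage, secs in stages.items() for node in nodes]
    if not pairs:
        return []
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        futures = [pool.submit(run_setup, s, n, secs) for s, n, secs in pairs]
        # Waiting starts only after every task is already in flight.
        return [future.result() for future in futures]
```

Because nothing is awaited until all tasks have been submitted, the total wait is governed by the slowest task instead of accumulating stage by stage.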
Get available CPU and GPU resources from Ray.
Get the worker metadata and node id from the runtime context.
Recursively merge two executor configs with deep merging of nested dicts.
Parameters:
Base configuration dictionary
Configuration to merge on top of base_config
Returns: dict
Merged configuration dictionary with all nested dicts recursively merged
Examples:
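The following is a minimal sketch of recursive dict merging, not the library's own code. The name `deep_merge`, the parameter name `override_config`, the sample keys, and the choice to copy inputs rather than mutate them are all assumptions made for illustration.

```python
import copy
from typing import Any


def deep_merge(base_config: dict[str, Any], override_config: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: merge override_config on top of base_config, recursing into dicts."""
    merged = copy.deepcopy(base_config)  # assumption: inputs are left unmodified
    for key, value in override_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides hold a dict under this key, so merge them recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and mismatched types are replaced by the override.
            merged[key] = copy.deepcopy(value)
    return merged


base = {"runtime_env": {"env_vars": {"A": "1"}}, "num_cpus": 4}
override = {"runtime_env": {"env_vars": {"B": "2"}}, "num_cpus": 8}
merged = deep_merge(base, override)
# merged == {"runtime_env": {"env_vars": {"A": "1", "B": "2"}}, "num_cpus": 8}
```

A shallow `dict.update` would have replaced the whole `runtime_env` entry and dropped `A`; recursing keeps sibling keys from the base config.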
Initialize a new local Ray cluster or connect to an existing one.