nemo_rl.distributed.virtual_cluster#
Module Contents#
Classes#
- RayVirtualCluster: Creates a virtual distributed cluster using Ray placement groups.
Functions#
- init_ray: Initialize Ray.
Data#
API#
- nemo_rl.distributed.virtual_cluster.logger#
'getLogger(…)'
- class nemo_rl.distributed.virtual_cluster.ClusterConfig[source]#
Bases:
typing.TypedDict
- gpus_per_node: int#
None
- num_nodes: int#
None
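Since ClusterConfig is a plain TypedDict, a configuration can be written as a dict literal. A minimal sketch; the values here are illustrative:

```python
from nemo_rl.distributed.virtual_cluster import ClusterConfig

# Illustrative values: a 2-node cluster with 8 GPUs per node.
cluster_config: ClusterConfig = {
    "gpus_per_node": 8,
    "num_nodes": 2,
}
```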
- nemo_rl.distributed.virtual_cluster.dir_path#
'dirname(…)'
- nemo_rl.distributed.virtual_cluster.git_root#
'abspath(…)'
- class nemo_rl.distributed.virtual_cluster.PY_EXECUTABLES[source]#
- SYSTEM#
None
- BASE#
'uv run --locked'
- VLLM#
'uv run --locked --extra vllm'
- MCORE#
'uv run --reinstall --extra mcore'
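These constants look like command prefixes for launching worker processes under different uv environments. A hedged sketch of how such a prefix might be used; the subprocess launch and the worker.py script are assumptions for illustration, not this module's API:

```python
import subprocess

from nemo_rl.distributed.virtual_cluster import PY_EXECUTABLES

# Prefix for launching a process inside the vLLM uv environment.
launcher = PY_EXECUTABLES.VLLM  # "uv run --locked --extra vllm"

# worker.py is a hypothetical script, used only to illustrate the prefix.
subprocess.run([*launcher.split(), "python", "worker.py"], check=True)
```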
- nemo_rl.distributed.virtual_cluster._get_node_ip_and_free_port() → tuple[str, int]#
- nemo_rl.distributed.virtual_cluster.init_ray(log_dir: Optional[str] = None) → None[source]#
Initialize Ray.
Try to attach to an existing local cluster. If that cluster was started with the same CUDA_VISIBLE_DEVICES or Slurm-managed tag, it is reused; otherwise, we detach and start a fresh local cluster.
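A minimal sketch of calling it; the log directory path is illustrative:

```python
from nemo_rl.distributed.virtual_cluster import init_ray

# Reuses a compatible local Ray cluster if one exists; otherwise starts fresh.
init_ray(log_dir="/tmp/ray_logs")
```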
- exception nemo_rl.distributed.virtual_cluster.ResourceInsufficientError[source]#
Bases:
Exception
Exception raised when the cluster does not have enough resources to satisfy the requested configuration.
Initialization
Initialize self. See help(type(self)) for accurate signature.
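A sketch of guarding cluster construction with this exception, assuming it propagates out of RayVirtualCluster when the requested bundles exceed what the underlying Ray cluster can provide:

```python
from nemo_rl.distributed.virtual_cluster import (
    RayVirtualCluster,
    ResourceInsufficientError,
)

try:
    # Request two nodes with 8 GPU bundles each (illustrative sizes).
    cluster = RayVirtualCluster(bundle_ct_per_node_list=[8, 8])
except ResourceInsufficientError as err:
    # Surface a clearer message, or retry with a smaller layout.
    print(f"Cluster cannot satisfy the requested layout: {err}")
```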
- class nemo_rl.distributed.virtual_cluster.RayVirtualCluster(
- bundle_ct_per_node_list: list[int],
- use_gpus: bool = True,
- max_colocated_worker_groups: int = 1,
- num_gpus_per_node: int = 8,
- name: str = '',
- placement_group_strategy: str = 'SPREAD',
)[source]#
Creates a virtual distributed cluster using Ray placement groups.
This class simplifies distributed training setup by:
- Creating placement groups that represent logical compute nodes
- Allocating GPU and CPU resources for distributed workers
- Managing communication between distributed processes

Key terms:
- Bundle: A resource allocation unit (e.g., 4 GPUs on a single node)
- Worker: A process that performs computation (model training/inference)
- Node: A physical or virtual machine containing multiple bundles
Initialization
Initialize a virtual cluster using Ray placement groups.
- Parameters:
bundle_ct_per_node_list – List specifying GPU bundles per node (e.g., [2,2] creates 2 nodes with 2 GPU bundles each)
use_gpus – Whether to allocate GPU resources
max_colocated_worker_groups – Maximum number of worker groups that can be colocated
num_gpus_per_node – Number of GPUs per node
name – Name prefix for placement groups
placement_group_strategy – Ray placement group strategy ("STRICT_PACK", "PACK", or "SPREAD")
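Putting the parameters together, a minimal end-to-end sketch; the bundle counts and the cluster name are illustrative:

```python
from nemo_rl.distributed.virtual_cluster import RayVirtualCluster, init_ray

init_ray()

# Two logical nodes with 4 GPU bundles each (sizes are illustrative).
cluster = RayVirtualCluster(
    bundle_ct_per_node_list=[4, 4],
    use_gpus=True,
    num_gpus_per_node=8,
    name="example-cluster",
    placement_group_strategy="SPREAD",
)
try:
    addr, port = cluster.get_master_address_and_port()
    print(f"Master at {addr}:{port}")
finally:
    # Idempotent cleanup of all placement groups.
    cluster.shutdown()
```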
- _init_placement_groups(
- strategy: str | None = None,
- use_unified_pg: bool = False,
)[source]#
Creates placement groups based on whether cross-node model parallelism is needed.
- Parameters:
strategy – Ray placement group strategy (defaults to self.placement_group_strategy)
use_unified_pg – If True, create a single unified placement group. If False, create per-node placement groups.
- Returns:
List of placement groups
- _create_placement_groups_internal(
- strategy: str,
- use_unified_pg: bool = False,
)[source]#
Internal method to create placement groups without retry logic.
- get_master_address_and_port() → tuple[str, int][source]#
Gets the master address and port for the distributed training setup.
- Returns:
Tuple of (address, port)
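Continuing from the construction sketch above, one plausible use is seeding torch.distributed-style rendezvous variables; this pairing is an assumption, not something the module prescribes:

```python
import os

# 'cluster' is an already-constructed RayVirtualCluster (see the sketch above).
addr, port = cluster.get_master_address_and_port()

# Seed torch.distributed-style rendezvous environment variables.
os.environ["MASTER_ADDR"] = addr
os.environ["MASTER_PORT"] = str(port)
```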
- shutdown() → bool[source]#
Cleans up and releases all resources associated with this virtual cluster.
This includes removing all placement groups and resetting the internal state.
This method is idempotent and can be safely called multiple times.
- __del__() → None[source]#
Shuts down the virtual cluster when the object is deleted or garbage collected.
This is an extra safety net in case the user forgets to call shutdown() and the reference to the cluster is lost when it falls out of a function scope. It is still recommended that the user call shutdown() explicitly.