nemo_rl.distributed.virtual_cluster#

Module Contents#

Classes#

ClusterConfig

PY_EXECUTABLES

RayVirtualCluster

Creates a virtual distributed cluster using Ray placement groups.

Functions#

Data#

API#

nemo_rl.distributed.virtual_cluster.logger#

'getLogger(…)'

class nemo_rl.distributed.virtual_cluster.ClusterConfig[source]#

Bases: typing.TypedDict

gpus_per_node: int#

None

num_nodes: int#

None
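Since ClusterConfig is a plain TypedDict, a config is just a dict checked statically. A minimal sketch using the two fields above (the values are illustrative):

```python
from typing import TypedDict

class ClusterConfig(TypedDict):
    gpus_per_node: int
    num_nodes: int

# Illustrative values: a 2-node cluster with 8 GPUs per node
config: ClusterConfig = {"gpus_per_node": 8, "num_nodes": 2}
total_gpus = config["gpus_per_node"] * config["num_nodes"]
```
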

nemo_rl.distributed.virtual_cluster.dir_path#

'dirname(…)'

nemo_rl.distributed.virtual_cluster.git_root#

'abspath(…)'

class nemo_rl.distributed.virtual_cluster.PY_EXECUTABLES[source]#
SYSTEM#

None

BASE#

'uv run --locked'

VLLM#

'uv run --locked --extra vllm'

nemo_rl.distributed.virtual_cluster._get_node_ip_and_free_port()#
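This helper is private and its implementation is not shown here; the free-port half is commonly done with the stdlib bind-to-port-0 trick. A sketch of that pattern (not the actual implementation, and the helper name below is illustrative):

```python
import socket

def find_free_port() -> int:
    # Bind to port 0 so the OS assigns an unused port,
    # then read the chosen port back before the socket closes.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = find_free_port()
```

Note the port is only reserved while the socket is bound; a small race window exists between closing the socket and the caller binding to the returned port.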
nemo_rl.distributed.virtual_cluster.init_ray(log_dir: Optional[str] = None)[source]#

Initialize Ray.

Attempts to attach to an existing local cluster first. If that cluster uses the same CUDA_VISIBLE_DEVICES or the same Slurm-managed tag, it is reused; otherwise, we detach and start a fresh local cluster.

exception nemo_rl.distributed.virtual_cluster.ResourceInsufficientError[source]#

Bases: Exception

Exception raised when the cluster does not have enough resources to satisfy the requested configuration.

Initialization

Initialize self. See help(type(self)) for accurate signature.

class nemo_rl.distributed.virtual_cluster.RayVirtualCluster(
bundle_ct_per_node_list: List[int],
use_gpus: bool = True,
max_colocated_worker_groups: int = 1,
num_gpus_per_node: int = 8,
name: str = '',
placement_group_strategy: str = 'STRICT_PACK',
)[source]#

Creates a virtual distributed cluster using Ray placement groups.

This class simplifies distributed training setup by:

  • Creating placement groups that represent logical compute nodes

  • Allocating GPU and CPU resources for distributed workers

  • Managing communication between distributed processes

Terminology:

  • Bundle: A resource-allocation unit (e.g., 4 GPUs on a single node)

  • Worker: A process that performs computation (model training/inference)

  • Node: A physical or virtual machine containing multiple bundles

Initialization

Initialize a virtual cluster using Ray placement groups.

Parameters:
  • bundle_ct_per_node_list – List specifying GPU bundles per node (e.g., [2,2] creates 2 nodes with 2 GPU bundles each)

  • use_gpus – Whether to allocate GPU resources

  • max_colocated_worker_groups – Maximum number of worker groups that can be colocated

  • num_gpus_per_node – Number of GPUs per node

  • name – Name prefix for placement groups

  • placement_group_strategy – Ray placement group strategy (“STRICT_PACK”, “PACK”, or “SPREAD”)
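The bundle list determines both the node count and the total number of worker slots. A pure-Python sketch of that arithmetic (the variable names mirror the parameters above; the derived quantities are a sketch, not part of the API):

```python
# e.g., [2, 2] describes 2 nodes with 2 GPU bundles each
bundle_ct_per_node_list = [2, 2]

# One worker slot per bundle, summed across all nodes
world_size = sum(bundle_ct_per_node_list)

# Nodes with zero bundles are filtered out of the virtual cluster
node_count = sum(1 for ct in bundle_ct_per_node_list if ct > 0)
```
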

_init_placement_groups(strategy: str)[source]#

Creates one placement group per node in the cluster; nodes without any bundles get an empty group.

Parameters:

strategy – Ray placement group strategy

Returns:

List of placement groups, one per node

get_placement_groups()[source]#

Returns a list of placement groups that have at least one bundle, filtering out empty nodes.

This represents the “virtual cluster”: only the nodes that are actually in use.

Returns:

List of placement groups that have at least one bundle

world_size()[source]#
node_count()[source]#
get_master_address_and_port()[source]#

Gets the master address and port for the distributed training setup.

Returns:

Tuple of (address, port)
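The returned (address, port) pair is typically exported as the rendezvous point for the distributed workers. A hedged sketch of that typical consumption (the environment-variable names follow the torch.distributed convention; the address and port values below are placeholders standing in for the method's return value):

```python
import os

# Placeholders standing in for cluster.get_master_address_and_port()
master_addr, master_port = "10.0.0.1", 29500

# torch.distributed-style rendezvous configuration via environment
os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = str(master_port)
```
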

shutdown()[source]#

Cleans up and releases all resources associated with this virtual cluster.

This includes removing all placement groups and resetting the internal state.

This method is idempotent and can be safely called multiple times.

_create_visualization_grid(worker_groups=None, is_global_view=False)[source]#

Create a visualization grid for the cluster with optional worker groups.

Parameters:
  • worker_groups – Single worker group, list of worker groups, or None

  • is_global_view – Whether this is a global view (multiple worker groups) or single view

Returns:

A dictionary containing the grid data for display

Return type:

dict

_get_worker_cells(
node_idx,
gpu_idx,
worker_groups,
cell_width,
is_global_view,
)[source]#

Get the worker cell content for each worker group at a specific GPU location.

Parameters:
  • node_idx – The node index

  • gpu_idx – The GPU index within the node

  • worker_groups – List of worker groups to check

  • cell_width – Width of each cell for formatting

  • is_global_view – Whether this is a global view with multiple worker groups

Returns:

List of formatted worker cells, one per worker group

Return type:

list

_print_visualization(grid_data)[source]#

Print the visualization based on the grid data.

Parameters:

grid_data – The grid data generated by _create_visualization_grid

_print_legend(grid_data)[source]#

Print the legend for the visualization.

print_cluster_grid(worker_group=None)[source]#

Prints a compact grid visualization of the virtual cluster, similar to JAX’s visualize_array_sharding.

If a worker_group is provided, it will also show worker assignments on each device.

Parameters:

worker_group – Optional RayWorkerGroup instance to visualize worker assignments

print_all_worker_groups(worker_groups=None)[source]#

Prints a visualization showing all worker groups in the cluster.

This provides a global view of all workers across all worker groups.

Parameters:

worker_groups – List of RayWorkerGroup instances to visualize. If None, no worker assignments will be shown.

__del__()[source]#

Shuts down the virtual cluster when the object is deleted or garbage collected.

This is an extra safety net in case the user forgets to call shutdown() and the reference to the cluster is lost by leaving a function scope. Explicitly calling shutdown() is always recommended.