nemo_rl.distributed.virtual_cluster#
Module Contents#
Classes#
RayVirtualCluster: Creates a virtual distributed cluster using Ray placement groups.
Functions#
init_ray: Initialize Ray.
Data#
API#
- nemo_rl.distributed.virtual_cluster.logger#
'getLogger(...)'
- class nemo_rl.distributed.virtual_cluster.ClusterConfig[source]#
Bases:
typing.TypedDict
- gpus_per_node: int#
None
- num_nodes: int#
None
- nemo_rl.distributed.virtual_cluster.dir_path#
'dirname(...)'
- nemo_rl.distributed.virtual_cluster.git_root#
'abspath(...)'
- class nemo_rl.distributed.virtual_cluster.PY_EXECUTABLES[source]#
- SYSTEM#
None
- BASE#
'uv run --locked'
- VLLM#
'uv run --locked --extra vllm'
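These constants are the command prefixes used to launch worker processes in the right virtual environment. A minimal sketch of selecting one, assuming a hypothetical `launch_prefix` helper (the dict below mirrors the constants above; it is not the library's own code):

```python
from typing import Optional

# Mirror of the PY_EXECUTABLES constants above (copied for illustration).
PY_EXECUTABLES = {
    "SYSTEM": None,  # no prefix: run with the current interpreter as-is
    "BASE": "uv run --locked",
    "VLLM": "uv run --locked --extra vllm",
}

def launch_prefix(backend: str) -> Optional[str]:
    # Hypothetical helper: pick the `uv` command prefix for a worker
    # backend; None means "use the current interpreter directly".
    return PY_EXECUTABLES[backend]
```

A vLLM generation worker, for example, would be launched under `uv run --locked --extra vllm` so that its extra dependencies are importable inside the worker process.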
- nemo_rl.distributed.virtual_cluster._get_node_ip_and_free_port()#
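The body of this private helper is not documented here; a common way to implement it (a sketch under that assumption, not the actual source) is to bind a socket to port 0 so the OS hands back a free ephemeral port:

```python
import socket

def get_node_ip_and_free_port():
    # Sketch of the usual pattern behind a helper like
    # _get_node_ip_and_free_port (assumed, not the library's exact code).
    try:
        ip = socket.gethostbyname(socket.gethostname())
    except socket.gaierror:
        ip = "127.0.0.1"  # fall back when the hostname does not resolve
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 asks the kernel for any free port
        port = s.getsockname()[1]
    return ip, port
```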
- nemo_rl.distributed.virtual_cluster.init_ray(log_dir: Optional[str] = None)[source]#
Initialize Ray.
Try to attach to an existing local cluster. If that cluster uses the same CUDA_VISIBLE_DEVICES or Slurm-managed tag, it is reused; otherwise, we detach and start a fresh local cluster.
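The reuse check can be sketched as a pure function (hypothetical helper, not the library's implementation): the running local cluster is reused only when its tag matches the caller's.

```python
from typing import Optional

def should_reuse_cluster(existing_tag: Optional[str], current_tag: str) -> bool:
    # Hypothetical sketch of init_ray's decision: attach to the running
    # local cluster only if it was started under the same
    # CUDA_VISIBLE_DEVICES (or Slurm) tag; otherwise detach and restart.
    return existing_tag is not None and existing_tag == current_tag
```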
- exception nemo_rl.distributed.virtual_cluster.ResourceInsufficientError[source]#
Bases:
Exception
Exception raised when the cluster does not have enough resources to satisfy the requested configuration.
Initialization
Initialize self. See help(type(self)) for accurate signature.
- class nemo_rl.distributed.virtual_cluster.RayVirtualCluster(
- bundle_ct_per_node_list: List[int],
- use_gpus: bool = True,
- max_colocated_worker_groups: int = 1,
- num_gpus_per_node: int = 8,
- name: str = '',
- placement_group_strategy: str = 'STRICT_PACK',
- )[source]#
Creates a virtual distributed cluster using Ray placement groups.
This class simplifies distributed training setup by:
- Creating placement groups that represent logical compute nodes
- Allocating GPU and CPU resources for distributed workers
- Managing communication between distributed processes
Key concepts:
- Bundle: A resource allocation unit (e.g., 4 GPUs on a single node)
- Worker: A process that performs computation (model training/inference)
- Node: A physical or virtual machine containing multiple bundles
Initialization
Initialize a virtual cluster using Ray placement groups.
- Parameters:
bundle_ct_per_node_list – List specifying GPU bundles per node (e.g., [2,2] creates 2 nodes with 2 GPU bundles each)
use_gpus – Whether to allocate GPU resources
max_colocated_worker_groups – Maximum number of worker groups that can be colocated
num_gpus_per_node – Number of GPUs per node
name – Name prefix for placement groups
placement_group_strategy – Ray placement group strategy ("STRICT_PACK", "PACK", or "SPREAD")
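The bundle list alone determines the shape of the virtual cluster. A sketch of what the arguments describe (the construction itself is shown as a comment because it needs a live Ray cluster; `cluster_shape` is a hypothetical helper for illustration only):

```python
# Intended call shape (requires a running Ray cluster):
#
#   cluster = RayVirtualCluster(
#       bundle_ct_per_node_list=[2, 2],  # 2 nodes x 2 GPU bundles each
#       use_gpus=True,
#       num_gpus_per_node=8,
#       name="train",
#   )

def cluster_shape(bundle_ct_per_node_list):
    # Hypothetical helper: how many non-empty nodes and how many total
    # GPU bundles a given bundle list describes.
    nodes = sum(1 for n in bundle_ct_per_node_list if n > 0)
    total = sum(bundle_ct_per_node_list)
    return nodes, total

# [2, 2] describes 2 nodes and 4 bundles; [4, 0] has 1 non-empty node.
```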
- _init_placement_groups(strategy: str)[source]#
Creates placement groups for each node in the cluster. Nodes without any bundles receive empty placement groups.
- Parameters:
strategy – Ray placement group strategy
- Returns:
List of placement groups, one per node
- get_placement_groups()[source]#
Returns a list of placement groups that have at least one bundle, filtering out empty nodes.
This represents the “virtual cluster” - only nodes that are actually being used.
- Returns:
List of placement groups that have at least one bundle
- get_master_address_and_port()[source]#
Gets the master address and port for the distributed training setup.
- Returns:
Tuple of (address, port)
- shutdown()[source]#
Cleans up and releases all resources associated with this virtual cluster.
This includes removing all placement groups and resetting the internal state.
This method is idempotent and can be safely called multiple times.
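The idempotency pattern is simple to illustrate (toy sketch; `TinyCluster` is not part of nemo_rl): release every tracked placement group, then reset the list so a second call has nothing left to do.

```python
class TinyCluster:
    # Toy sketch of the idempotent-shutdown pattern described above
    # (illustrative only, not the library's implementation).

    def __init__(self):
        # Stand-ins for Ray placement group handles.
        self._placement_groups = ["pg-0", "pg-1"]

    def shutdown(self):
        for pg in self._placement_groups:
            pass  # with Ray: ray.util.remove_placement_group(pg)
        self._placement_groups = []  # reset, so a repeat call is a no-op
        return True
```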
- _create_visualization_grid(worker_groups=None, is_global_view=False)[source]#
Create a visualization grid for the cluster with optional worker groups.
- Parameters:
worker_groups – Single worker group, list of worker groups, or None
is_global_view – Whether this is a global view (multiple worker groups) or single view
- Returns:
A dictionary containing the grid data for display
- Return type:
dict
- _get_worker_cells(
- node_idx,
- gpu_idx,
- worker_groups,
- cell_width,
- is_global_view,
- )[source]#
Get the worker cell content for each worker group at a specific GPU location.
- Parameters:
node_idx – The node index
gpu_idx – The GPU index within the node
worker_groups – List of worker groups to check
cell_width – Width of each cell for formatting
is_global_view – Whether this is a global view with multiple worker groups
- Returns:
List of formatted worker cells, one per worker group
- Return type:
list
- _print_visualization(grid_data)[source]#
Print the visualization based on the grid data.
- Parameters:
grid_data – The grid data generated by _create_visualization_grid
- print_cluster_grid(worker_group=None)[source]#
Prints a compact grid visualization of the virtual cluster, similar to JAX’s visualize_array_sharding.
If a worker_group is provided, it will also show worker assignments on each device.
- Parameters:
worker_group – Optional RayWorkerGroup instance to visualize worker assignments
- print_all_worker_groups(worker_groups=None)[source]#
Prints a visualization showing all worker groups in the cluster.
This provides a global view of all workers across all worker groups.
- Parameters:
worker_groups – List of RayWorkerGroup instances to visualize. If None, no worker assignments will be shown.