nemo_automodel.distributed.init_utils#

Module Contents#

Classes#

DistInfo

Holds information about the distributed training environment.

Functions#

initialize_distributed

Initialize the torch.distributed environment and core model parallel infrastructure.

destroy_global_state

Destroy the torch.distributed process group during cleanup.

API#

class nemo_automodel.distributed.init_utils.DistInfo[source]#

Holds information about the distributed training environment.

Attributes:
  • backend (str) – The backend used for torch.distributed (e.g., ‘nccl’).

  • rank (int) – The rank of the current process.

  • world_size (int) – The total number of processes.

  • device (torch.device) – The device assigned to the current process.

  • is_main (bool) – True if the process is the main process (rank 0).

backend: str#

rank: int#

world_size: int#

device: torch.device#

is_main: bool#
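
A minimal sketch of working with the fields above, assuming DistInfo is a plain dataclass whose constructor accepts the documented fields; in practice an instance is returned by initialize_distributed rather than built by hand:

```python
import torch

from nemo_automodel.distributed.init_utils import DistInfo

# Hypothetical single-process example, for illustration only (assumes a
# dataclass-style constructor with the documented field names).
info = DistInfo(
    backend="gloo",
    rank=0,
    world_size=1,
    device=torch.device("cpu"),
    is_main=True,
)

# is_main is a convenient guard for rank-0-only work such as logging.
if info.is_main:
    print(f"rank {info.rank} of {info.world_size} on {info.device}")
```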

nemo_automodel.distributed.init_utils.initialize_distributed(backend, timeout_minutes=1)[source]#

Initialize the torch.distributed environment and core model parallel infrastructure.

This function sets the device based on the local rank, configures the process group, and calls torch.distributed.init_process_group with the appropriate parameters. It also registers a cleanup function to properly destroy the process group at exit.

Parameters:
  • backend (str) – The backend to use for torch.distributed (e.g., ‘nccl’).

  • timeout_minutes (int, optional) – Timeout (in minutes) for distributed initialization. Defaults to 1.

Returns:

An instance containing the distributed environment configuration.

Return type:

DistInfo
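
A usage sketch, assuming the script is launched with torchrun (which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that torch.distributed reads); the timeout value is illustrative:

```python
from nemo_automodel.distributed.init_utils import initialize_distributed

# Launch with e.g.: torchrun --nproc-per-node=8 train.py
info = initialize_distributed(backend="nccl", timeout_minutes=10)

if info.is_main:
    print(f"initialized {info.world_size} ranks; this process uses {info.device}")
```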

nemo_automodel.distributed.init_utils.destroy_global_state()[source]#

Destroy the torch.distributed process group during cleanup.

This function is registered to execute at exit to ensure the process group is properly destroyed. It temporarily ignores SIGINT to avoid interruption during cleanup.
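
The cleanup behavior described above can be pictured with the following sketch; this is an illustrative reconstruction of the pattern, not the library's exact implementation:

```python
import signal

import torch.distributed as dist

def destroy_global_state_sketch():
    # Temporarily ignore SIGINT so Ctrl-C cannot interrupt teardown.
    previous_handler = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
    finally:
        # Restore the original SIGINT handler once cleanup is done.
        signal.signal(signal.SIGINT, previous_handler)
```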