nemo_automodel.distributed.init_utils

Module Contents

Classes

- DistInfo: Holds information about the distributed training environment.

Functions

- initialize_distributed: Initialize the torch.distributed environment and core model parallel infrastructure.
- Destroy the torch.distributed process group during cleanup.

API
- class nemo_automodel.distributed.init_utils.DistInfo[source]

  Holds information about the distributed training environment.

  Attributes:

  - backend (str): The backend used for torch.distributed (e.g., 'nccl').
  - rank (int): The rank of the current process.
  - world_size (int): The total number of processes.
  - device (torch.device): The device assigned to the current process.
  - is_main (bool): True if the process is the main process (rank 0).
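The fields above can be sketched as a plain dataclass; this is a minimal illustration of the documented attributes, not the library's actual definition, which may carry extra methods or defaults:

```python
from dataclasses import dataclass


@dataclass
class DistInfo:
    """Minimal sketch of the documented fields.

    The real class lives in nemo_automodel.distributed.init_utils and
    may differ in detail.
    """

    backend: str            # backend used for torch.distributed, e.g. 'nccl'
    rank: int               # rank of the current process
    world_size: int         # total number of processes
    device: "torch.device"  # device assigned to this process
    is_main: bool           # True only for the main process (rank 0)
```

The `device` annotation is written as a string so the sketch stands alone without importing torch; dataclasses store annotations without evaluating them.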
- nemo_automodel.distributed.init_utils.initialize_distributed(backend, timeout_minutes=1)[source]

  Initialize the torch.distributed environment and core model parallel infrastructure.

  This function sets the device based on the local rank, configures the process group, and calls torch.distributed.init_process_group with the appropriate parameters. It also registers a cleanup function to properly destroy the process group at exit.

  - Parameters:
    - backend (str): The backend to use for torch.distributed (e.g., 'nccl').
    - timeout_minutes (int, optional): Timeout (in minutes) for distributed initialization. Defaults to 1.
- Returns:
An instance containing the distributed environment configuration.
- Return type:
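The "device based on the local rank" step follows the standard torchrun launcher conventions (`RANK`, `WORLD_SIZE`, `LOCAL_RANK` environment variables). A sketch of just that environment-reading step is below; the helper name `read_dist_env` is hypothetical, and the real initialize_distributed additionally calls torch.distributed.init_process_group and registers the exit-time cleanup:

```python
import os


def read_dist_env(backend: str = "nccl") -> dict:
    """Sketch: derive the distributed configuration from the environment
    variables that torchrun sets, before process-group initialization.
    This mirrors the documented behavior, not the exact implementation."""
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    # With the 'nccl' backend each process is pinned to one GPU, keyed by
    # LOCAL_RANK; other backends (e.g. 'gloo') run on CPU.
    device = f"cuda:{local_rank}" if backend == "nccl" else "cpu"
    return {
        "backend": backend,
        "rank": rank,
        "world_size": world_size,
        "device": device,
        "is_main": rank == 0,  # rank 0 is the main process
    }
```

Under torchrun with two processes per node, process 1 would see `RANK=1`, `LOCAL_RANK=1`, and be pinned to `cuda:1` with `is_main` False.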