nemo_automodel.components.distributed.init_utils
Module Contents#
Classes#
- DistInfo – Holds information about the distributed training environment.
Functions#
- get_rank_safe – Get the distributed rank safely, even if torch.distributed is not initialized.
- get_world_size_safe – Get the distributed world size safely, even if torch.distributed is not initialized.
- get_local_rank_preinit – Get the local rank from the environment variable, intended for use before full init.
- initialize_distributed – Initialize the torch.distributed environment and core model parallel infrastructure.
- destroy_global_state – Destroy the torch.distributed process group during cleanup.
API#
- nemo_automodel.components.distributed.init_utils.get_rank_safe() → int [source]#
Get the distributed rank safely, even if torch.distributed is not initialized.
- Returns:
The current process rank.
- nemo_automodel.components.distributed.init_utils.get_world_size_safe() → int [source]#
Get the distributed world size safely, even if torch.distributed is not initialized.
- Returns:
The total number of processes in the distributed job.
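Both safe getters follow the same guard: consult torch.distributed only when a process group actually exists. A minimal sketch of that pattern (the env-var fallback shown is an assumption, not necessarily this module's exact behavior):

```python
import os

import torch.distributed as dist


def _safe_rank() -> int:
    # Hypothetical re-implementation of the "safe" pattern used by
    # get_rank_safe / get_world_size_safe: query torch.distributed only
    # once a process group exists.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    # Assumed fallback: the RANK env var set by the launcher, else 0.
    return int(os.environ.get("RANK", "0"))
```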
- nemo_automodel.components.distributed.init_utils.get_local_rank_preinit() → int [source]#
Get the local rank from the environment variable, intended for use before full init.
- Returns:
The local rank of the current process.
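A typical pre-init use is pinning the CUDA device before the process group is created; a minimal sketch:

```python
import torch

from nemo_automodel.components.distributed.init_utils import get_local_rank_preinit

# Bind this process to its GPU before torch.distributed is initialized,
# so later collectives land on the right device.
local_rank = get_local_rank_preinit()
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
```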
- class nemo_automodel.components.distributed.init_utils.DistInfo[source]#
Holds information about the distributed training environment.
- Attributes:
backend (str) – The backend used for torch.distributed (e.g., ‘nccl’).
rank (int) – The rank of the current process.
world_size (int) – The total number of processes.
device (torch.device) – The device assigned to the current process.
is_main (bool) – True if the process is the main process (rank 0).
- backend: str#
- rank: int#
- world_size: int#
- device: torch.device#
- is_main: bool#
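As a usage sketch, the fields above can gate rank-dependent work; `report` below is a hypothetical helper, not part of this module:

```python
from nemo_automodel.components.distributed.init_utils import DistInfo


def report(info: DistInfo) -> None:
    # Gate side effects on the main process; every field used here is
    # documented on the class above.
    if info.is_main:
        print(
            f"backend={info.backend} rank={info.rank}/{info.world_size} "
            f"device={info.device}"
        )
```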
- nemo_automodel.components.distributed.init_utils.initialize_distributed(backend, timeout_minutes=1)[source]#
Initialize the torch.distributed environment and core model parallel infrastructure.
This function sets the device based on the local rank, configures the process group, and calls torch.distributed.init_process_group with the appropriate parameters. It also registers a cleanup function to properly destroy the process group at exit.
- Parameters:
backend (str) – The backend to use for torch.distributed (e.g., ‘nccl’).
timeout_minutes (int, optional) – Timeout (in minutes) for distributed initialization. Defaults to 1.
- Returns:
An instance containing the distributed environment configuration.
- Return type:
DistInfo
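A minimal end-to-end sketch (the torchrun launch command and the timeout value are illustrative assumptions):

```python
from nemo_automodel.components.distributed.init_utils import initialize_distributed

# Typically launched via a tool such as `torchrun --nproc-per-node=<N> train.py`,
# which sets the RANK / WORLD_SIZE / LOCAL_RANK environment variables per process.
info = initialize_distributed(backend="nccl", timeout_minutes=5)
if info.is_main:
    print(f"initialized rank {info.rank}/{info.world_size} on {info.device}")
```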
- nemo_automodel.components.distributed.init_utils.destroy_global_state()[source]#
Destroy the torch.distributed process group during cleanup.
This function is registered to execute at exit to ensure the process group is properly destroyed. It temporarily ignores SIGINT to avoid interruption during cleanup.
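The docstring describes an atexit-registered teardown that masks SIGINT; a hedged sketch of that pattern (a simplified re-implementation, not the module's actual code):

```python
import atexit
import signal

import torch.distributed as dist


def _destroy_sketch() -> None:
    # Ignore Ctrl-C while tearing down so cleanup cannot be interrupted,
    # then restore the previous handler.
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        if dist.is_available() and dist.is_initialized():
            dist.destroy_process_group()
    finally:
        signal.signal(signal.SIGINT, previous)


atexit.register(_destroy_sketch)
```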