nemo_automodel.components.distributed.init_utils

Module Contents

Classes

| Name | Description |
| --- | --- |
| DistInfo | Holds information about the distributed training environment. |

Functions

| Name | Description |
| --- | --- |
| destroy_global_state | Destroy the torch.distributed process group during cleanup. |
| get_local_rank_preinit | Get the local rank from the environment variable, intended for use before full init. |
| get_local_world_size_preinit | Get the local world size from the environment variable, intended for use before full init. |
| get_rank_safe | Get the distributed rank safely, even if torch.distributed is not initialized. |
| get_world_size_safe | Get the distributed world size safely, even if torch.distributed is not initialized. |
| initialize_distributed | Initialize the torch.distributed environment and core model parallel infrastructure. |

API

class nemo_automodel.components.distributed.init_utils.DistInfo(
backend: str,
rank: int,
world_size: int,
device: torch.device,
is_main: bool
)
Dataclass

Holds information about the distributed training environment.

backend: str
device: torch.device
is_main: bool
rank: int
world_size: int

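As an illustration of how these fields might be consumed, the sketch below defines a stand-in dataclass with the same five fields and uses is_main to restrict logging to one process. DistInfoSketch and log_on_main are hypothetical names introduced here, and the stand-in types device as a plain string so the example runs without torch; the real class stores a torch.device.

```python
from dataclasses import dataclass


@dataclass
class DistInfoSketch:
    """Hypothetical stand-in mirroring the documented DistInfo fields."""

    backend: str
    rank: int
    world_size: int
    device: str  # the real DistInfo stores a torch.device here
    is_main: bool


def log_on_main(info: DistInfoSketch, message: str) -> bool:
    """Print only on the main process; return whether anything was printed."""
    if info.is_main:
        print(f"[rank {info.rank}/{info.world_size}] {message}")
        return True
    return False


# Assumption for this example: rank 0 is treated as the main process.
main = DistInfoSketch("nccl", rank=0, world_size=2, device="cuda:0", is_main=True)
worker = DistInfoSketch("nccl", rank=1, world_size=2, device="cuda:1", is_main=False)
log_on_main(main, "starting training")
log_on_main(worker, "starting training")  # prints nothing
```
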
nemo_automodel.components.distributed.init_utils.destroy_global_state()

Destroy the torch.distributed process group during cleanup.

This function is registered to execute at exit to ensure the process group is properly destroyed. It temporarily ignores SIGINT to avoid interruption during cleanup.

For MoE runs that use DeepEP, the process-global DeepEP Buffer (NVSHMEM symmetric memory plus its own NCCL sub-groups) is destroyed before destroy_process_group(). Tearing the buffer down first releases that pending collective state; otherwise destroy_process_group() hangs on it. Freeing the buffer is a no-op when DeepEP was never used.
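The cleanup pattern described above (register at exit, ignore SIGINT while tearing down, then restore the handler) can be sketched with the standard library alone. This is not the module's implementation: run_ignoring_sigint and fake_teardown are hypothetical names, and the teardown steps are placeholders that only record the documented ordering.

```python
import atexit
import signal


def run_ignoring_sigint(cleanup):
    """Run cleanup() with SIGINT ignored, then restore the previous handler.

    signal.signal may only be called from the main thread.
    """
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        cleanup()
    finally:
        # signal.signal returns None if the old handler was not set from Python.
        if previous is not None:
            signal.signal(signal.SIGINT, previous)


teardown_order = []


def fake_teardown():
    # Placeholder steps; a real teardown would free the DeepEP buffer (if any)
    # and then call torch.distributed.destroy_process_group().
    teardown_order.append("free_buffer")
    teardown_order.append("destroy_process_group")


# Register the guarded cleanup so it runs when the interpreter exits.
atexit.register(run_ignoring_sigint, fake_teardown)
```

Keeping the buffer release ahead of the process-group teardown mirrors the ordering the documentation calls for.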

nemo_automodel.components.distributed.init_utils.get_local_rank_preinit() -> int

Get the local rank from the environment variable, intended for use before full init.

Returns: int

The local rank of the current process.

nemo_automodel.components.distributed.init_utils.get_local_world_size_preinit() -> int

Get the local world size from the environment variable, intended for use before full init.

Returns: int

The local world size of the current process.
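The page does not name the environment variables these two pre-init helpers read. A minimal sketch, assuming the LOCAL_RANK and LOCAL_WORLD_SIZE variables that torchrun exports, looks like the following; read_int_env is a hypothetical helper and the defaults are illustrative.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Return an integer environment variable, or default when it is unset."""
    value = os.environ.get(name)
    return default if value is None else int(value)


# LOCAL_RANK and LOCAL_WORLD_SIZE are the names torchrun exports; whether
# init_utils reads exactly these names is an assumption.
local_rank = read_int_env("LOCAL_RANK", default=0)
local_world_size = read_int_env("LOCAL_WORLD_SIZE", default=1)
print(f"local rank {local_rank} of {local_world_size}")
```

Because only the environment is consulted, values like these are available before any process group exists, which is the situation the pre-init helpers are documented for.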

nemo_automodel.components.distributed.init_utils.get_rank_safe() -> int

Get the distributed rank safely, even if torch.distributed is not initialized.

Returns: int

The current process rank.

nemo_automodel.components.distributed.init_utils.get_world_size_safe() -> int

Get the distributed world size safely, even if torch.distributed is not initialized.

Returns: int

The total number of processes in the distributed job.
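One way a "safe" lookup of this kind can work is to ask torch.distributed only when a process group is already initialized and otherwise fall back to the environment. The sketch below follows that idea; rank_and_world_size is a hypothetical function, and the RANK / WORLD_SIZE fallback with defaults 0 and 1 is an assumption, not confirmed behavior of get_rank_safe or get_world_size_safe.

```python
import os


def rank_and_world_size():
    """Return (rank, world_size) without requiring an initialized process group."""
    try:
        import torch.distributed as dist

        if dist.is_available() and dist.is_initialized():
            return dist.get_rank(), dist.get_world_size()
    except ImportError:
        pass  # torch is not installed; fall through to the environment
    # Assumed fallback: the RANK / WORLD_SIZE variables exported by torchrun.
    return int(os.environ.get("RANK", 0)), int(os.environ.get("WORLD_SIZE", 1))


rank, world_size = rank_and_world_size()
print(f"rank {rank} of {world_size}")
```
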

nemo_automodel.components.distributed.init_utils.initialize_distributed(
backend,
timeout_minutes = 1
)

Initialize the torch.distributed environment and core model parallel infrastructure.

This function sets the device based on the local rank, configures the process group, and calls torch.distributed.init_process_group with the appropriate parameters. It also registers a cleanup function to properly destroy the process group at exit.

Parameters:

backend
str

The backend to use for torch.distributed (e.g., 'nccl').

timeout_minutes
int, defaults to 1

Timeout (in minutes) for distributed initialization.

Returns: DistInfo

A DistInfo instance containing the distributed environment configuration.
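To make the timeout_minutes parameter concrete: torch.distributed.init_process_group takes its timeout as a datetime.timedelta, so a minutes value has to be converted before the call. The sketch below assembles such keyword arguments without importing torch. build_init_kwargs is a hypothetical helper, and reading RANK and WORLD_SIZE from the environment is an assumption about how the values are obtained.

```python
import os
from datetime import timedelta


def build_init_kwargs(backend: str, timeout_minutes: int = 1) -> dict:
    """Assemble keyword arguments for torch.distributed.init_process_group."""
    return {
        "backend": backend,
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        # init_process_group expects its timeout as a datetime.timedelta.
        "timeout": timedelta(minutes=timeout_minutes),
    }


kwargs = build_init_kwargs("nccl", timeout_minutes=10)
print(kwargs["timeout"])  # 0:10:00
```

In a process launched with torchrun, a dictionary like this could be passed as torch.distributed.init_process_group(**kwargs); initialize_distributed wraps that step together with device selection and cleanup registration.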