nemo_automodel.components.distributed.init_utils

Module Contents

Classes

Name	Description
`DistInfo`	Holds information about the distributed training environment.

Functions

Name	Description
`destroy_global_state`	Destroy the torch.distributed process group during cleanup.
`get_local_rank_preinit`	Get the local rank from the environment variable, intended for use before full init.
`get_local_world_size_preinit`	Get the local world size from the environment variable, intended for use before full init.
`get_rank_safe`	Get the distributed rank safely, even if torch.distributed is not initialized.
`get_world_size_safe`	Get the distributed world size safely, even if torch.distributed is not initialized.
`initialize_distributed`	Initialize the torch.distributed environment and core model parallel infrastructure.

API

class nemo_automodel.components.distributed.init_utils.DistInfo(
    backend: str,
    rank: int,
    world_size: int,
    device: torch.device,
    is_main: bool
)

Dataclass

Holds information about the distributed training environment.

backend

str

device

torch.device

is_main

bool

rank

int

world_size

int
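As a rough illustration of how these fields fit together, the sketch below defines a hypothetical stand-in, `DistInfoSketch`, that mirrors the documented field names. It is not the library class: `device` is a plain string so the example runs without PyTorch, and treating rank 0 as the main process is an assumption, not something this page states.

```python
from dataclasses import dataclass


@dataclass
class DistInfoSketch:
    """Hypothetical stand-in that mirrors the documented DistInfo fields."""

    backend: str
    rank: int
    world_size: int
    device: str  # stand-in for torch.device so the sketch needs no PyTorch
    is_main: bool


# Illustrative values for a single-node job with four processes.
rank, local_rank, world_size = 0, 0, 4

info = DistInfoSketch(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    device=f"cuda:{local_rank}",
    is_main=(rank == 0),  # assumption: the main process is global rank 0
)

# A common pattern: only the main process logs or writes checkpoints.
if info.is_main:
    print(f"rank {info.rank} of {info.world_size} on {info.device}")
```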

nemo_automodel.components.distributed.init_utils.destroy_global_state()

Destroy the torch.distributed process group during cleanup.

This function is registered to execute at exit to ensure the process group is properly destroyed. It temporarily ignores SIGINT to avoid interruption during cleanup.

For MoE runs that use DeepEP, the process-global DeepEP Buffer (NVSHMEM symmetric memory plus its own NCCL sub-groups) is destroyed before destroy_process_group(). Tearing the buffer down first releases that pending collective state; otherwise destroy_process_group() hangs on it. Freeing the buffer is a no-op when DeepEP was never used.
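The cleanup pattern described above (ignore SIGINT, run teardown steps in a fixed order, then restore the handler) can be sketched with the standard library alone. The helper name `run_teardown` and the two placeholder steps are hypothetical; the real function's internals are not shown on this page.

```python
import signal


def run_teardown(steps):
    """Run teardown callables in order while SIGINT is ignored."""
    # signal.signal returns the handler that was active before the change.
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        for step in steps:
            step()
    finally:
        # Restore the earlier handler (None means it was not set from Python).
        if previous is not None:
            signal.signal(signal.SIGINT, previous)


calls = []
run_teardown(
    [
        # Placeholder for freeing the DeepEP buffer, which must come first.
        lambda: calls.append("free_deepep_buffer"),
        # Placeholder for torch.distributed.destroy_process_group().
        lambda: calls.append("destroy_process_group"),
    ]
)
print(calls)
```

In a real program, a function like this would typically be registered once with `atexit.register` so that it runs at interpreter exit.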

nemo_automodel.components.distributed.init_utils.get_local_rank_preinit() -> int

Get the local rank from the environment variable, intended for use before full init.

Returns: int

The local rank of the current process.

nemo_automodel.components.distributed.init_utils.get_local_world_size_preinit() -> int

Get the local world size from the environment variable, intended for use before full init.

Returns: int

The local world size of the current process.
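This page does not name the environment variables that the two pre-init helpers read. The sketch below assumes the `LOCAL_RANK` and `LOCAL_WORLD_SIZE` variables that `torchrun` exports to each worker; the function names and the fallback defaults are hypothetical.

```python
import os


def local_rank_from_env(default: int = 0) -> int:
    """Read the per-node rank; LOCAL_RANK is an assumed variable name."""
    return int(os.environ.get("LOCAL_RANK", default))


def local_world_size_from_env(default: int = 1) -> int:
    """Read the per-node process count; LOCAL_WORLD_SIZE is assumed."""
    return int(os.environ.get("LOCAL_WORLD_SIZE", default))


# Simulate what a launcher would export for the fourth of eight local workers.
os.environ["LOCAL_RANK"] = "3"
os.environ["LOCAL_WORLD_SIZE"] = "8"
print(local_rank_from_env(), local_world_size_from_env())
```

Because these values come from the environment, they are available before any process group exists, which is what makes them usable for early device selection.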

nemo_automodel.components.distributed.init_utils.get_rank_safe() -> int

Get the distributed rank safely, even if torch.distributed is not initialized.

Returns: int

The current process rank.

nemo_automodel.components.distributed.init_utils.get_world_size_safe() -> int

Get the distributed world size safely, even if torch.distributed is not initialized.

Returns: int

The total number of processes in the distributed job.
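One plausible way to implement "safe" getters like the two above is to ask `torch.distributed` only when a process group is initialized, and otherwise fall back to the environment. The sketch below follows that idea. The function names, the `RANK` and `WORLD_SIZE` fallbacks, and the defaults are assumptions, not the library's confirmed behavior.

```python
import os

try:
    import torch.distributed as dist
except ImportError:  # lets the sketch run without PyTorch installed
    dist = None


def _dist_ready() -> bool:
    """True only when torch.distributed is usable and initialized."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def rank_or_default() -> int:
    if _dist_ready():
        return dist.get_rank()
    # Assumed fallback: the RANK variable exported by torchrun, else 0.
    return int(os.environ.get("RANK", "0"))


def world_size_or_default() -> int:
    if _dist_ready():
        return dist.get_world_size()
    # Assumed fallback: the WORLD_SIZE variable exported by torchrun, else 1.
    return int(os.environ.get("WORLD_SIZE", "1"))


print(rank_or_default(), world_size_or_default())
```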

nemo_automodel.components.distributed.init_utils.initialize_distributed(
    backend,
    timeout_minutes = 1
)

Initialize the torch.distributed environment and core model parallel infrastructure.

This function sets the device based on the local rank, configures the process group, and calls torch.distributed.init_process_group with the appropriate parameters. It also registers a cleanup function to properly destroy the process group at exit.

Parameters:

backend

str

The backend to use for torch.distributed (e.g., "nccl").

timeout_minutes

int, defaults to 1

Timeout (in minutes) for distributed initialization. Defaults to 1.

Returns: DistInfo

An instance containing the distributed environment configuration.
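To make the setup steps concrete, the sketch below prepares the inputs that an initializer of this kind might pass along, without creating a process group. `plan_init` is a hypothetical helper. The device rule (a CUDA device indexed by local rank for `nccl`, CPU otherwise) and the use of `LOCAL_RANK` are assumptions. What is certain is that `torch.distributed.init_process_group` accepts a `backend` string and a `timeout` given as a `datetime.timedelta`.

```python
import os
from datetime import timedelta


def plan_init(backend: str, timeout_minutes: int = 1) -> dict:
    """Build the device name and init kwargs; does not touch torch."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # assumed variable
    # Assumed rule: NCCL implies one CUDA device per local rank.
    device = f"cuda:{local_rank}" if backend == "nccl" else "cpu"
    return {
        "device": device,
        "init_kwargs": {
            "backend": backend,
            # init_process_group expects a timedelta, so convert the minutes.
            "timeout": timedelta(minutes=timeout_minutes),
        },
    }


plan = plan_init("gloo", timeout_minutes=5)
print(plan["device"], plan["init_kwargs"]["timeout"])
```

The default timeout of one minute is short for large jobs. Where initialization is slow, passing a larger `timeout_minutes` may be needed.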