# nemo_automodel.components.distributed.init_utils

## Module Contents

### Classes

| Name                                                                     | Description                                                   |
| ------------------------------------------------------------------------ | ------------------------------------------------------------- |
| [`DistInfo`](#nemo_automodel-components-distributed-init_utils-DistInfo) | Holds information about the distributed training environment. |

### Functions

| Name                                                                                                             | Description                                                                                |
| ---------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| [`destroy_global_state`](#nemo_automodel-components-distributed-init_utils-destroy_global_state)                 | Destroy the torch.distributed process group during cleanup.                                |
| [`get_local_rank_preinit`](#nemo_automodel-components-distributed-init_utils-get_local_rank_preinit)             | Get the local rank from the environment variable, intended for use before full init.       |
| [`get_local_world_size_preinit`](#nemo_automodel-components-distributed-init_utils-get_local_world_size_preinit) | Get the local world size from the environment variable, intended for use before full init. |
| [`get_rank_safe`](#nemo_automodel-components-distributed-init_utils-get_rank_safe)                               | Get the distributed rank safely, even if torch.distributed is not initialized.             |
| [`get_world_size_safe`](#nemo_automodel-components-distributed-init_utils-get_world_size_safe)                   | Get the distributed world size safely, even if torch.distributed is not initialized.       |
| [`initialize_distributed`](#nemo_automodel-components-distributed-init_utils-initialize_distributed)             | Initialize the torch.distributed environment and core model parallel infrastructure.       |

### API

```python
class nemo_automodel.components.distributed.init_utils.DistInfo(
    backend: str,
    rank: int,
    world_size: int,
    device: torch.device,
    is_main: bool
)
```

Dataclass

Holds information about the distributed training environment.
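As a rough illustration of how the five fields might be consumed, the sketch below uses a stand-in dataclass. `DistInfoSketch` and `log_once` are hypothetical names, and `device` is a plain string rather than `torch.device` so the snippet runs without PyTorch installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DistInfoSketch:
    """Hypothetical stand-in mirroring the DistInfo fields listed above."""

    backend: str
    rank: int
    world_size: int
    device: str  # the real class uses torch.device here
    is_main: bool


def log_once(info: DistInfoSketch, message: str) -> Optional[str]:
    """Return the message only on the main process so logs are not duplicated."""
    return message if info.is_main else None


main_info = DistInfoSketch("nccl", rank=0, world_size=8, device="cuda:0", is_main=True)
worker_info = DistInfoSketch("nccl", rank=3, world_size=8, device="cuda:3", is_main=False)

print(log_once(main_info, f"running on {main_info.world_size} processes"))
```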

```python
nemo_automodel.components.distributed.init_utils.destroy_global_state()
```

Destroy the torch.distributed process group during cleanup.

This function is registered to execute at exit to ensure the process group is properly destroyed.
It temporarily ignores SIGINT to avoid interruption during cleanup.

For MoE runs that use DeepEP, the process-global DeepEP `Buffer` (NVSHMEM symmetric memory
plus its own NCCL sub-groups) is destroyed *before* `destroy_process_group()`. Tearing the
buffer down first releases that pending collective state; otherwise `destroy_process_group()`
hangs on it. Freeing the buffer is a no-op when DeepEP was never used.
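The ordering and signal handling described above can be sketched with the standard library alone. This is not the library's implementation: `ignore_sigint`, `cleanup`, and the two callables passed to it are hypothetical placeholders for "free the DeepEP buffer" and "destroy the process group".

```python
import signal
import threading
from contextlib import contextmanager


@contextmanager
def ignore_sigint():
    """Temporarily ignore Ctrl-C and restore the previous handler afterwards."""
    # signal.signal may only be called from the main thread.
    if threading.current_thread() is not threading.main_thread():
        yield
        return
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        yield
    finally:
        # A handler not installed from Python is reported as None.
        if previous is not None:
            signal.signal(signal.SIGINT, previous)


def cleanup(free_buffer, destroy_group):
    """Hypothetical cleanup: release the buffer first, then the process group."""
    with ignore_sigint():
        free_buffer()    # stands in for tearing down the DeepEP buffer
        destroy_group()  # stands in for destroy_process_group()


calls = []
handler_before = signal.getsignal(signal.SIGINT)
cleanup(lambda: calls.append("buffer"), lambda: calls.append("group"))
```

A function like `cleanup` could be registered with `atexit.register` so it runs when the interpreter shuts down, which matches the "registered to execute at exit" behavior described here.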

```python
nemo_automodel.components.distributed.init_utils.get_local_rank_preinit() -> int
```

Get the local rank from the environment variable, intended for use before full init.

**Returns:** `int`

The local rank of the current process.

```python
nemo_automodel.components.distributed.init_utils.get_local_world_size_preinit() -> int
```

Get the local world size from the environment variable, intended for use before full init.

**Returns:** `int`

The local world size of the current process.
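The reference does not name the environment variables these two helpers read. The sketch below assumes the `LOCAL_RANK` and `LOCAL_WORLD_SIZE` variables that `torchrun` exports. The function names and the fallback defaults are hypothetical.

```python
import os


def local_rank_from_env(default: int = 0) -> int:
    # Assumption: the launcher exports LOCAL_RANK, as torchrun does.
    return int(os.environ.get("LOCAL_RANK", default))


def local_world_size_from_env(default: int = 1) -> int:
    # Assumption: the launcher exports LOCAL_WORLD_SIZE, as torchrun does.
    return int(os.environ.get("LOCAL_WORLD_SIZE", default))


# Simulate what a launcher would set for the third of four processes on a node.
os.environ["LOCAL_RANK"] = "2"
os.environ["LOCAL_WORLD_SIZE"] = "4"
print(local_rank_from_env(), local_world_size_from_env())
```

Reading these values needs no initialized process group, which is what makes this kind of helper usable before full init, for example to pick a device early.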

```python
nemo_automodel.components.distributed.init_utils.get_rank_safe() -> int
```

Get the distributed rank safely, even if torch.distributed is not initialized.

**Returns:** `int`

The current process rank.

```python
nemo_automodel.components.distributed.init_utils.get_world_size_safe() -> int
```

Get the distributed world size safely, even if torch.distributed is not initialized.

**Returns:** `int`

The total number of processes in the distributed job.
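One plausible way to make such queries "safe" is to ask `torch.distributed` only when a process group is active and fall back to launcher environment variables otherwise. The sketch below follows that pattern. The fallback to `RANK` and `WORLD_SIZE` and all function names are assumptions, not the documented behavior.

```python
import os


def _dist_if_initialized():
    """Return torch.distributed when a process group is active, else None."""
    try:
        import torch.distributed as dist
    except ImportError:
        return None  # PyTorch is not installed in this environment
    if dist.is_available() and dist.is_initialized():
        return dist
    return None


def rank_or_fallback() -> int:
    dist = _dist_if_initialized()
    if dist is not None:
        return dist.get_rank()
    # Assumption: fall back to the RANK variable set by launchers such as torchrun.
    return int(os.environ.get("RANK", "0"))


def world_size_or_fallback() -> int:
    dist = _dist_if_initialized()
    if dist is not None:
        return dist.get_world_size()
    # Assumption: fall back to WORLD_SIZE, treating a missing value as one process.
    return int(os.environ.get("WORLD_SIZE", "1"))


os.environ["RANK"] = "3"
os.environ["WORLD_SIZE"] = "8"
print(rank_or_fallback(), world_size_or_fallback())
```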

```python
nemo_automodel.components.distributed.init_utils.initialize_distributed(
    backend,
    timeout_minutes = 1
)
```

Initialize the torch.distributed environment and core model parallel infrastructure.

This function sets the device based on the local rank, configures the process group,
and calls `torch.distributed.init_process_group` with the appropriate parameters.
It also registers a cleanup function to properly destroy the process group at exit.

**Parameters:**

`backend`: The backend to use for torch.distributed (e.g., 'nccl').

`timeout_minutes`: Timeout (in minutes) for distributed initialization. Defaults to 1.

**Returns:** `DistInfo`

An instance containing the distributed environment configuration.
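The timeout is given in minutes, while `torch.distributed.init_process_group` takes its `timeout` argument as a `datetime.timedelta`, so a conversion presumably happens along the way. The sketch below shows one way to do it. `build_init_kwargs` is a hypothetical helper, not part of this module.

```python
import datetime


def build_init_kwargs(backend: str, timeout_minutes: int = 1) -> dict:
    """Hypothetical helper: arguments one might forward to init_process_group."""
    return {
        "backend": backend,
        # init_process_group expects a timedelta, not a number of minutes.
        "timeout": datetime.timedelta(minutes=timeout_minutes),
    }


kwargs = build_init_kwargs("nccl", timeout_minutes=10)
print(kwargs["backend"], kwargs["timeout"].total_seconds())
```

With the function documented here, the analogous call would be `initialize_distributed("nccl", timeout_minutes=10)`, run inside a process started by a distributed launcher.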