nemo_automodel.components.distributed.mesh_utils#

Device mesh creation utilities for distributed training.

This module provides a central function to create device meshes based on the distributed config type (FSDP2, MegatronFSDP, or DDP).

Usage:

from nemo_automodel.components.distributed.config import FSDP2Config
from nemo_automodel.components.distributed.mesh_utils import create_device_mesh

config = FSDP2Config(sequence_parallel=True)
device_mesh, moe_mesh = create_device_mesh(
    config,
    tp_size=2,
    pp_size=1,
    dp_replicate_size=2,
    world_size=8,
)

Module Contents#

Functions#

create_device_mesh

Create device mesh based on distributed config type.

_create_fsdp2_device_mesh

Create device mesh for FSDP2.

_create_megatron_fsdp_device_mesh

Create device mesh for MegatronFSDP.

_create_moe_mesh

Create MOE mesh for expert parallelism.

API#

nemo_automodel.components.distributed.mesh_utils.create_device_mesh(
distributed_config: Union[nemo_automodel.components.distributed.config.FSDP2Config, nemo_automodel.components.distributed.config.MegatronFSDPConfig, nemo_automodel.components.distributed.config.DDPConfig],
*,
dp_size: Optional[int] = None,
dp_replicate_size: Optional[int] = None,
tp_size: int = 1,
pp_size: int = 1,
cp_size: int = 1,
ep_size: int = 1,
world_size: int,
) → Tuple[Optional[torch.distributed.device_mesh.DeviceMesh], Optional[torch.distributed.device_mesh.DeviceMesh]]#

Create device mesh based on distributed config type.

Routes to the appropriate mesh creation logic based on config type.

Parameters:
  • distributed_config – The distributed config (FSDP2Config, MegatronFSDPConfig, or DDPConfig).

  • dp_size – Data parallel size. If None, inferred from world_size and other parallelism sizes.

  • dp_replicate_size – FSDP2-only. Size of the replication group for HSDP (Hybrid Sharded Data Parallel). If None or <= 0, defaults to 1. Must be a divisor of dp_size.

  • tp_size – Tensor parallel size.

  • pp_size – Pipeline parallel size.

  • cp_size – Context parallel size.

  • ep_size – Expert parallel size (for MoE models).

  • world_size – Total number of processes.

Returns:

(device_mesh, moe_mesh):
  • For FSDP2Config: full device mesh + optional moe_mesh (if ep_size > 1)

  • For MegatronFSDPConfig: device mesh + None

  • For DDPConfig: (None, None) – DDP does not use a device mesh

Return type:

tuple

Raises:
  • ValueError – If dp_replicate_size is provided with non-FSDP2 config.

  • ValueError – If world_size is not divisible by parallelism sizes.
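
For illustration, a minimal sketch of the routing behavior described above, assuming DDPConfig and FSDP2Config can be constructed with default arguments (keyword defaults may differ across versions):

from nemo_automodel.components.distributed.config import DDPConfig, FSDP2Config
from nemo_automodel.components.distributed.mesh_utils import create_device_mesh

# DDPConfig: no device mesh is created, so both return values are None.
device_mesh, moe_mesh = create_device_mesh(DDPConfig(), world_size=4)
assert device_mesh is None and moe_mesh is None

# FSDP2Config with HSDP: dp_size is inferred as world_size // (tp * pp * cp) = 8,
# and dp_replicate_size=2 splits it into 2 replica groups of 4 shards each.
device_mesh, moe_mesh = create_device_mesh(
    FSDP2Config(),
    dp_replicate_size=2,
    tp_size=2,
    world_size=16,
)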

nemo_automodel.components.distributed.mesh_utils._create_fsdp2_device_mesh(
dp_size: Optional[int],
dp_replicate_size: Optional[int],
tp_size: int,
pp_size: int,
cp_size: int,
ep_size: int,
world_size: int,
backend: str,
) → Tuple[torch.distributed.device_mesh.DeviceMesh, Optional[torch.distributed.device_mesh.DeviceMesh]]#

Create device mesh for FSDP2.

Mesh shape: (pp_size, dp_replicate_size, dp_shard_size, cp_size, tp_size)
Mesh names: ("pp", "dp_replicate", "dp_shard", "cp", "tp")

Also creates flattened submeshes:
  • "dp": dp_replicate + dp_shard

  • "dp_shard_cp": dp_shard + cp

  • "dp_cp": dp_replicate + dp_shard + cp

Parameters:
  • dp_size – Data parallel size. If None, inferred from world_size.

  • dp_replicate_size – Size of the replication group for HSDP.

  • tp_size – Tensor parallel size.

  • pp_size – Pipeline parallel size.

  • cp_size – Context parallel size.

  • ep_size – Expert parallel size (for MoE models).

  • world_size – Total number of processes.

  • backend – Distributed backend (‘nccl’ or ‘gloo’).

Returns:

(device_mesh, moe_mesh)

Return type:

tuple
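
For reference, the documented shape and dimension names map onto a standard torch.distributed device mesh. The following is a rough sketch, not the module's internal code; it assumes the process group has been launched (e.g. via torchrun) with a matching number of ranks and a PyTorch version recent enough to support named mesh slicing:

import torch
from torch.distributed.device_mesh import init_device_mesh

# Sketch only: 8 ranks laid out as (pp, dp_replicate, dp_shard, cp, tp).
pp, dp_replicate, dp_shard, cp, tp = 1, 2, 2, 1, 2
device_type = "cuda" if torch.cuda.is_available() else "cpu"
mesh = init_device_mesh(
    device_type,
    mesh_shape=(pp, dp_replicate, dp_shard, cp, tp),
    mesh_dim_names=("pp", "dp_replicate", "dp_shard", "cp", "tp"),
)
# Named slices give sub-meshes; the module additionally flattens combinations
# of these dimensions into the "dp", "dp_shard_cp", and "dp_cp" names above.
dp_mesh = mesh["dp_replicate", "dp_shard"]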

nemo_automodel.components.distributed.mesh_utils._create_megatron_fsdp_device_mesh(
dp_size: Optional[int],
tp_size: int,
cp_size: int,
world_size: int,
backend: str,
) → torch.distributed.device_mesh.DeviceMesh#

Create device mesh for MegatronFSDP.

Mesh shape: (dp_size, cp_size, tp_size)
Mesh names: ("dp", "cp", "tp")

Also creates flattened submesh “dp_cp” if cp_size > 1.

Parameters:
  • dp_size – Data parallel size. If None, inferred from world_size.

  • tp_size – Tensor parallel size.

  • cp_size – Context parallel size.

  • world_size – Total number of processes.

  • backend – Distributed backend (‘nccl’ or ‘gloo’).

Returns:

The device mesh for MegatronFSDP.

Return type:

DeviceMesh
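
A comparable sketch for the 3-D MegatronFSDP layout (again only an illustration of the documented shape and names, assuming torch.distributed has been launched with the matching number of ranks):

from torch.distributed.device_mesh import init_device_mesh

# Sketch only: 4 ranks laid out as (dp, cp, tp).
dp, cp, tp = 2, 1, 2
mesh = init_device_mesh(
    "cuda",
    mesh_shape=(dp, cp, tp),
    mesh_dim_names=("dp", "cp", "tp"),
)
tp_group = mesh["tp"].get_group()  # process group backing the "tp" dimension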

nemo_automodel.components.distributed.mesh_utils._create_moe_mesh(
pp_size: int,
ep_shard_size: int,
ep_size: int,
backend: str,
) → torch.distributed.device_mesh.DeviceMesh#

Create MOE mesh for expert parallelism.

Mesh shape: (pp_size, ep_shard_size, ep_size)
Mesh names: ("pp", "ep_shard", "ep")

Parameters:
  • pp_size – Pipeline parallel size.

  • ep_shard_size – Expert shard size (dp_cp_size // ep_size).

  • ep_size – Expert parallel size.

  • backend – Distributed backend (‘nccl’ or ‘gloo’).

Returns:

The MOE mesh for expert parallelism.

Return type:

DeviceMesh
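
To make the ep_shard_size relationship concrete: with dp_cp_size = 4 and ep_size = 2, ep_shard_size = 4 // 2 = 2. A sketch of the resulting layout using torch's init_device_mesh (not the module's internal code; assumes a matching number of ranks):

from torch.distributed.device_mesh import init_device_mesh

# Sketch only: pp=1, ep_shard=2, ep=2 covers 4 ranks.
pp, ep_shard, ep = 1, 2, 2
moe_mesh = init_device_mesh(
    "cuda",
    mesh_shape=(pp, ep_shard, ep),
    mesh_dim_names=("pp", "ep_shard", "ep"),
)
# "ep" splits experts across ranks; "ep_shard" spans the remaining
# data/context-parallel ranks (dp_cp_size // ep_size, per the docstring).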