nemo_automodel.components.distributed.mesh#
Typed MeshContext dataclass, validation, and strategy map.
MeshContext is the single source of truth for everything related to
distributed training: strategy config, device meshes, and axis names.
Parallelism sizes (tp_size, pp_size, etc.) are derived at runtime
from the attached DeviceMesh objects via @property. When no mesh
is present the properties return safe defaults (1 for sizes, None for
dp / hsdp).
All inputs and outputs are typed Python objects (dataclasses, enums, etc.).
YAML / dict parsing belongs in the recipe layer — see
nemo_automodel.recipes._dist_setup.
Module Contents#
Classes#
- MeshAxisName: Canonical mesh-dimension names used by DeviceMesh and helpers.
- MeshContext: Runtime distributed training context: configs + device meshes.
Functions#
- _get_axis_size: Return the size of axis if present in mesh, else default.
- _optional_axis: Return axis if present in mesh, else None.
- _validate_mesh_dim_names: Ensure every dimension name in the attached meshes is a MeshAxisName.
- _validate_distributed_setup: Validate cross-field constraints on a MeshContext.
Data#
- STRATEGY_MAP
- _VALID_AXIS_NAMES
API#
- nemo_automodel.components.distributed.mesh.STRATEGY_MAP: Dict[str, type]#
None
- class nemo_automodel.components.distributed.mesh.MeshAxisName#
Bases: str, enum.Enum
Canonical mesh-dimension names used by DeviceMesh and helpers.
Inherits from str so each member compares equal to (and can be used wherever) a plain string, e.g. MeshAxisName.TP == "tp".
Initialization
Initialize self. See help(type(self)) for accurate signature.
- PP#
‘pp’
- DP#
‘dp’
- DP_REPLICATE#
‘dp_replicate’
- DP_SHARD#
‘dp_shard’
- DP_SHARD_CP#
‘dp_shard_cp’
- DP_CP#
‘dp_cp’
- CP#
‘cp’
- TP#
‘tp’
- EP#
‘ep’
- EP_SHARD#
‘ep_shard’
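The string-compatibility described above can be illustrated with a small stand-alone sketch. `AxisName` below is a hypothetical local enum that copies only three of the member values listed here; it is not imported from the library, and the `sizes` dict is made up for the illustration.

```python
import enum


class AxisName(str, enum.Enum):
    """Hypothetical stand-in mirroring a few MeshAxisName members."""

    TP = "tp"
    PP = "pp"
    DP_SHARD = "dp_shard"


# Illustrative sizes keyed by plain strings (an assumption for this sketch).
sizes = {"tp": 2, "dp_shard": 4}

# Because the enum mixes in str, a member compares equal to its string value...
same_as_string = AxisName.TP == "tp"

# ...and hashes like that string, so it can index a dict keyed by plain strings.
tp_size = sizes[AxisName.TP]

# Membership tests against tuples of plain names also work.
has_shard_axis = AxisName.DP_SHARD in ("dp_shard", "tp")
```

Mixing in `str` means code that only knows plain dimension names keeps working when handed an enum member, while the enum still gives a closed set of valid names to validate against.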
- nemo_automodel.components.distributed.mesh._VALID_AXIS_NAMES: frozenset#
‘frozenset(…)’
- class nemo_automodel.components.distributed.mesh.MeshContext#
Runtime distributed training context: configs + device meshes.
Parallelism sizes (tp_size, pp_size, etc.) are not stored as fields; they are @property accessors that read directly from the attached DeviceMesh / moe_mesh. When no mesh is present the properties return safe defaults (1 for sizes, None for dp / hsdp).
All DeviceMesh objects passed in must use dimension names from MeshAxisName; a ValueError is raised on construction if any unknown name is encountered.
Lifecycle#
1. Recipes parse YAML to obtain sizes and strategy configs.
2. Sizes are passed to create_device_mesh to build DeviceMesh objects.
3. MeshContext is created with those meshes; dimension names are validated automatically in __post_init__.
Alternatively, from_meshes constructs an instance directly from DeviceMesh objects (used by NeMoAutoModel.from_pretrained).
Attributes:
- strategy_config: Strategy-specific config (FSDP2, MegatronFSDP, or DDP).
- device_mesh: Device mesh for distributed training.
- moe_mesh: MoE-specific device mesh.
- pipeline_config: Pipeline-parallel schedule/splitting config.
- moe_config: MoE parallelizer settings.
- activation_checkpointing: Whether activation checkpointing is enabled.
- strategy_config: Optional[Union[nemo_automodel.components.distributed.config.FSDP2Config, nemo_automodel.components.distributed.config.MegatronFSDPConfig, nemo_automodel.components.distributed.config.DDPConfig]]#
None
- pipeline_config: Optional[nemo_automodel.components.distributed.pipelining.config.PipelineConfig]#
None
- moe_config: Optional[nemo_automodel.components.moe.config.MoEParallelizerConfig]#
None
- activation_checkpointing: bool#
False
- device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh]#
‘field(…)’
- moe_mesh: Optional[torch.distributed.device_mesh.DeviceMesh]#
‘field(…)’
- __post_init__() -> None#
- property pp_size: int#
Pipeline-parallel degree (from device_mesh, default 1).
- property pp_enabled: bool#
True when pp_size > 1.
- property tp_size: int#
Tensor-parallel degree (from device_mesh, default 1).
- property cp_size: int#
Context-parallel degree (from device_mesh, default 1).
- property ep_size: int#
Expert-parallel degree (from moe_mesh, default 1).
- property dp_size: Optional[int]#
Data-parallel degree (from device_mesh, default None).
- property dp_replicate_size: Optional[int]#
HSDP replication degree (from device_mesh, default None).
- _dp_axis_names() -> Tuple[str, ...]#
DP axis names for FSDP mesh slicing.
- pipeline_axis_kwargs() -> Dict[str, object]#
Axis-name kwargs for AutoPipeline.
- parallelize_axis_kwargs() -> Dict[str, object]#
Axis-name kwargs for parallelize_fn (EP/FSDP, no pp_axis_name).
- classmethod from_meshes(
- device_mesh: Optional[torch.distributed.device_mesh.DeviceMesh],
- moe_mesh: Optional[torch.distributed.device_mesh.DeviceMesh] = None,
- *,
- strategy_config: Optional[Union[nemo_automodel.components.distributed.config.FSDP2Config, nemo_automodel.components.distributed.config.MegatronFSDPConfig, nemo_automodel.components.distributed.config.DDPConfig]] = None,
- pipeline_config: Optional[nemo_automodel.components.distributed.pipelining.config.PipelineConfig] = None,
- moe_config: Optional[nemo_automodel.components.moe.config.MoEParallelizerConfig] = None,
- activation_checkpointing: bool = False,
Build a MeshContext from DeviceMesh objects.
This is the entry point used by NeMoAutoModel.from_pretrained / from_config, where the caller has raw meshes rather than a parsed YAML config.
- nemo_automodel.components.distributed.mesh._get_axis_size(
- mesh: Optional[torch.distributed.device_mesh.DeviceMesh],
- axis: nemo_automodel.components.distributed.mesh.MeshAxisName,
- default=1,
Return the size of axis if present in mesh, else default.
- nemo_automodel.components.distributed.mesh._optional_axis(
- mesh: Optional[torch.distributed.device_mesh.DeviceMesh],
- axis: nemo_automodel.components.distributed.mesh.MeshAxisName,
Return axis if present in mesh, else None.
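The two lookup helpers above, and the "safe default" behaviour of the size properties, can be sketched without a real process group. This is only an illustration under assumptions: `FakeMesh` is a hypothetical stand-in exposing named axes and a shape tuple, and `get_axis_size` / `optional_axis` are local re-implementations of the documented behaviour, not the library's code.

```python
from typing import Optional, Tuple


class FakeMesh:
    """Hypothetical stand-in for a device mesh: named axes with integer sizes."""

    def __init__(self, dim_names: Tuple[str, ...], shape: Tuple[int, ...]):
        self.mesh_dim_names = dim_names
        self.shape = shape


def get_axis_size(mesh: Optional[FakeMesh], axis: str, default=1):
    """Return the size of `axis` if it is a dimension of `mesh`, else `default`."""
    if mesh is None or axis not in mesh.mesh_dim_names:
        return default
    return mesh.shape[mesh.mesh_dim_names.index(axis)]


def optional_axis(mesh: Optional[FakeMesh], axis: str) -> Optional[str]:
    """Return `axis` if it is a dimension of `mesh`, else None."""
    if mesh is None or axis not in mesh.mesh_dim_names:
        return None
    return axis


# A 4 x 2 mesh with a sharded data-parallel axis and a tensor-parallel axis.
mesh = FakeMesh(("dp_shard", "tp"), (4, 2))

tp_size = get_axis_size(mesh, "tp")                # axis present: its size
pp_size = get_axis_size(mesh, "pp")                # axis absent: default of 1
dp_size = get_axis_size(None, "dp", default=None)  # no mesh: caller-chosen default
tp_axis = optional_axis(mesh, "tp")                # axis present: the name itself
pp_axis = optional_axis(mesh, "pp")                # axis absent: None
```

Routing every size lookup through one helper with an explicit default is what lets single-process code read `tp_size` or `pp_size` without special-casing a missing mesh.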
- nemo_automodel.components.distributed.mesh._validate_mesh_dim_names(
- mesh_context: nemo_automodel.components.distributed.mesh.MeshContext,
Ensure every dimension name in the attached meshes is a MeshAxisName.
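A minimal sketch of construction-time name validation follows, assuming the check simply compares each mesh's dimension names against the set of known axis names. `FakeMesh`, `SketchContext`, and `validate_dim_names` are hypothetical stand-ins written for this illustration; the name set copies the MeshAxisName values listed on this page, and the error message wording is invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Axis names copied from the MeshAxisName members documented above.
VALID_AXIS_NAMES = frozenset(
    {"pp", "dp", "dp_replicate", "dp_shard", "dp_shard_cp",
     "dp_cp", "cp", "tp", "ep", "ep_shard"}
)


@dataclass
class FakeMesh:
    """Hypothetical stand-in that carries only dimension names."""

    mesh_dim_names: Tuple[str, ...]


def validate_dim_names(ctx: "SketchContext") -> None:
    """Raise ValueError if any attached mesh uses an unknown dimension name."""
    for label, mesh in (("device_mesh", ctx.device_mesh), ("moe_mesh", ctx.moe_mesh)):
        if mesh is None:
            continue  # nothing to validate when no mesh is attached
        unknown = [n for n in mesh.mesh_dim_names if n not in VALID_AXIS_NAMES]
        if unknown:
            raise ValueError(f"{label} has unknown dimension names: {unknown}")


@dataclass
class SketchContext:
    """Hypothetical context that validates its meshes on construction."""

    device_mesh: Optional[FakeMesh] = None
    moe_mesh: Optional[FakeMesh] = None

    def __post_init__(self) -> None:
        validate_dim_names(self)


# Known names are accepted; an unknown name fails at construction time.
accepted = SketchContext(device_mesh=FakeMesh(("dp_shard", "tp")))
rejected = None
try:
    SketchContext(device_mesh=FakeMesh(("dp_shard", "model")))
except ValueError as err:
    rejected = str(err)
```

Validating in `__post_init__` means a misnamed axis surfaces when the context is built rather than later, when a size property silently falls back to its default.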
- nemo_automodel.components.distributed.mesh._validate_distributed_setup(
- mesh_context: nemo_automodel.components.distributed.mesh.MeshContext,
Validate cross-field constraints on a MeshContext.
Called automatically by MeshContext.__post_init__ when a strategy_config is present. Can also be invoked explicitly after mutating a context.
- Raises:
ValueError – If any constraint is violated.