`nemo_automodel.components.moe.parallelizer`#

Module Contents#

Classes#

ExpertParallel

The ExpertParallel class shards the MoE parameters on the expert-parallel (EP) mesh. Dim 0 of each parameter is sharded because it is the expert dimension.

Functions#

`_is_deepseek_v4_model`
`_get_cp_stream`
`_iter_transformer_and_mtp_blocks`
`_get_moe_module`
`_iter_moe_blocks`	Yield decoder blocks that may contain MoE sublayers.
`apply_ep`	Applies EP to MoE module.
`apply_ac`	Apply activation checkpointing to the model.
`apply_fsdp`	Apply FSDP wrapping to MoE transformer blocks and model-level modules.
`apply_cp`	Configure context parallelism for attention and MoE layers.
`parallelize_model`	Apply context, expert, activation-checkpointing, and FSDP parallelism.

Data#

`logger`
`_CP_STREAM`

API#

nemo_automodel.components.moe.parallelizer.logger#: ‘getLogger(…)’

nemo_automodel.components.moe.parallelizer._CP_STREAM#: None

nemo_automodel.components.moe.parallelizer._is_deepseek_v4_model(model: torch.nn.Module) → bool#

nemo_automodel.components.moe.parallelizer._get_cp_stream() → torch.cuda.Stream#

nemo_automodel.components.moe.parallelizer._iter_transformer_and_mtp_blocks(model: torch.nn.Module)#

nemo_automodel.components.moe.parallelizer._get_moe_module(block: torch.nn.Module) → nemo_automodel.components.moe.layers.MoE | None#

class nemo_automodel.components.moe.parallelizer.ExpertParallel#

Bases: torch.distributed.tensor.parallel.ParallelStyle

The ExpertParallel class shards the MoE parameters on the expert-parallel (EP) mesh. Dim 0 of each parameter is sharded because it is the expert dimension.
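To make the dim-0 sharding concrete, the following is a minimal pure-Python sketch of how an expert dimension could be split across EP ranks. It is not the library's implementation: the helper names `experts_per_rank` and `local_shard_shape` are hypothetical, and the sketch assumes the number of experts divides evenly by the EP mesh size.

```python
def experts_per_rank(num_experts: int, ep_size: int) -> list[range]:
    """Return the global expert indices each EP rank would own (hypothetical helper)."""
    if num_experts % ep_size != 0:
        # Assumption: only the evenly divisible case is handled in this sketch.
        raise ValueError("num_experts must be divisible by ep_size")
    local = num_experts // ep_size
    # Rank r owns the contiguous slice [r * local, (r + 1) * local) of dim 0.
    return [range(r * local, (r + 1) * local) for r in range(ep_size)]


def local_shard_shape(global_shape: tuple[int, ...], ep_size: int) -> tuple[int, ...]:
    """Shape of one rank's shard when only dim 0 (the expert dim) is split."""
    num_experts = global_shape[0]
    if num_experts % ep_size != 0:
        raise ValueError("dim 0 must be divisible by ep_size")
    # Every dimension other than dim 0 stays at its full size.
    return (num_experts // ep_size, *global_shape[1:])
```

Under these assumptions, a parameter whose leading dimension is 8 experts, sharded over 4 EP ranks, leaves each rank with 2 experts while the remaining dimensions are unchanged.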

_partition_fn(name, module, device_mesh)#

_apply(module: torch.nn.Module, device_mesh: torch.distributed.device_mesh.DeviceMesh) → torch.nn.Module#

nemo_automodel.components.moe.parallelizer._iter_moe_blocks(model_wrapper: torch.nn.Module, backbone: torch.nn.Module)#

Yield decoder blocks that may contain MoE sublayers.

Covers the main backbone (backbone.layers) plus an optional MTP auxiliary head (model_wrapper.mtp.layers) when present. MTP sublayers are not registered under backbone.layers but carry the same MoE structure and must receive the same EP / FSDP treatment so their state-dict round-trips cleanly.

Parameters:

model_wrapper – Outer model (e.g. NemotronHForCausalLM), whose mtp attribute may carry the MTP head.
backbone – Inner backbone (model_wrapper.model, possibly text-only after VLM unwrapping) whose .layers holds the main decoder stack.
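The traversal described above can be sketched as follows. This is an illustration only, not the library's code: the function name `iter_blocks_sketch` is hypothetical, `types.SimpleNamespace` objects stand in for real modules, and `layers` is assumed to be a plain iterable of blocks.

```python
from types import SimpleNamespace
from typing import Any, Iterator


def iter_blocks_sketch(model_wrapper: Any, backbone: Any) -> Iterator[Any]:
    """Yield main decoder blocks, then MTP blocks if the wrapper has them."""
    # Main decoder stack lives under backbone.layers.
    yield from backbone.layers
    # The MTP head is optional; it hangs off the outer wrapper, not the backbone.
    mtp = getattr(model_wrapper, "mtp", None)
    if mtp is not None:
        yield from getattr(mtp, "layers", [])


# Stand-in objects (assumptions for illustration, not real model classes).
backbone = SimpleNamespace(layers=["block0", "block1"])
with_mtp = SimpleNamespace(model=backbone, mtp=SimpleNamespace(layers=["mtp0"]))
without_mtp = SimpleNamespace(model=backbone)
```

Iterating `iter_blocks_sketch(with_mtp, backbone)` visits the backbone blocks first and the MTP block last, while the wrapper without an `mtp` attribute yields only the backbone blocks.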

nemo_automodel.components.moe.parallelizer.apply_ep(model: torch.nn.Module, ep_mesh: torch.distributed.device_mesh.DeviceMesh, moe_mesh: torch.distributed.device_mesh.DeviceMesh | None = None)#

Applies EP to MoE module.

nemo_automodel.components.moe.parallelizer.apply_ac(model: torch.nn.Module, ignore_router: bool = False, hidden_size: int | None = None, num_experts: int | None = None)#

Apply activation checkpointing to the model.