nemo_automodel.components.moe.parallelizer
Module Contents
Classes
Functions
Data
API
Bases: ParallelStyle
The ExpertParallel class shards the MoE parameters across the EP mesh.
Each parameter is sharded on dim 0, which is the expert dimension.
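As a rough illustration of what sharding on dim 0 means, the sketch below computes which experts one EP rank would hold. expert_slice is a hypothetical helper, not part of this module, and the even split across ranks is an assumption made for simplicity.

```python
def expert_slice(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Return the expert indices (rows of dim 0) held by one EP rank.

    Hypothetical helper for illustration only; assumes an even split.
    """
    if num_experts % ep_size != 0:
        raise ValueError("this sketch assumes num_experts is divisible by ep_size")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return range(start, start + per_rank)


# A grouped weight of shape [num_experts, in_features, out_features] keeps
# its trailing dims intact; only the leading expert dim is split.
full_shape = (8, 1024, 4096)
local_experts = expert_slice(num_experts=full_shape[0], ep_size=4, ep_rank=1)
local_shape = (len(local_experts), *full_shape[1:])
```

With 8 experts and an EP mesh of size 4, each rank ends up with 2 experts and the per-expert matrices stay whole.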
Return the model-level MoE config exposed by custom MoE architectures.
Return True when the AC mode requests selective checkpointing.
Kept inline (rather than imported from the dense FSDP2 parallelizer) so that
threading the mode does not pull the heavy distributed.parallelizer module
into the lightweight call path.
Yield decoder blocks that may contain MoE sublayers.
Covers the main backbone (backbone.layers) plus an optional MTP
auxiliary head (model_wrapper.mtp.layers) when present. MTP sublayers
are not registered under backbone.layers but carry the same MoE
structure and must receive the same EP / FSDP treatment so their
state-dict round-trips cleanly.
Parameters:
Outer model (e.g. NemotronHForCausalLM); its mtp attribute may carry
the MTP head.
Inner backbone (model_wrapper.model, possibly text-only
after VLM unwrapping) whose .layers holds the main decoder
stack.
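The traversal described above can be sketched roughly as follows. The function name iter_decoder_blocks, the parameter names, and the backbone-first ordering are assumptions inferred from the attribute paths in the description; the real helper may differ.

```python
from types import SimpleNamespace
from typing import Any, Iterator


def iter_decoder_blocks(model_wrapper: Any, backbone: Any) -> Iterator[Any]:
    """Yield main decoder blocks, then MTP blocks when an MTP head exists.

    Hypothetical sketch: assumes .layers is directly iterable over blocks.
    """
    yield from backbone.layers
    mtp = getattr(model_wrapper, "mtp", None)
    layers = getattr(mtp, "layers", None) if mtp is not None else None
    if layers is not None:
        yield from layers


# Stand-in objects instead of real nn.Module instances.
backbone = SimpleNamespace(layers=["block0", "block1"])
with_mtp = SimpleNamespace(model=backbone, mtp=SimpleNamespace(layers=["mtp0"]))
without_mtp = SimpleNamespace(model=backbone)

blocks_with = list(iter_decoder_blocks(with_mtp, backbone))
blocks_without = list(iter_decoder_blocks(without_mtp, backbone))
```

Walking both stacks through one iterator lets the EP and FSDP passes treat MTP blocks exactly like backbone blocks.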
Return True when two modules expose the same weight parameter object.
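A minimal sketch of such a check, assuming it compares object identity rather than values. The name shares_weight is hypothetical, and plain objects stand in for real modules.

```python
from types import SimpleNamespace
from typing import Any


def shares_weight(a: Any, b: Any) -> bool:
    """Return True only if both objects expose the very same weight object."""
    wa = getattr(a, "weight", None)
    wb = getattr(b, "weight", None)
    # Identity ("is"), not equality: equal values in separate storage are not tied.
    return wa is not None and wa is wb


shared = [1.0, 2.0]  # stand-in for a parameter tensor
embed = SimpleNamespace(weight=shared)
lm_head = SimpleNamespace(weight=shared)
other = SimpleNamespace(weight=[1.0, 2.0])  # same values, different object
```

Identity matters here because tied weights (for example an embedding and an output head) must not be wrapped as two independent parameters.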
FSDP shard placement for grouped-expert params.
Shard on dim=1 for the (>=2D) expert weights since there may be more shards than experts (dim=0). A 1D param (e.g. the per-expert bias of the experts="te" GroupedLinear path, shape [out_features]) has no dim 1, so shard it on dim 0 instead. FSDP all-gathers before use, so the shard dim is a storage detail and does not change compute.
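The placement rule above reduces to a choice based on the parameter's rank. The sketch below returns only the dimension index; expert_shard_dim is a hypothetical name, and the real function presumably wraps the result in an FSDP shard placement.

```python
from typing import Tuple


def expert_shard_dim(param_shape: Tuple[int, ...]) -> int:
    """Pick the FSDP shard dim for a grouped-expert parameter.

    >=2D weights shard on dim 1, because dim 0 (experts) may have fewer
    entries than there are shards. 1D params have no dim 1 and use dim 0.
    """
    return 1 if len(param_shape) >= 2 else 0


weight_dim = expert_shard_dim((8, 1024, 4096))  # grouped expert weight
bias_dim = expert_shard_dim((4096,))            # 1D per-expert bias
```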
Shard each _fp32_params holder in block as its own fp32 FSDP unit.
Model implementations own the architecture-specific decision to create these
holders (for example Qwen3.5/Qwen3-Next GatedDeltaNet A_log/dt_bias).
FSDP only treats the holder as a dtype-uniform fp32 unit and excludes its params
from the block’s bf16 FSDP unit.
Returns the set of holder parameters to exclude from the block’s FSDP wrap.
Blocks that do not expose named_modules (e.g. non-nn.Module test
stubs) cannot hold fp32 holders, so an empty set is returned.
Apply activation checkpointing to the model.
Parameters:
The model to apply activation checkpointing to.
If True (the default), saves the MoE router output so the dispatch is not recomputed under activation checkpointing (avoids a CheckpointError from non-deterministic re-routing on recompute). If False, a warning is emitted.
Hidden dimension size. If None, derived from model.config.hidden_size.
Number of routed experts. If None, derived from moe_config.n_routed_experts first, then falls back to model.config attributes.
If True, applies TorchTitan-style per-op selective activation checkpointing
(shared with the dense FSDP2 path) to each block. Takes precedence over
ignore_router; the shared policy already saves expert-parallel communication
collectives and topk, so it composes with expert parallelism.
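The precedence between the two checkpointing flags and the hidden-size fallback can be sketched as below. The function names, the selective parameter name, and the returned strategy labels are hypothetical and only illustrate the order of the decisions described above. Only ignore_router and model.config.hidden_size come from the description.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_ac_strategy(selective: bool, ignore_router: bool = True) -> str:
    """Return an illustrative label for the checkpointing strategy chosen."""
    if selective:
        # Selective per-op checkpointing takes precedence over ignore_router.
        return "selective"
    if ignore_router:
        # Router output is saved, so dispatch is not recomputed.
        return "full_save_router"
    # Router is recomputed; the documented behaviour is to emit a warning.
    return "full_recompute_router"


def resolve_hidden_size(hidden_size: Optional[int], model: Any) -> int:
    """Use the explicit value if given, else fall back to the model config."""
    return hidden_size if hidden_size is not None else model.config.hidden_size


model = SimpleNamespace(config=SimpleNamespace(hidden_size=4096))
```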
Configure context parallelism for attention and MoE layers.
Apply EP to the MoE module.
Apply FSDP wrapping to MoE transformer blocks and model-level modules.
Apply context, expert, activation-checkpointing, and FSDP parallelism.