MoE Training Optimization Workflow#

Stable docs: docs/training/moe-optimization.md Card: card.yaml (co-located) Source: Scalable Training of MoE Models with Megatron Core

Quick Reference#

Think in terms of the paper’s Three Walls:

memory wall
communication wall
compute and host-overhead wall

MoE tuning is iterative. Fixing one wall usually exposes the next one, so the best workflow is: fit first, scale second, profile third, then retune.

Phase 1: Make The Run Memory-Feasible#

Start with a configuration that fits reliably before chasing throughput.

Recommended order:

Use the smallest amount of model parallelism that still fits.
Turn on selective recompute before falling back to full recompute.
Add offloading only when recompute and parallelism are still insufficient.
Use --fake-init-process-group to sanity-check large parallel layouts on a single GPU before burning cluster time.

Recompute guidance#

Prefer selective recompute for MoE runs:

good first choices: layernorm, core_attn, moe_act, mlp, or model-specific modules (shared_experts, mla_up_proj)
use full recompute only when the run still does not fit
revisit recompute after enabling CUDA graphs, because some graph scopes and full recompute paths do not mix well

As a rule of thumb, fine-grained recompute often recovers most of the needed memory while keeping throughput much closer to the non-recompute baseline than full-layer recompute does.

Phase 2: Choose Parallelism For Scale#

Priority order:

Maximize DP once the model fits.
Keep the hot communication path inside the fast interconnect when possible.
Use PP, plus VPP if needed, for multi-node scaling.
Prefer EP over extra TP for expert layers.
Add CP for long context once sequence length makes attention memory dominant.

Parallel Folding#

Parallel Folding decouples attention and MoE parallelism so you do not have to pick a single compromise layout:

Attention: TP × CP × DP × PP
MoE:       ETP × EP × EDP × PP

Key knobs:

--expert-model-parallel-size
--expert-tensor-parallel-size

Use it when attention prefers some TP or CP, but expert layers benefit from a larger EP degree than the dense layers can tolerate.

Phase 3: Profile The Dominant Bottleneck#

Bottleneck	What it looks like	Primary fixes
Memory	Run fits only with aggressive full recompute or OOMs during warmup	selective recompute, FP8, offloading, better PP layout
Communication	Nsight shows large all-to-all or collective blocks	DeepEP or HybridEP, EP overlap, DP/TP overlap, better PP layout
Host overhead	GPU gaps, launch-bound traces, Python overhead	CUDA graphs, `--manual-gc`, higher MBS, CPU affinity tuning
Compute	Low SM utilization after comm and host issues are addressed	grouped GEMM, fusion work, FP8, dispatcher-specific kernel tuning

Dispatcher And Overlap Guidance#

Use dispatcher choice as a bottleneck fix, not as the first tuning knob.

moe_token_dispatcher_type="alltoall": safest bring-up path, fine for smaller EP sizes
moe_token_dispatcher_type="flex" + moe_flex_dispatcher_backend="deepep": strong default for H100 and B200 style deployments
moe_token_dispatcher_type="flex" + moe_flex_dispatcher_backend="hybridep": strongest starting point on GB200 or GB300 NVL72 systems

If the all-to-all path is visible in profiles, combine dispatcher tuning with:

--overlap-moe-expert-parallel-comm
--overlap-grad-reduce
--tp-comm-overlap

FP8 Recipe Quick Decision#

Platform	Recommended starting recipe
Hopper	FP8 blockwise
Blackwell	MXFP8
Blackwell, speed-first exploration	NVFP4 after the BF16 or FP8 path is stable

Keep the router in FP32. The largest wins usually come from expert GEMMs and other heavy matrix math, not from trying to quantize every small MoE component.

CUDA Graphs For MoE#

For dropless MoE, start with partial TE-scoped graphs:

attn
moe_router
moe_preprocess

That path usually gives a meaningful step-time win while keeping the dynamic expert work outside the graph. Expect a moderate speedup when launch overhead is visible, but budget several extra GB of memory and verify that shapes remain static.

Use full-iteration graphs only for graph-friendly workloads such as drop-and-pad or tightly controlled static-shape experiments.

Related references:

skills/perf-techniques/cuda-graphs/SKILL.md
docs/training/cuda-graphs.md
docs/training/activation-recomputation.md

Pitfalls#

Do not optimize in the wrong order: fitting the model and selecting sane parallelism matter more than micro-optimizations.
Platform changes the limiting wall: H100-class runs often feel more communication-bound, while GB200 or GB300 runs often expose CPU or launch overhead earlier.
FP8 MFU can look misleadingly low: compare absolute throughput as well as MFU when switching precision modes.
CUDA graphs and recompute interact: TE-scoped graphs are usually paired with selective recompute, not blanket full recompute.
Parallel Folding is not optional at large scale: once attention and expert layers want clearly different layouts, a single shared TP or EP plan becomes a tax on both.