# MoE Hardware Configuration Reference
Stable docs: docs/training/moe-optimization.md
Card: card.yaml (co-located)
## Quick Platform Playbook
| Platform | Typical MoE strategy | What usually matters most |
|---|---|---|
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |
## Rounded Performance Bands
These are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises.
| Workload family | Hardware | Typical band | Representative shape |
|---|---|---|---|
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |
## Representative Config Families
### DSV3 on H100
- Dispatcher: DeepEP
- TP=2 EP=64 PP=8 VPP=4
- Routing: force balance
- Recompute: light-to-moderate selective recompute
- Priority: overlap communication and keep PP efficient
### DSV3 on B200
- Dispatcher: DeepEP
- TP=1 EP=32 PP=8 VPP=2 or similar
- Precision: MXFP8-class
- Recompute: selective recompute around MLA up-projection and MLP-side modules
- Priority: container quality, PP layout, and DeepEP SM-count tuning
### DSV3 on GB200 or GB300
- Dispatcher: HybridEP
- TP=1 EP=64 PP=4 VPP=4
- Precision: MXFP8-class
- CUDA Graph: attn + moe_router + moe_preprocess
- Priority: HybridEP, CPU optimization, and graph-friendly static shapes
### Qwen3 235B on H100
- Dispatcher: DeepEP
- TP=2 EP=32 PP=8 VPP=4
- Recompute: norm and activation-side selective recompute
- Priority: communication overlap and router-path cleanup
### Qwen3 235B on GB200
- Dispatcher: HybridEP
- TP=1 or 2, EP=32 to 64, PP=4
- CUDA Graph: attn + moe_router + moe_preprocess
- Recompute: moe_act, mlp, or norm depending on memory pressure
- Priority: balance throughput against memory headroom
### Qwen3-Next 80B on GB200
- Dispatcher: HybridEP
- TP=1 EP=32 PP=2 VPP around 4
- CUDA Graph: attn + moe_router + moe_preprocess
- Priority: pipeline layout and grouped GEMM quality
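
The TP/EP/PP shapes listed for the config families above also constrain how many GPUs a run can use. The sketch below is a rough sanity check, not a definitive rule: it assumes a Megatron-style parallel folding scheme in which, inside each pipeline stage, the rank count must be divisible by both TP (attention side) and EP (expert side), with expert TP equal to 1 and no context parallelism. The function names `min_world_size` and `fits` are hypothetical.

```python
from math import lcm


def min_world_size(tp: int, ep: int, pp: int) -> int:
    """Smallest GPU count that fits the layout under the stated assumptions."""
    # Each pipeline stage needs a rank count divisible by TP and by EP
    # (assumption: expert TP = 1, no context parallelism).
    return pp * lcm(tp, ep)


def fits(world_size: int, tp: int, ep: int, pp: int) -> bool:
    """True if world_size is a whole multiple of the minimum layout size."""
    return world_size % min_world_size(tp, ep, pp) == 0


# (TP, EP, PP) shapes copied from the config families in this document.
shapes = {
    "DSV3 on H100": (2, 64, 8),
    "DSV3 on B200": (1, 32, 8),
    "DSV3 on GB200": (1, 64, 4),
    "Qwen3-Next 80B on GB200": (1, 32, 2),
}

for name, (tp, ep, pp) in shapes.items():
    print(f"{name}: multiples of {min_world_size(tp, ep, pp)} GPUs")
```

Any remaining factor of the world size becomes data parallelism, so a job size that fails this check usually means the shape has to change before tuning starts.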
## Cross-Cutting Patterns
### PP layout
Layout legend: E = embedding, t = transformer, m = MTP, L = loss, | = stage boundary
The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.
### Recompute strategy
| Memory pressure | Starting point |
|---|---|
| low | none or a very narrow selective set |
| moderate | |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |
### Environment variables
CUDA_DEVICE_MAX_CONNECTIONS=1 # baseline setting; use this or the value below, not both
CUDA_DEVICE_MAX_CONNECTIONS=32 # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
### CPU-side tuning
On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the needle almost as much as a dispatcher swap. Treat them as first-class tuning work, not as afterthoughts.
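
As a minimal illustration of CPU affinity, the sketch below gives each local rank its own contiguous slice of the cores visible to the process. It is deliberately naive: it does not look at NUMA or GPU-to-CPU topology, which real runs typically handle through the launcher or `numactl`, and `os.sched_getaffinity` / `os.sched_setaffinity` are available on Linux only. The helper names are hypothetical.

```python
import os


def cores_for_rank(local_rank: int, ranks_per_node: int, cpu_ids: list[int]) -> list[int]:
    """Return an equal, contiguous share of cpu_ids for one local rank."""
    if not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank must be in [0, ranks_per_node)")
    per_rank = len(cpu_ids) // ranks_per_node
    if per_rank == 0:
        raise ValueError("fewer CPU cores than ranks on this node")
    start = local_rank * per_rank
    return cpu_ids[start:start + per_rank]


def pin_current_process(local_rank: int, ranks_per_node: int) -> list[int]:
    """Pin the calling process to its slice of cores (Linux only)."""
    visible = sorted(os.sched_getaffinity(0))
    chosen = cores_for_rank(local_rank, ranks_per_node, visible)
    os.sched_setaffinity(0, chosen)
    return chosen


# Pure planning step, safe to run anywhere: 16 cores shared by 4 local ranks.
print(cores_for_rank(1, 4, list(range(16))))
```

In a training script, `pin_current_process` would be called once at startup with the local rank supplied by the launcher (for example the `LOCAL_RANK` variable that `torchrun` sets), before any worker threads are created.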
## Pitfalls
- Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.
- Container quality matters: large regressions can come from the software stack rather than the model recipe.
- VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.
- Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.
- Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.
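
The MFU pitfall is easiest to see with arithmetic. MFU is achieved throughput divided by the peak throughput of the precision used as the denominator, so a run can be faster in absolute terms and still report a lower MFU. The sketch below uses round placeholder peaks and throughputs chosen only for illustration; they are not vendor specifications or measured results.

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization as a fraction of the chosen peak."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


# Placeholder peaks (assumption): the FP8 peak is taken as twice the BF16 peak.
PEAK_TFLOPS = {"bf16": 1000.0, "fp8": 2000.0}

# Hypothetical runs: the FP8 run is faster in absolute TFLOPS/GPU.
runs = {"bf16": 300.0, "fp8": 400.0}

for precision, achieved in runs.items():
    ratio = mfu(achieved, PEAK_TFLOPS[precision])
    print(f"{precision}: {achieved:.0f} TFLOPS/GPU, MFU {ratio:.0%}")
```

With these placeholder numbers the FP8 run delivers more throughput but a lower MFU, which is why absolute TFLOPS/GPU (or tokens per second) should sit next to MFU in any cross-precision comparison.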