MoE Hardware Configuration Reference#

Stable docs: docs/training/moe-optimization.md

Card: card.yaml (co-located)

Quick Platform Playbook#

| Platform | Typical MoE strategy | What usually matters most |
| --- | --- | --- |
| H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency |
| B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings |
| GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom |
| GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling |

Rounded Performance Bands#

These are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises.

| Workload family | Hardware | Typical band | Representative shape |
| --- | --- | --- | --- |
| DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP |
| DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP |
| DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP |
| DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP |
| Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP |
| Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP |
| Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP |
| Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP |

Representative Config Families#

DSV3 on H100#

Dispatcher: DeepEP
TP=2  EP=64  PP=8  VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient

DSV3 on B200#

Dispatcher: DeepEP
TP=1  EP=32  PP=8  VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SMS tuning

DSV3 on GB200 or GB300#

Dispatcher: HybridEP
TP=1  EP=64  PP=4  VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes

Qwen3 235B on H100#

Dispatcher: DeepEP
TP=2  EP=32  PP=8  VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup

Qwen3 235B on GB200#

Dispatcher: HybridEP
TP=1 or 2  EP=32 to 64  PP=4
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom

Qwen3-Next 80B on GB200#

Dispatcher: HybridEP
TP=1  EP=32  PP=2  VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality
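
A quick sanity check for any of the shapes above is whether a given GPU count can host it at all. The sketch below assumes a Megatron-style convention that this page does not state: within each pipeline stage, the GPU count must be divisible by TP for the dense layers and by ETP × EP for the expert layers, with expert tensor parallelism (ETP) defaulting to 1 and context parallelism ignored. The helper names `min_gpus` and `fits` are hypothetical.

```python
from math import lcm


def min_gpus(tp: int, ep: int, pp: int, etp: int = 1) -> int:
    """Smallest GPU count that fits both the dense and the expert grid.

    Assumption (not from this page): per pipeline stage, the GPU count
    must be a multiple of TP and of ETP * EP, so the minimum per stage
    is their least common multiple.
    """
    return pp * lcm(tp, etp * ep)


def fits(world_size: int, tp: int, ep: int, pp: int, etp: int = 1) -> bool:
    """True if world_size is a whole multiple of the minimum grid."""
    return world_size % min_gpus(tp, ep, pp, etp) == 0


# Shapes copied from the config families and performance bands on this page.
shapes = {
    "DSV3 on H100": dict(tp=2, ep=64, pp=8),
    "DSV3 on B200": dict(tp=1, ep=32, pp=8),
    "DSV3 on GB200": dict(tp=1, ep=64, pp=4),
    "Qwen3 30B on H100": dict(tp=1, ep=8, pp=1),
}
for name, shape in shapes.items():
    print(f"{name}: at least {min_gpus(**shape)} GPUs")
```

Under these assumptions the minimum only says whether a shape is feasible on a cluster; it says nothing about whether the shape is fast there.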

Cross-Cutting Patterns#

PP layout#

Legend for the symbols used in PP layout strings:

  • E = embedding

  • t = transformer

  • m = MTP

  • L = loss

  • | = stage boundary

The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.
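
To make stage balance concrete, here is a minimal sketch that reads a layout string written with the legend above and reports how many transformer layers each stage carries. It assumes a simplified format in which every character is one symbol and `|` separates stages; any repetition shorthand a real launcher might accept is not handled. The example string and the helper names are hypothetical, not a tuned recipe.

```python
from collections import Counter

# Legend from this section; '|' is handled separately as the stage boundary.
SYMBOLS = {"E": "embedding", "t": "transformer", "m": "MTP", "L": "loss"}


def parse_layout(layout: str) -> list[Counter]:
    """Split a layout string on '|' and count the symbols in each stage."""
    stages = []
    for index, stage in enumerate(layout.split("|")):
        unknown = set(stage) - set(SYMBOLS)
        if unknown:
            raise ValueError(f"stage {index}: unknown symbols {sorted(unknown)}")
        stages.append(Counter(stage))
    return stages


def transformer_spread(layout: str) -> int:
    """Gap between the most and least loaded stage, counted in 't' layers."""
    counts = [stage["t"] for stage in parse_layout(layout)]
    return max(counts) - min(counts)


example = "Ett|ttt|ttt|tmL"  # hypothetical 4-stage layout
for index, stage in enumerate(parse_layout(example)):
    print(index, {SYMBOLS[symbol]: count for symbol, count in stage.items()})
print("transformer spread:", transformer_spread(example))
```

A nonzero spread is not automatically a problem: the first and last stages also carry embedding, MTP, or loss work, so giving them fewer transformer layers is often deliberate.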

Recompute strategy#

| Memory pressure | Starting point |
| --- | --- |
| low | none or a very narrow selective set |
| moderate | moe_act, mlp, norm, or similar selective modules |
| high | model-specific up-projection plus selective MoE and MLP modules |
| extreme or long-context | full recompute only if the selective path still does not fit |

Environment variables#

CUDA_DEVICE_MAX_CONNECTIONS=1    # usual starting point; set only one of these two values
CUDA_DEVICE_MAX_CONNECTIONS=32   # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0
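
Because the two `CUDA_DEVICE_MAX_CONNECTIONS` values are alternatives, it can help to pick one in code rather than by hand. The following is a minimal sketch, assuming the variables are set in the launching Python process before any CUDA or PyTorch initialization; the boolean switch and the function names are hypothetical simplifications of the guidance above.

```python
import os


def moe_env(ep_overlap_with_cuda_graphs: bool) -> dict[str, str]:
    """Return the environment variables from this section as a dict.

    The boolean switch is a hypothetical simplification: this page only
    says 32 is common when EP overlap and CUDA graphs are combined.
    """
    max_connections = "32" if ep_overlap_with_cuda_graphs else "1"
    return {
        "CUDA_DEVICE_MAX_CONNECTIONS": max_connections,
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        "NCCL_GRAPH_REGISTER": "0",
    }


def apply_env(overrides: dict[str, str]) -> None:
    """Set variables in os.environ without clobbering existing values."""
    for key, value in overrides.items():
        os.environ.setdefault(key, value)


env = moe_env(ep_overlap_with_cuda_graphs=True)
apply_env(env)
print(env)
```

Using `setdefault` means a value exported by the job script wins over the sketch's choice, which keeps per-run overrides easy to audit.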

CPU-side tuning#

On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the needle almost as much as a dispatcher swap. Treat them as first-class tuning work, not as afterthoughts.
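
As one illustration of CPU affinity work, the sketch below gives each local rank an equal, contiguous slice of the cores the process is allowed to use and pins the process to that slice. It is Linux-only (`os.sched_getaffinity` and `os.sched_setaffinity`), the function names are hypothetical, and the contiguous split is an assumption: it ignores NUMA and GPU topology, which a production setup should respect.

```python
import os


def cores_for_rank(local_rank: int, ranks_per_node: int, cores: list[int]) -> list[int]:
    """Give each local rank an equal, contiguous slice of the core list."""
    per_rank = len(cores) // ranks_per_node
    if per_rank == 0:
        raise ValueError("fewer cores than ranks")
    start = local_rank * per_rank
    return cores[start:start + per_rank]


def pin_current_process(local_rank: int, ranks_per_node: int) -> list[int]:
    """Pin this process to its slice of the currently allowed cores (Linux only)."""
    allowed = sorted(os.sched_getaffinity(0))
    chosen = cores_for_rank(local_rank, ranks_per_node, allowed)
    os.sched_setaffinity(0, chosen)
    return chosen


# Pure mapping with no side effects: rank 1 of 4 on a 16-core host.
print(cores_for_rank(1, 4, list(range(16))))
```

In a launcher that exports `LOCAL_RANK` (torchrun does), `pin_current_process(int(os.environ["LOCAL_RANK"]), ranks_per_node)` would be called once at startup, before worker threads are created.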

Pitfalls#

  1. Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.

  2. Container quality matters: large regressions can come from the software stack rather than the model recipe.

  3. VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.

  4. Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.

  5. Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.
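
Pitfall 4 is easy to see with arithmetic: MFU is achieved throughput divided by a peak, and the peak changes with the precision mode. The sketch below assumes approximate dense peaks for an H100 SXM (about 989 TFLOPS for BF16 and about 1979 TFLOPS for FP8, rounded vendor figures that do not appear on this page) and a hypothetical run at 300 TFLOPS/GPU.

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs utilization: achieved throughput over the chosen peak."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


# Assumed dense peaks for an H100 SXM, rounded; not taken from this page.
H100_PEAK_BF16 = 989.0
H100_PEAK_FP8 = 1979.0

achieved = 300.0  # hypothetical run, TFLOPS/GPU
print(f"vs BF16 peak: {mfu(achieved, H100_PEAK_BF16):.1%}")
print(f"vs FP8 peak:  {mfu(achieved, H100_PEAK_FP8):.1%}")
```

The same run reads as roughly 30% MFU against the BF16 peak and roughly 15% against the FP8 peak, which is why absolute TFLOPS/GPU is the safer number to compare across precision modes.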