MoE (Mixture of Experts)

TensorRT includes built-in support for MoE (Mixture of Experts) in transformer models through the IMoELayer API, which is available in both C++ and Python.

The IMoELayer is supported on SM100, SM103, and SM110.

Inputs

(index 0) hiddenStates: Activation tensor with shape [B, S, H] (batch size, sequence length, hidden size)
(index 1) selectedExpertsForTokens: Indices of the chosen experts for each token, with shape [B, S, topK]
(index 2) scoresForSelectedExperts: Score of each chosen expert, used to weight that expert's output, with shape [B, S, topK]
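The layer takes the routing result as inputs rather than computing it. As a minimal sketch of how the index 1 and index 2 tensors might be produced, the NumPy example below picks the top-K experts from hypothetical router logits and normalizes the selected logits with a softmax. The function name route_top_k, the router logits, the int32 index type, and the softmax-after-top-K convention are all assumptions made for illustration; they are not part of the IMoELayer API, and models differ in how they normalize scores.

```python
import numpy as np

def route_top_k(router_logits, top_k):
    """Pick top_k experts per token and normalize their scores.

    router_logits: [B, S, numExperts], a hypothetical router output.
    Returns (selected, scores), both with shape [B, S, top_k].
    """
    # Indices of the top_k largest logits along the expert axis, best first.
    selected = np.argsort(-router_logits, axis=-1)[..., :top_k]
    top_logits = np.take_along_axis(router_logits, selected, axis=-1)
    # Softmax over the selected logits only (one common convention, assumed here).
    shifted = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    scores = weights / weights.sum(axis=-1, keepdims=True)
    # The index dtype is an assumption; the source does not state it.
    return selected.astype(np.int32), scores.astype(np.float32)

B, S, H, num_experts, top_k = 2, 3, 8, 4, 2
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((B, S, H)).astype(np.float32)           # [B, S, H]
router_logits = rng.standard_normal((B, S, num_experts)).astype(np.float32)
selected_experts, scores = route_top_k(router_logits, top_k)                # [B, S, topK] each
```

Both returned arrays share the leading [B, S] dimensions of hiddenStates, which is the shape relationship the three inputs above require.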

The following inputs are set using setGatedWeights. Each tensor corresponds to a layer in the expert’s GLU (gated linear unit), where I is the internal hidden size within each expert.

fcGateWeights: GLU gate weights, with shape [numExperts, H, I]
fcUpWeights: GLU up-projection weights, with shape [numExperts, H, I]
fcDownWeights: GLU down-projection weights, with shape [numExperts, I, H]

The following optional inputs are set using setGatedBiases:

fcGateBias: GLU gate biases, with shape [numExperts, I]
fcUpBias: GLU up-projection biases, with shape [numExperts, I]
fcDownBias: GLU down-projection biases, with shape [numExperts, H]
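To make the weight and bias shapes concrete, here is a minimal NumPy sketch of the gated computation for a single expert. It is a reference for the math only, not the TensorRT implementation: the function names are hypothetical, and applying each bias directly after its matrix multiply is an assumption, since the source does not state the exact placement.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def expert_glu(x, w_gate, w_up, w_down,
               b_gate=None, b_up=None, b_down=None, use_silu=True):
    """Gated feed-forward for one expert.

    x: [T, H] tokens routed to this expert
    w_gate, w_up: [H, I]; w_down: [I, H]
    b_gate, b_up: [I]; b_down: [H] (all optional)
    Returns an array of shape [T, H].
    """
    gate = x @ w_gate                    # [T, I]
    up = x @ w_up                        # [T, I]
    if b_gate is not None:
        gate = gate + b_gate             # assumed bias placement
    if b_up is not None:
        up = up + b_up
    if use_silu:                         # activationType SiLU; False models None
        gate = silu(gate)
    mul = gate * up                      # the "mul output", which feeds fcDown
    out = mul @ w_down                   # [T, H]
    if b_down is not None:
        out = out + b_down
    return out

T, H, I, num_experts = 5, 8, 16, 4
rng = np.random.default_rng(0)
fc_gate = rng.standard_normal((num_experts, H, I))   # [numExperts, H, I]
fc_up = rng.standard_normal((num_experts, H, I))     # [numExperts, H, I]
fc_down = rng.standard_normal((num_experts, I, H))   # [numExperts, I, H]
x = rng.standard_normal((T, H))
y = expert_glu(x, fc_gate[0], fc_up[0], fc_down[0])  # expert 0 only
```

Slicing the leading numExperts dimension, as in fc_gate[0], yields the per-expert matrices whose inner dimensions H and I line up with the shapes listed above.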

The following input is set using setQuantizationStatic:

fcDownActivationScale: Quantization scales for the fcDown activation input (the mul output), with shape [] (a single scalar scale) or [numExperts] (one scale per expert)
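The two allowed scale shapes can be illustrated with a short NumPy sketch. It only shows how a scalar or per-expert scale would broadcast over per-expert activations; it does not model rounding to the FP8 grid. The function name, the [numExperts, T, I] activation layout, the divide-by-scale convention, and the E4M3 range of ±448 are assumptions made for illustration.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (assumed format)

def scale_for_fp8(mul_output, scale):
    """Scale and clamp activations ahead of an FP8 cast.

    mul_output: [numExperts, T, I] (hypothetical layout for this sketch)
    scale: shape [] (scalar) or [numExperts]
    """
    scale = np.asarray(scale, dtype=np.float32)
    if scale.ndim == 1:
        # One scale per expert: broadcast over the token and feature axes.
        scale = scale[:, None, None]
    elif scale.ndim != 0:
        raise ValueError("scale must have shape [] or [numExperts]")
    # Assumed convention: quantized value = input / scale, clamped to the FP8 range.
    return np.clip(mul_output / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

mul = np.array([[[1.0, -2.0, 1000.0]],
                [[4.0, 8.0, -4000.0]]])       # [numExperts=2, T=1, I=3]
per_tensor = scale_for_fp8(mul, 2.0)          # scale shape []
per_expert = scale_for_fp8(mul, [1.0, 4.0])   # scale shape [numExperts]
```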

Attributes

activationType: Activation type in the expert GLU. Supported values: None and SiLU. Set via setGatedWeights.
quantizationToType: Type to quantize the mul output (fcDown input) to. Currently only FP8 is supported. Set via setQuantizationStatic.

Note

Additional setters are available for advanced configuration:

setQuantizationDynamicDblQ: Configures double quantization in the dynamic-quantization path after the mul op. Takes the fcDownActivationDblQScale, dataType, blockShape, and dynQOutputScaleType parameters.
setSwigluParams(limit, alpha, beta): Configures SwiGLU activation parameters. Defaults: limit = +inf, alpha = 1.0, beta = 0.0.
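The source does not spell out how limit, alpha, and beta enter the activation. One plausible reading, shown here purely as an assumption, follows a clamped SwiGLU convention used by some open models: the gate is clamped from above at limit, the up projection is clamped to [-limit, limit], alpha scales the sigmoid argument, and beta is added to the up projection. Under that reading, the stated defaults reduce to the ordinary SiLU-gated product.

```python
import numpy as np

def swiglu(gate, up, limit=np.inf, alpha=1.0, beta=0.0):
    """Clamped SwiGLU under an ASSUMED formula (not confirmed by the source).

    out = gate' * sigmoid(alpha * gate') * (up' + beta)
    where gate' = min(gate, limit) and up' = clip(up, -limit, limit).
    """
    gate = np.minimum(gate, limit)
    up = np.clip(up, -limit, limit)
    return gate / (1.0 + np.exp(-alpha * gate)) * (up + beta)

gate = np.array([-2.0, 0.5, 3.0])
up = np.array([1.0, -4.0, 2.0])
default_out = swiglu(gate, up)               # defaults: limit=+inf, alpha=1.0, beta=0.0
clamped_out = swiglu(gate, up, limit=1.0)    # clamps both projections
```

With the defaults, the clamps are no-ops and the result equals SiLU(gate) * up, which is consistent with the default values listed above being neutral settings.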

Output

The IMoELayer produces one output with the same shape as hiddenStates ([B, S, H]), representing the weighted sum of the top-K expert outputs.
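The weighted sum can be written out as a small NumPy reference. This is a sketch of the math only, assuming SiLU activation, no biases, and no quantization; the function name moe_reference and the example sizes are hypothetical, and the actual TensorRT kernels are not implemented this way.

```python
import numpy as np

def moe_reference(hidden, selected, scores, w_gate, w_up, w_down):
    """Weighted sum of top-K expert outputs.

    hidden: [B, S, H]; selected, scores: [B, S, topK]
    w_gate, w_up: [numExperts, H, I]; w_down: [numExperts, I, H]
    Returns an array of shape [B, S, H].
    """
    B, S, H = hidden.shape
    top_k = selected.shape[-1]
    x = hidden.reshape(-1, H)            # flatten tokens to [B*S, H]
    sel = selected.reshape(-1, top_k)
    sc = scores.reshape(-1, top_k)
    out = np.zeros_like(x)
    for e in range(w_gate.shape[0]):
        # Tokens (and the top-K slot) that routed to expert e.
        token_idx, slot_idx = np.nonzero(sel == e)
        if token_idx.size == 0:
            continue
        xe = x[token_idx]
        gate = xe @ w_gate[e]
        up = xe @ w_up[e]
        mul = gate / (1.0 + np.exp(-gate)) * up       # SiLU(gate) * up
        ye = mul @ w_down[e]
        # Accumulate the score-weighted expert output per token.
        np.add.at(out, token_idx, sc[token_idx, slot_idx][:, None] * ye)
    return out.reshape(B, S, H)

B, S, H, I, num_experts, top_k = 2, 3, 4, 6, 3, 2
rng = np.random.default_rng(0)
hidden = rng.standard_normal((B, S, H))
selected = np.argsort(rng.standard_normal((B, S, num_experts)), axis=-1)[..., :top_k]
raw = rng.random((B, S, top_k))
scores = raw / raw.sum(axis=-1, keepdims=True)
w_gate = rng.standard_normal((num_experts, H, I))
w_up = rng.standard_normal((num_experts, H, I))
w_down = rng.standard_normal((num_experts, I, H))
out = moe_reference(hidden, selected, scores, w_gate, w_up, w_down)
```

Looping over experts rather than tokens groups each expert's tokens into a single matrix multiply, which is a common way to organize a reference implementation; the result is the same as summing per token.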

Supported Configuration

Parameter	Requirement
Hardware	SM100, SM103, and SM110
`hiddenStates`	FP16, BF16, FP32, or FP8 static quantization via a DQ layer before the MoE layer.
Weights (unquantized)	FP16, BF16, or FP32
Weights (NVFP4 double quantized)	Weights pass through a DQ layer (with scales from a second DQ), then optionally a `Transpose`, before the MoE layer. Scales must be FP8. Only available when `hiddenStates` uses FP8 static quantization.
Internal mul output (`fcDown` input)	Optional. FP8 static quantization via `setQuantizationStatic`, or FP4 dynamic double quantization via `setQuantizationDynamicDblQ`. When not set, no internal quantization is applied.