MoE (Mixture of Experts)

TensorRT includes built-in support for MoE (Mixture of Experts) in transformer models using the IMoELayer (C++, Python) API.

The IMoELayer is supported on the SM100, SM103, and SM110 GPU architectures.

Inputs

  • (index 0) hiddenStates: Activation tensor with shape [B, S, H], where B is the batch size, S is the sequence length, and H is the hidden size

  • (index 1) selectedExpertsForTokens: Indexes of chosen experts per token, with shape [B, S, topK]

  • (index 2) scoresForSelectedExperts: Scores per chosen expert for the weighted output, with shape [B, S, topK]
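The source does not say how the two routing inputs are produced. As a minimal sketch, assuming a hypothetical router that emits one logit per expert and a softmax taken over only the selected experts (one common convention, not something this page confirms), the three inputs could be prepared in NumPy as follows. The names route_tokens and router_logits are illustrative only.

```python
import numpy as np

def route_tokens(router_logits, top_k):
    """Pick top_k experts per token and normalize their scores.

    router_logits: [B, S, numExperts], output of a hypothetical router.
    Returns (selected_experts [B, S, top_k], scores [B, S, top_k]).
    """
    # Expert indices sorted by descending logit, truncated to top_k.
    selected = np.argsort(-router_logits, axis=-1)[..., :top_k]
    top_logits = np.take_along_axis(router_logits, selected, axis=-1)
    # Assumption: softmax over the selected experts only.
    shifted = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    scores = (weights / weights.sum(axis=-1, keepdims=True)).astype(np.float32)
    return selected, scores

B, S, H, num_experts, top_k = 2, 3, 8, 4, 2
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((B, S, H)).astype(np.float32)        # [B, S, H]
router_logits = rng.standard_normal((B, S, num_experts)).astype(np.float32)
selected_experts, scores = route_tokens(router_logits, top_k)            # [B, S, topK] each
```

Both routing tensors share the leading [B, S] dimensions of hiddenStates, so each token carries its own list of topK experts and matching scores.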

The following inputs are set using setGatedWeights. Each tensor corresponds to a layer in the expert’s GLU (gated linear unit), where I is the internal hidden size within each expert.

  • fcGateWeights: GLU gate weights, with shape [numExperts, H, I]

  • fcUpWeights: GLU up-projection weights, with shape [numExperts, H, I]

  • fcDownWeights: GLU down-projection weights, with shape [numExperts, I, H]

The following optional inputs are set using setGatedBiases:

  • fcGateBias: GLU gate biases, with shape [numExperts, I]

  • fcUpBias: GLU up-projection biases, with shape [numExperts, I]

  • fcDownBias: GLU down-projection biases, with shape [numExperts, H]
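To make the weight and bias shapes concrete, here is a reference sketch of one expert's GLU in NumPy. It assumes the activation is applied to the gate branch before the element-wise multiply, which is the usual GLU layout but is not spelled out above. The function expert_glu is a hypothetical helper, not part of the TensorRT API.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def expert_glu(x, w_gate, w_up, w_down,
               b_gate=None, b_up=None, b_down=None, activation=silu):
    """Apply one expert to tokens x.

    x: [T, H]; w_gate, w_up: [H, I]; w_down: [I, H]
    b_gate, b_up: [I]; b_down: [H] (all biases optional)
    """
    gate = x @ w_gate                      # [T, I]
    up = x @ w_up                          # [T, I]
    if b_gate is not None:
        gate = gate + b_gate
    if b_up is not None:
        up = up + b_up
    if activation is not None:             # activationType None skips this step
        gate = activation(gate)            # assumption: activation on the gate branch
    mul_out = gate * up                    # the "mul output" that feeds fcDown
    out = mul_out @ w_down                 # [T, H]
    if b_down is not None:
        out = out + b_down
    return out

num_experts, H, I, T = 4, 8, 16, 5
rng = np.random.default_rng(0)
fc_gate_weights = rng.standard_normal((num_experts, H, I)).astype(np.float32)
fc_up_weights = rng.standard_normal((num_experts, H, I)).astype(np.float32)
fc_down_weights = rng.standard_normal((num_experts, I, H)).astype(np.float32)
fc_down_bias = rng.standard_normal((num_experts, H)).astype(np.float32)

tokens = rng.standard_normal((T, H)).astype(np.float32)
e = 2  # index into the leading numExperts dimension
expert_out = expert_glu(tokens, fc_gate_weights[e], fc_up_weights[e],
                        fc_down_weights[e], b_down=fc_down_bias[e])
```

Indexing the leading numExperts dimension selects a single expert's matrices, so the stacked tensors hold every expert's parameters in one array per projection.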

The following input is set using setQuantizationStatic:

  • fcDownActivationScale: Quantization scales for the fcDown activation input (the mul output), with shape [] (a single scale) or [numExperts] (one scale per expert)
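The two allowed scale shapes differ only in how the scale broadcasts over experts. The sketch below simulates that in NumPy under stated assumptions: the mul output is laid out as [numExperts, T, I] (a hypothetical layout chosen for illustration), quantization divides by the scale, and values saturate at the FP8 E4M3 maximum of 448. Rounding to the FP8 grid is deliberately left out, so this shows scaling and clamping only.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fake_quantize_mul_output(mul_out, scale):
    """Scale, clamp, and rescale the mul output.

    mul_out: [numExperts, T, I] (hypothetical layout)
    scale:   shape [] (shared) or [numExperts] (per expert)
    """
    scale = np.asarray(scale, dtype=np.float32)
    if scale.ndim == 1:
        scale = scale[:, None, None]       # broadcast one scale per expert
    elif scale.ndim != 0:
        raise ValueError("scale must have shape [] or [numExperts]")
    # Assumption: quantize as x / scale; rounding to FP8 is omitted here.
    q = np.clip(mul_out / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale                       # dequantized value seen by fcDown

mul_out = np.full((2, 1, 3), 1000.0, dtype=np.float32)
shared = fake_quantize_mul_output(mul_out, 1.0)             # one scale for all experts
per_expert = fake_quantize_mul_output(mul_out, [1.0, 2.0])  # one scale per expert
```

With a per-expert scale, an expert whose activations have a larger range can use a larger scale and saturate later than it would under a single shared scale.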

Attributes

  • activationType: Activation type in the expert GLU. Supported values: None and SiLU. Set via setGatedWeights.

  • quantizationToType: Data type to which the mul output (the fcDown input) is quantized. Currently only FP8 is supported. Set via setQuantizationStatic.

Note

Additional setters are available for advanced configuration:

  • setQuantizationDynamicDblQ: Configures double quantization in the dynamic-quantization path after the mul op. Takes the fcDownActivationDblQScale, dataType, blockShape, and dynQOutputScaleType parameters.

  • setSwigluParams(limit, alpha, beta): Configures SwiGLU activation parameters. Defaults: limit = +inf, alpha = 1.0, beta = 0.0.
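The source lists the SwiGLU parameters and their defaults but not the formula they enter. The sketch below assumes the clamped SwiGLU variant used by some recent MoE models, where limit clamps both branches, alpha scales the sigmoid argument, and beta is added to the up branch. Treat the formula as an assumption to check against the API reference. It is at least consistent with the defaults, which reduce it to SiLU(gate) * up.

```python
import math
import numpy as np

def swiglu(gate, up, limit=math.inf, alpha=1.0, beta=0.0):
    """Clamped SwiGLU (assumed formula, not confirmed by the source).

    gate, up: arrays of the same shape, the outputs of fcGate and fcUp.
    """
    gate = np.minimum(gate, limit)              # clamp gate from above only
    up = np.clip(up, -limit, limit)             # clamp up on both sides
    sigmoid = 1.0 / (1.0 + np.exp(-alpha * gate))
    return gate * sigmoid * (up + beta)

gate = np.array([-2.0, 0.5, 10.0])
up = np.array([1.0, -3.0, 10.0])
default_out = swiglu(gate, up)                  # limit=+inf, alpha=1.0, beta=0.0
clamped_out = swiglu(gate, up, limit=7.0, alpha=1.702, beta=1.0)
```

Under this reading, a finite limit bounds the magnitude of the mul output, which can matter when that output is subsequently quantized.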

Output

The IMoELayer produces one output with the same shape as hiddenStates ([B, S, H]), representing the weighted sum of the top-K expert outputs.
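As a reference for what "weighted sum of the top-K expert outputs" means, the following NumPy sketch computes the output token by token. It is an unoptimized illustration, not how TensorRT implements the layer. It assumes a SiLU gate activation, no biases, and no quantization. The function moe_reference is a hypothetical name.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_reference(hidden_states, selected_experts, scores, w_gate, w_up, w_down):
    """hidden_states: [B, S, H]; selected_experts, scores: [B, S, topK]
    w_gate, w_up: [numExperts, H, I]; w_down: [numExperts, I, H]
    Returns an array of shape [B, S, H].
    """
    B, S, _ = hidden_states.shape
    top_k = selected_experts.shape[-1]
    output = np.zeros_like(hidden_states)
    for b in range(B):
        for s in range(S):
            x = hidden_states[b, s]                           # [H]
            for k in range(top_k):
                e = selected_experts[b, s, k]                 # expert chosen for this slot
                mul_out = silu(x @ w_gate[e]) * (x @ w_up[e]) # [I]
                # Accumulate this expert's output, weighted by its score.
                output[b, s] += scores[b, s, k] * (mul_out @ w_down[e])
    return output

B, S, H, I, num_experts, top_k = 2, 3, 8, 16, 4, 2
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((B, S, H)).astype(np.float32)
selected_experts = rng.integers(0, num_experts, size=(B, S, top_k))
scores = np.full((B, S, top_k), 1.0 / top_k, dtype=np.float32)
w_gate = rng.standard_normal((num_experts, H, I)).astype(np.float32)
w_up = rng.standard_normal((num_experts, H, I)).astype(np.float32)
w_down = rng.standard_normal((num_experts, I, H)).astype(np.float32)

output = moe_reference(hidden_states, selected_experts, scores, w_gate, w_up, w_down)
```

A reference like this can be useful for checking an engine's output on small shapes, within the tolerance expected for the chosen precision.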

Supported Configuration

| Parameter | Requirement |
| --- | --- |
| Hardware | SM100, SM103, and SM110 |
| hiddenStates | FP16, BF16, FP32, or FP8 static quantization via a DQ layer before the MoE layer. |
| Weights (unquantized) | FP16, BF16, or FP32 |
| Weights (NVFP4 double quantized) | Weights pass through a DQ layer (with scales from a second DQ), then optionally a Transpose, before the MoE layer. Scales must be FP8. Only available when hiddenStates uses FP8 static quantization. |
| Internal mul output (fcDown input) | Optional. FP8 static quantization via setQuantizationStatic, or FP4 dynamic double quantization via setQuantizationDynamicDblQ. When not set, no internal quantization is applied. |
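The requirements above can be encoded as a small pre-flight check. This is only a sketch: MoeConfig, its field names, and the string labels are hypothetical conveniences for illustration and do not correspond to any TensorRT type. The rules are limited to what the supported-configuration list states.

```python
from dataclasses import dataclass
from typing import List, Optional

SUPPORTED_SM = {100, 103, 110}
HIDDEN_STATE_TYPES = {"FP16", "BF16", "FP32", "FP8"}
WEIGHT_TYPES = {"FP16", "BF16", "FP32", "NVFP4"}
MUL_OUTPUT_QUANT = {None, "FP8_STATIC", "FP4_DYNAMIC_DBLQ"}

@dataclass
class MoeConfig:  # hypothetical container, not a TensorRT class
    sm: int
    hidden_states_type: str
    weights_type: str
    mul_output_quant: Optional[str] = None  # None means no internal quantization

def check_moe_config(cfg: MoeConfig) -> List[str]:
    """Return a list of violated requirements (empty if none are found)."""
    problems = []
    if cfg.sm not in SUPPORTED_SM:
        problems.append(f"SM{cfg.sm} is not a supported architecture")
    if cfg.hidden_states_type not in HIDDEN_STATE_TYPES:
        problems.append(f"unsupported hiddenStates type {cfg.hidden_states_type}")
    if cfg.weights_type not in WEIGHT_TYPES:
        problems.append(f"unsupported weights type {cfg.weights_type}")
    # NVFP4 weights are only available with FP8 statically quantized hiddenStates.
    if cfg.weights_type == "NVFP4" and cfg.hidden_states_type != "FP8":
        problems.append("NVFP4 weights require FP8 hiddenStates")
    if cfg.mul_output_quant not in MUL_OUTPUT_QUANT:
        problems.append(f"unsupported mul output quantization {cfg.mul_output_quant}")
    return problems

ok = check_moe_config(MoeConfig(100, "FP8", "NVFP4", "FP8_STATIC"))
bad = check_moe_config(MoeConfig(90, "FP16", "NVFP4"))
```

Here ok is empty, while bad reports both the unsupported architecture and the NVFP4 weights paired with non-FP8 hiddenStates.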