MoE (Mixture of Experts)
TensorRT includes built-in support for MoE (Mixture of Experts) in transformer models using the IMoELayer (C++, Python) API.
The IMoELayer is supported on SM100, SM103, and SM110.
Inputs
(index 0) hiddenStates: Activation tensor with shape [B, S, H]
(index 1) selectedExpertsForTokens: Indexes of chosen experts per token, with shape [B, S, topK]
(index 2) scoresForSelectedExperts: Scores per chosen expert for the weighted output, with shape [B, S, topK]
The following inputs are set using setGatedWeights. Each tensor corresponds to a layer in the expert’s GLU (gated linear unit), where I is the internal hidden size within each expert.
fcGateWeights: GLU gate weights, with shape [numExperts, H, I]
fcUpWeights: GLU up-projection weights, with shape [numExperts, H, I]
fcDownWeights: GLU down-projection weights, with shape [numExperts, I, H]
The following optional inputs are set using setGatedBiases:
fcGateBias: GLU gate biases, with shape [numExperts, I]
fcUpBias: GLU up-projection biases, with shape [numExperts, I]
fcDownBias: GLU down-projection biases, with shape [numExperts, H]
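The shapes listed above can be collected into a small consistency check before building the layer. The following is a minimal sketch in plain Python; `expected_moe_shapes` and `check_shapes` are hypothetical helper names for illustration and are not part of the TensorRT API.

```python
def expected_moe_shapes(B, S, H, I, num_experts, top_k, with_biases=False):
    """Return the documented tensor shapes as a name -> tuple mapping."""
    shapes = {
        "hiddenStates": (B, S, H),
        "selectedExpertsForTokens": (B, S, top_k),
        "scoresForSelectedExperts": (B, S, top_k),
        "fcGateWeights": (num_experts, H, I),
        "fcUpWeights": (num_experts, H, I),
        "fcDownWeights": (num_experts, I, H),
    }
    if with_biases:
        # The biases are optional, so they are only checked on request.
        shapes.update({
            "fcGateBias": (num_experts, I),
            "fcUpBias": (num_experts, I),
            "fcDownBias": (num_experts, H),
        })
    return shapes


def check_shapes(actual, expected):
    """Return the names whose actual shape is missing or differs from the expected one."""
    return [name for name, shape in expected.items()
            if tuple(actual.get(name, ())) != shape]
```

Passing the shapes of the tensors you intend to feed to the layer as `actual` returns an empty list when everything matches, and otherwise the names of the mismatched tensors.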
The following input is set using setQuantizationStatic:
fcDownActivationScale: Quantization scales for the fcDown activation input (mul output), with shape [] or [numExperts]
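To illustrate the two allowed scale shapes, the sketch below picks a shared scalar scale (shape []) or a per-expert scale (shape [numExperts]) and applies the scale-and-saturate part of FP8 quantization. It assumes the usual convention of dividing by the scale and that FP8 means the E4M3 format, whose largest finite value is 448. Rounding to the FP8 grid is deliberately omitted, and the helper names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def scale_for_expert(fc_down_activation_scale, expert_index):
    """Pick the scale for one expert.

    A plain number models shape [] (one scale shared by all experts);
    a sequence models shape [numExperts] (one scale per expert).
    """
    if isinstance(fc_down_activation_scale, (int, float)):
        return float(fc_down_activation_scale)
    return float(fc_down_activation_scale[expert_index])


def scale_and_clamp(mul_output, scale):
    """Divide by the scale and saturate to the FP8 E4M3 range.

    Assumption: quantized = clamp(x / scale). The final rounding to the
    nearest representable FP8 value is not modeled here.
    """
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in mul_output]
```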
Attributes
activationType: Activation type in the expert GLU. Supported values: None and SiLU. Set via setGatedWeights.
quantizationToType: Type to quantize the mul output (fcDown input) to. Currently only FP8 is supported. Set via setQuantizationStatic.
Note
Additional setters are available for advanced configuration:
setQuantizationDynamicDblQ: Configures double quantization in the dynamic-quantization path after the mul op. Takes the fcDownActivationDblQScale, dataType, blockShape, and dynQOutputScaleType parameters.
setSwigluParams(limit, alpha, beta): Configures SwiGLU activation parameters. Defaults: limit = +inf, alpha = 1.0, beta = 0.0.
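This section does not state the formula that limit, alpha, and beta feed into. The sketch below assumes the clamped SwiGLU form used by some recent MoE models, in which limit clamps the gate and up projections, alpha scales the sigmoid argument, and beta is added to the up projection. Treat it as one plausible reading, not as the layer's definition. Under this assumption the documented defaults reduce to a plain SiLU-gated GLU.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swiglu(gate, up, limit=math.inf, alpha=1.0, beta=0.0):
    """Assumed form: clamp(gate) * sigmoid(alpha * clamp(gate)) * (clamp(up) + beta)."""
    gate = min(gate, limit)            # gate is clamped from above only (assumption)
    up = max(-limit, min(up, limit))   # up is clamped to [-limit, limit] (assumption)
    return gate * sigmoid(alpha * gate) * (up + beta)
```

With the defaults, `swiglu(g, u)` equals `g * sigmoid(g) * u`, that is, SiLU applied to the gate and multiplied by the up projection.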
Output
The IMoELayer produces one output with the same shape as hiddenStates ([B, S, H]), representing the weighted sum of the top-K expert outputs.
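As a reference for the shapes and the weighted sum, the computation can be sketched in plain Python on nested lists. This is an illustrative model, not the TensorRT implementation: it assumes each expert computes a standard GLU (SiLU of the gate projection, multiplied by the up projection, then down-projected), leaves out biases and quantization, and uses hypothetical helper names.

```python
import math


def silu(x):
    """SiLU activation x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def matvec(x, w):
    """Compute x @ w for a vector x of length n and a matrix w of shape [n, m]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def expert_forward(x, gate_w, up_w, down_w):
    """One expert's GLU: x is [H], gate_w and up_w are [H, I], down_w is [I, H]."""
    gate = [silu(g) for g in matvec(x, gate_w)]   # [I]
    up = matvec(x, up_w)                          # [I]
    mul = [g * u for g, u in zip(gate, up)]       # the mul output (fcDown input)
    return matvec(mul, down_w)                    # [H]


def moe_forward(hidden, selected, scores, gate_w, up_w, down_w):
    """hidden: [B, S, H]; selected and scores: [B, S, topK]; returns [B, S, H]."""
    out = []
    for b in range(len(hidden)):
        out_b = []
        for s in range(len(hidden[b])):
            x = hidden[b][s]
            acc = [0.0] * len(x)
            # Weighted sum over the top-K experts chosen for this token.
            for e, score in zip(selected[b][s], scores[b][s]):
                y = expert_forward(x, gate_w[e], up_w[e], down_w[e])
                acc = [a + score * v for a, v in zip(acc, y)]
            out_b.append(acc)
        out.append(out_b)
    return out
```

The weight arguments are indexed by expert first, matching the leading numExperts dimension of fcGateWeights, fcUpWeights, and fcDownWeights.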
Supported Configuration
| Parameter | Requirement |
|---|---|
| Hardware | SM100, SM103, and SM110 |
| | FP16, BF16, FP32, or FP8 static quantization via a DQ layer before the MoE layer. |
| Weights (unquantized) | FP16, BF16, or FP32 |
| Weights (NVFP4 double quantized) | Weights pass through a DQ layer (with scales from a second DQ), then optionally a |
| Internal mul output ( | Optional. FP8 static quantization via |