Quantization Schemes
INT8 quantization and dequantization operations are defined as follows:
\(x_{q}=quantize\left(x, s \right)=roundWithTiesToEven(clip(\frac{x}{s}, -128,127))\)
\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)
Where:
\({x}\) is a high-precision floating point value to be quantized.
\(x_{q}\) is a quantized INT8 value in the range [-128, 127].
\({s}\) is the quantization scale expressed using a 16-bit or 32-bit floating point scalar.
\({roundWithTiesToEven}\) rounds to the nearest integer; ties are rounded to the nearest even integer. Refer to Rounding.
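The INT8 formulas above can be sketched in plain Python as follows. This is a minimal scalar illustration, not TensorRT code: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the scale is held as a Python float rather than a 16-bit or 32-bit value.

```python
def clip(v, lo, hi):
    """Clamp v to the closed interval [lo, hi]."""
    return max(lo, min(hi, v))


def quantize_int8(x, s):
    """x_q = roundWithTiesToEven(clip(x / s, -128, 127))."""
    # Python's built-in round() rounds ties to the nearest even integer.
    return round(clip(x / s, -128.0, 127.0))


def dequantize_int8(x_q, s):
    """x = x_q * s."""
    return x_q * s


scale = 0.5
values = [0.25, 0.75, 1.3, -100.0, 63.6]
quantized = [quantize_int8(v, scale) for v in values]
restored = [dequantize_int8(q, scale) for q in quantized]

print(quantized)  # [0, 2, 3, -128, 127]
print(restored)   # [0.0, 1.0, 1.5, -64.0, 63.5]
```

The first two inputs land exactly on ties (0.5 and 1.5 after scaling) and round to the even neighbors 0 and 2, while -100.0 and 63.6 saturate at the ends of the INT8 range.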
FP8 quantization and dequantization operations are defined as follows:
\(x_{q}=quantize\left(x, s \right)=castToFp8(clip(\frac{x}{s}, -448,448))\)
\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)
Where:
\({x}\) is a high-precision floating point value to be quantized.
\(x_{q}\) is a quantized FP8E4M3 value in the range [-448, 448].
\({s}\) is the quantization scale expressed using a 16-bit or 32-bit floating point scalar.
\({castToFp8}\) rounds to the nearest value representable in FP8E4M3; ties are rounded to the neighbor whose least significant mantissa bit is zero (ties-to-even). Refer to Rounding.
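The rounding performed by the FP8 cast can be emulated in plain Python as shown below. This is a sketch under stated assumptions: `cast_to_fp8_e4m3` and `quantize_fp8` are hypothetical helper names, the format is taken to have a smallest normal exponent of -6 and 3 mantissa bits, and NaN handling is omitted.

```python
import math


def cast_to_fp8_e4m3(v):
    """Round v (already clipped to [-448, 448]) to the nearest FP8E4M3 value.

    Assumption: NaN inputs are not handled in this sketch.
    """
    if v == 0.0:
        return v
    a = abs(v)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    # Exponents below -6 are subnormal and share the spacing of the smallest
    # normal binade.
    exponent = max(math.frexp(a)[1] - 1, -6)
    # Three mantissa bits give 8 steps per binade.
    ulp = 2.0 ** (exponent - 3)
    # a / ulp is exact (division by a power of two); round() breaks ties to even.
    return math.copysign(round(a / ulp) * ulp, v)


def quantize_fp8(x, s):
    """x_q = castToFp8(clip(x / s, -448, 448))."""
    clipped = max(-448.0, min(448.0, x / s))
    return cast_to_fp8_e4m3(clipped)


def dequantize_fp8(x_q, s):
    """x = x_q * s."""
    return x_q * s


print(quantize_fp8(1000.0, 2.0))  # 448.0 (saturated)
print(quantize_fp8(17.0, 1.0))    # 16.0  (tie between 16 and 18 goes to even)
print(quantize_fp8(19.0, 1.0))    # 20.0  (tie between 18 and 20 goes to even)
print(quantize_fp8(0.3, 1.0))     # 0.3125
```

Between 16 and 32 the representable values are spaced 2 apart, which is why 17.0 and 19.0 are both exact ties.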
MXFP8 is a dynamic per-block quantization scheme. The output type is FP8E4M3, the scale type is E8M0, and the block size is 32. The quantization and dequantization formulas are identical to the FP8 quantization scheme.
When quantizing activations, Dynamic Quantization is required.
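Because E8M0 stores only an exponent, every MXFP8 block scale is a power of two. The sketch below illustrates one way to pick such a scale per block of 32 values. The selection rule (the smallest power of two that keeps the block's largest magnitude within 448 after scaling) and the fallback scale of 1.0 for an all-zero block are assumptions made for illustration; this section does not specify how TensorRT derives the scale. The helper names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0


def e8m0_scale(amax):
    """Smallest power-of-two scale s such that amax / s <= 448 (assumed rule)."""
    if amax == 0.0:
        return 1.0  # assumption: any scale works for an all-zero block
    m, e = math.frexp(amax / FP8_E4M3_MAX)  # ratio = m * 2**e, 0.5 <= m < 1
    exponent = e - 1 if m == 0.5 else e     # exact powers of two need no round-up
    exponent = max(-127, min(127, exponent))  # exponent range of E8M0
    return 2.0 ** exponent


def block_scales(values, block_size=32):
    """Return one power-of-two scale per consecutive block of block_size values."""
    if len(values) % block_size != 0:
        raise ValueError("length must be a multiple of the block size")
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scales.append(e8m0_scale(max(abs(x) for x in block)))
    return scales


# Two blocks of 32: the first peaks at 1.0, the second at 896.0.
data = [i / 32 for i in range(1, 33)] + [896.0] + [3.0] * 31
scales = block_scales(data)
print(scales)  # [0.00390625, 2.0]
```

Each block is then quantized with the FP8 formula using its own scale, so a single large value only affects the 32 elements that share its block.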
INT4 quantization and dequantization operations are defined as follows:
\(x_{q}=quantize\left(x, s \right)=roundWithTiesToEven(clip(\frac{x}{s}, -8,7))\)
\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)
Where:
\({x}\) is a high-precision floating point value to be quantized.
\(x_{q}\) is a quantized INT4 value in the range [-8, 7].
\({s}\) is the block’s quantization scale expressed using a 16-bit or 32-bit floating point value.
\({roundWithTiesToEven}\) rounds to the nearest integer; ties are rounded to the nearest even integer. Refer to Rounding.
INT4 quantization requires per-block scales. The supported block sizes are {64, 128}. The block dimension should be one of the last two dimensions.
TensorRT supports INT4 only for weight quantization (refer to Q/DQ Layer-Placement Recommendations).
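Per-block INT4 quantization can be illustrated with a one-dimensional row of weights split into blocks of 64, each with its own scale. This is a minimal sketch: the function names are hypothetical, the scales are supplied by the caller rather than computed, and the row's single axis is assumed to be the block dimension.

```python
def quantize_int4_blocks(row, scales, block_size=64):
    """Apply x_q = roundWithTiesToEven(clip(x / s, -8, 7)) with one scale per block."""
    if block_size not in (64, 128):
        raise ValueError("supported block sizes are 64 and 128")
    if len(row) != len(scales) * block_size:
        raise ValueError("expected exactly one scale per block")
    quantized = []
    for i, x in enumerate(row):
        s = scales[i // block_size]  # scale of the block that element i belongs to
        quantized.append(round(max(-8.0, min(7.0, x / s))))
    return quantized


def dequantize_int4_blocks(quantized, scales, block_size=64):
    """Apply x = x_q * s with the same per-block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]


# Block 0 covers the full INT4 range exactly; block 1 shows clipping and ties.
block0 = [0.25 * ((i % 16) - 8) for i in range(64)]
block1 = [20.0, 3.0, 5.0, -1.0] + [0.0] * 60
scales = [0.25, 2.0]

q = quantize_int4_blocks(block0 + block1, scales)
deq = dequantize_int4_blocks(q, scales)

print(q[:16])     # [-8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]
print(q[64:68])   # [7, 2, 2, 0]
print(deq[64:68]) # [14.0, 4.0, 4.0, 0.0]
```

In the second block, 20.0 saturates at 7, while 3.0 and 5.0 scale to the ties 1.5 and 2.5 and both round to the even value 2.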
NVFP4 quantization requires per-block scales. The only supported block size is 16. The block dimension should be one of the last two dimensions.
\(x_{q}=quantize\left(x, s \right)=castToFp4(clip(\frac{x}{s}, -6,6))\)
\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)
Where:
\({x}\) is a high-precision floating point value to be quantized.
\(x_{q}\) is a quantized FP4 value in the range [-6, 6].
\({s}\) is the block’s quantization scale expressed using a 16-bit or 32-bit floating point value.
\({castToFp4}\) rounds to the nearest value representable in FP4E2M1; ties are rounded to the neighbor whose mantissa bit is zero (ties-to-even). Refer to Rounding.
When quantizing activations, Dynamic Quantization is required.
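FP4E2M1 has only eight non-negative representable magnitudes, so the cast can be emulated with a small lookup table. The sketch below quantizes one block of 16 values with a caller-supplied scale. The helper names are hypothetical, and the example does not show how the block scale is chosen.

```python
import math

# Non-negative FP4E2M1 values in code order; even indices have a zero mantissa bit.
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def cast_to_fp4(v):
    """Round v (already clipped to [-6, 6]) to the nearest FP4E2M1 value."""
    a = abs(v)
    best = 0
    for i in range(1, len(FP4_E2M1_MAGNITUDES)):
        d_best = abs(a - FP4_E2M1_MAGNITUDES[best])
        d_i = abs(a - FP4_E2M1_MAGNITUDES[i])
        # Prefer the closer candidate; on a tie, prefer the even code.
        if d_i < d_best or (d_i == d_best and i % 2 == 0):
            best = i
    return math.copysign(FP4_E2M1_MAGNITUDES[best], v)


def quantize_nvfp4_block(block, s):
    """Apply x_q = castToFp4(clip(x / s, -6, 6)) to one block of 16 values."""
    if len(block) != 16:
        raise ValueError("NVFP4 uses a block size of 16")
    return [cast_to_fp4(max(-6.0, min(6.0, x / s))) for x in block]


def dequantize_nvfp4_block(quantized, s):
    """Apply x = x_q * s."""
    return [q * s for q in quantized]


block = [0.1, 0.4, 1.25, 2.5, -3.0, 100.0] + [0.0] * 10
q = quantize_nvfp4_block(block, 0.5)
deq = dequantize_nvfp4_block(q, 0.5)

print(q[:6])    # [0.0, 1.0, 2.0, 4.0, -6.0, 6.0]
print(deq[:6])  # [0.0, 0.5, 1.0, 2.0, -3.0, 3.0]
```

With a scale of 0.5, the inputs 1.25 and 2.5 become the ties 2.5 and 5.0, which round to 2.0 and 4.0 respectively, and 100.0 saturates at 6.0.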
| Quantization Schemes | INT8 | FP8 | MXFP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Representation | 8-bit signed 2’s complement | S1E4M3 floating point | S1E4M3 floating point | 4-bit signed 2’s complement | S1E2M1 floating point |
| Weight quantization | Per-tensor/per-axis | Per-tensor/per-axis | Per-block (block size = 32) | Per-block (block sizes = {64, 128}) | Per-block |
| Activation quantization | Per-tensor | Per-tensor | Dynamic, per-block (block size = 32) | No | Dynamic, per-block |
| Explicit quantization | Yes | Yes | Yes | Yes | Yes |
| Scale data type | FP32, FP16, BF16 | FP32, FP16, BF16 | E8M0 | FP32, FP16, BF16 | FP32, FP16, BF16 |