Quantization Schemes

INT8

INT8 quantization and dequantization operations are defined as follows:

\(x_{q}=quantize\left(x, s \right)=roundWithTiesToEven(clip(\frac{x}{s}, -128,127))\)

\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)

Where:

\({x}\) is a high-precision floating point value to be quantized.
\(x_{q}\) is a quantized INT8 value in the range [-128, 127].
\({s}\) is the quantization scale expressed using a 16-bit or 32-bit floating point scalar.
\({roundWithTiesToEven}\) rounds to the nearest integer; ties are rounded to the nearest even integer. Refer to Rounding.
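The INT8 formulas above can be sketched in plain Python. This is a minimal scalar illustration, not TensorRT code: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the sketch relies on the fact that Python's built-in `round()` rounds ties to the nearest even integer.

```python
def quantize_int8(x: float, s: float) -> int:
    """Compute roundWithTiesToEven(clip(x / s, -128, 127))."""
    clipped = max(-128.0, min(127.0, x / s))
    # Python's round() uses round-half-to-even for floats.
    return round(clipped)


def dequantize_int8(x_q: int, s: float) -> float:
    """Compute x_q * s."""
    return x_q * s


if __name__ == "__main__":
    scale = 0.5
    for x in (1.25, 1.75, 100.0, -100.0):
        x_q = quantize_int8(x, scale)
        print(x, "->", x_q, "->", dequantize_int8(x_q, scale))
```

With a scale of 0.5, the inputs 1.25 and 1.75 map to 2.5 and 3.5 before rounding, so they show the tie-breaking rule (2 and 4 respectively), while 100.0 and -100.0 saturate at the ends of the INT8 range.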

FP8

FP8 quantization and dequantization operations are defined as follows:

\(x_{q}=quantize\left(x, s \right)=castToFp8(clip(\frac{x}{s}, -448,448))\)

\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)

Where:

\({x}\) is a high-precision floating point value to be quantized.
\(x_{q}\) is a quantized FP8E4M3 value in the range [-448, 448].
\({s}\) is the quantization scale expressed using a 16-bit or 32-bit floating point scalar.
\({castToFp8}\) rounds to the nearest value representable in FP8E4M3; ties are rounded to the value with an even mantissa. Refer to Rounding.
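The FP8 formulas can be emulated in plain Python to see which values survive the cast. This is a sketch under stated assumptions, not TensorRT code: the names `cast_to_fp8_e4m3` and `quantize_fp8` are hypothetical, the input to the cast is assumed to be finite and already clipped to [-448, 448], and the E4M3 layout is assumed to have 3 mantissa bits and a smallest normal exponent of -6.

```python
import math

FP8_E4M3_MAX = 448.0
MIN_NORMAL_EXP = -6   # assumed smallest normal exponent of E4M3
MANTISSA_BITS = 3


def cast_to_fp8_e4m3(v: float) -> float:
    """Round v to the nearest E4M3 value, ties to even.

    Assumes v is finite and abs(v) <= 448. The result is returned as a
    Python float that is exactly representable in E4M3.
    """
    if v == 0.0:
        return 0.0
    _, e = math.frexp(abs(v))            # abs(v) = m * 2**e, 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)     # clamp so subnormals share one step
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing of E4M3 values near v
    # v / step is exact (power-of-two division); round() breaks ties to even.
    return round(v / step) * step


def quantize_fp8(x: float, s: float) -> float:
    """Compute castToFp8(clip(x / s, -448, 448))."""
    clipped = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / s))
    return cast_to_fp8_e4m3(clipped)


def dequantize_fp8(x_q: float, s: float) -> float:
    """Compute x_q * s."""
    return x_q * s
```

Near 1.0 the spacing between E4M3 values is 0.125, so 1.0625 lies exactly halfway between 1.0 and 1.125 and is rounded to 1.0, the neighbor with the even mantissa.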

MXFP8

MXFP8 is a dynamic per-block quantization scheme. The output type is FP8E4M3, the scale type is E8M0, and the block size is 32. The quantization and dequantization formulas are identical to those of the FP8 quantization scheme.

When quantizing activations, Dynamic Quantization is required.
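Because an E8M0 scale has only exponent bits, every block scale is a power of two. The sketch below shows one way to derive such scales for blocks of 32 values. It is an illustration only: the function name `mx_block_scales` is hypothetical, and the selection rule (the smallest power of two that keeps the block's largest magnitude within 448 after division) is an assumption, not necessarily the rule TensorRT applies. Clamping the exponent to the E8M0 range is omitted.

```python
import math


def mx_block_scales(values, block_size=32, q_max=448.0):
    """Return one power-of-two scale per block of `block_size` values.

    Assumed rule: pick the smallest 2**k with max(abs(block)) / 2**k <= q_max.
    """
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            scales.append(1.0)  # any scale works for an all-zero block
            continue
        exponent = math.ceil(math.log2(amax / q_max))
        scales.append(2.0 ** exponent)
    return scales


if __name__ == "__main__":
    data = [1000.0] * 32 + [1.0] * 32
    print(mx_block_scales(data))
```

Each element of a block would then be quantized with the FP8 formula using its block's scale in place of \({s}\).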

INT4

INT4 quantization and dequantization operations are defined as follows:

\(x_{q}=quantize\left(x, s \right)=roundWithTiesToEven(clip(\frac{x}{s}, -8,7))\)

\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)

Where:

\({x}\) is a high-precision floating point value to be quantized.
\(x_{q}\) is a quantized INT4 value in the range [-8, 7].
\({s}\) is the block’s quantization scale expressed using a 16-bit or 32-bit floating point scalar.
\({roundWithTiesToEven}\) rounds to the nearest integer; ties are rounded to the nearest even integer. Refer to Rounding.

INT4 quantization requires per-block scales. The supported block sizes are {64, 128}. The block dimension should be one of the last two dimensions.
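Per-block INT4 quantization can be sketched on a flat list of weights as follows. This is a minimal illustration, not TensorRT code: the function names are hypothetical, and the choice of scale (the block's largest magnitude divided by 7) is an assumption made for the example, since the text above does not specify how scales are computed.

```python
def quantize_int4_blockwise(values, block_size=64):
    """Quantize a flat list to INT4 with one scale per block."""
    if block_size not in (64, 128):
        raise ValueError("supported block sizes are 64 and 128")
    if len(values) % block_size != 0:
        raise ValueError("length must be a multiple of the block size")
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        s = amax / 7.0 if amax > 0.0 else 1.0  # assumed scale rule
        scales.append(s)
        # roundWithTiesToEven(clip(x / s, -8, 7)); round() breaks ties to even
        quantized.extend(round(max(-8.0, min(7.0, v / s))) for v in block)
    return quantized, scales


def dequantize_int4_blockwise(quantized, scales, block_size=64):
    """Multiply each quantized value by its block's scale."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]
```

Giving each block its own scale means a block of small weights is not forced onto the coarse grid dictated by a large weight elsewhere in the tensor.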

TensorRT supports INT4 only for weight quantization (refer to Q/DQ Layer-Placement Recommendations).

NVFP4

NVFP4 quantization requires per-block scales. The only supported block size is 16. The block dimension should be one of the last two dimensions.

\(x_{q}=quantize\left(x, s \right)=castToFp4(clip(\frac{x}{s}, -6,6))\)

\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)

Where:

\({x}\) is a high-precision floating point value to be quantized.
\(x_{q}\) is a quantized FP4 value in the range [-6, 6].
\({s}\) is the block’s quantization scale expressed using a 16-bit or 32-bit floating point scalar.
\({castToFp4}\) rounds to the nearest value representable in FP4E2M1; ties are rounded to the value with an even mantissa. Refer to Rounding.
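FP4E2M1 has only eight non-negative magnitudes, so the cast can be illustrated with a lookup over that grid. This is a sketch, not TensorRT code: the names `cast_to_fp4_e2m1` and `quantize_nvfp4` are hypothetical, and the input to the cast is assumed to be finite and already clipped to [-6, 6].

```python
import math

# Non-negative E2M1 magnitudes in increasing order. Entries at even
# indices have a mantissa bit of 0, which is what "ties to even" selects.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def cast_to_fp4_e2m1(v: float) -> float:
    """Round v to the nearest E2M1 value, ties to the even mantissa."""
    mag = abs(v)
    best = 0
    for i in range(1, len(FP4_E2M1_VALUES)):
        d_new = abs(mag - FP4_E2M1_VALUES[i])
        d_best = abs(mag - FP4_E2M1_VALUES[best])
        if d_new < d_best or (d_new == d_best and i % 2 == 0):
            best = i
    return math.copysign(FP4_E2M1_VALUES[best], v)


def quantize_nvfp4(x: float, s: float) -> float:
    """Compute castToFp4(clip(x / s, -6, 6))."""
    clipped = max(-6.0, min(6.0, x / s))
    return cast_to_fp4_e2m1(clipped)


def dequantize_nvfp4(x_q: float, s: float) -> float:
    """Compute x_q * s."""
    return x_q * s
```

For example, 2.5 is halfway between 2.0 and 3.0 and is rounded to 2.0, while 3.5 is halfway between 3.0 and 4.0 and is rounded to 4.0.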

When quantizing activations, Dynamic Quantization is required.

Table 11 Quantization Schemes and Precision
Quantization Schemes	INT8	FP8	MXFP8	INT4	NVFP4
Representation	8-bit signed 2’s complement	S1E4M3 floating point	S1E4M3 floating point	4-bit signed 2’s complement	S1E2M1 floating point
Weight quantization	Per-tensor/per-axis	Per-tensor/per-axis	Per-block (block size = 32)	Per-block (block sizes = {64, 128})	Per-block (block size = 16)
Activation quantization	Per-tensor	Per-tensor	Dynamic, per-block (block size = 32)	No	Dynamic, per-block (block size = 16)
Explicit quantization	Yes	Yes	Yes	Yes	Yes
Scale data type	FP32, FP16, BF16	FP32, FP16, BF16	E8M0	FP32, FP16, BF16	FP32, FP16, BF16