Quantization Schemes

INT8 quantization and dequantization operations are defined as follows:

\(x_{q}=quantize\left(x, s \right)=roundWithTiesToEven(clip(\frac{x}{s}, -128,127))\)

\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)

Where:

  • \({x}\) is a high-precision floating point value to be quantized.

  • \(x_{q}\) is a quantized INT8 value in range [-128,127].

  • \({s}\) is the quantization scale expressed using a 16-bit or 32-bit floating point scalar.

  • \({roundWithTiesToEven}\) rounds to the nearest integer; ties are rounded to the nearest even integer. Refer to Rounding.
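The INT8 formulas above can be sketched in plain Python for a single scalar. This is a minimal illustration, not TensorRT code: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the sketch relies on Python's built-in `round()`, which rounds halfway cases to the nearest even integer.

```python
def quantize_int8(x: float, s: float) -> int:
    """Quantize x with scale s: clip x/s to [-128, 127], then round."""
    clipped = max(-128.0, min(127.0, x / s))
    # Python's built-in round() rounds ties to the nearest even integer.
    return round(clipped)


def dequantize_int8(x_q: int, s: float) -> float:
    """Map an INT8 value back to floating point."""
    return x_q * s


if __name__ == "__main__":
    s = 0.5  # illustrative scale, chosen arbitrarily
    for x in (0.25, 0.75, 1.3, 100.0, -100.0):
        x_q = quantize_int8(x, s)
        print(x, "->", x_q, "->", dequantize_int8(x_q, s))
```

With a scale of 0.5, the inputs 0.25 and 0.75 land exactly on ties (0.5 and 1.5) and round to the even integers 0 and 2, while 100.0 and -100.0 saturate at 127 and -128.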

FP8 quantization and dequantization operations are defined as follows:

\(x_{q}=quantize\left(x, s \right)=castToFp8(clip(\frac{x}{s}, -448,448))\)

\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)

Where:

  • \({x}\) is a high-precision floating point value to be quantized.

  • \(x_{q}\) is a quantized FP8E4M3 value in the range [-448, 448].

  • \({s}\) is the quantization scale expressed using a 16-bit or 32-bit floating point scalar.

  • \({castToFp8}\) rounds to the nearest value representable in FP8E4M3; ties are rounded to the value whose mantissa is even. Refer to Rounding.
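The FP8 formulas above can be emulated with ordinary Python floats. The following is a sketch under stated assumptions rather than TensorRT's implementation: the helper names are hypothetical, and `cast_to_fp8_e4m3` assumes the common FP8E4M3 layout with an exponent bias of 7, a smallest normal exponent of -6, and subnormals spaced at 2^-9.

```python
import math


def cast_to_fp8_e4m3(v: float) -> float:
    """Round v (assumed to lie in [-448, 448]) to the nearest FP8E4M3 value."""
    if v == 0.0:
        return 0.0
    _, e = math.frexp(abs(v))      # abs(v) = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)      # below 2**-6 the format uses subnormals
    step = 2.0 ** (exponent - 3)   # 3 mantissa bits: 8 steps per power of two
    # round() sends ties to the even multiple of step, which is an even mantissa.
    return math.copysign(round(abs(v) / step) * step, v)


def quantize_fp8(x: float, s: float) -> float:
    """Clip x/s to [-448, 448], then cast to FP8E4M3."""
    return cast_to_fp8_e4m3(max(-448.0, min(448.0, x / s)))


def dequantize_fp8(x_q: float, s: float) -> float:
    """Map an FP8E4M3 value back to floating point."""
    return x_q * s
```

For example, `quantize_fp8(1.0625, 1.0)` sits halfway between the representable values 1.0 and 1.125 and resolves to 1.0, whose mantissa is even, while any input above 448 times the scale saturates at 448.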

MXFP8 is a dynamic per-block quantization scheme. The output type is FP8E4M3, the scale type is E8M0, and the block size is 32. The quantization and dequantization formulas are identical to those of the FP8 quantization scheme.

When quantizing activations, Dynamic Quantization is required.
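Because an E8M0 scale carries only an exponent, every MXFP8 block scale is a power of two. The sketch below shows one way such scales could be derived per block of 32 values. The selection policy (the smallest power of two that maps the block's largest magnitude into [-448, 448]) is an assumption made for illustration, and the names `e8m0_scale` and `mxfp8_block_scales` are hypothetical; the text above does not specify how TensorRT computes the scale.

```python
import math

BLOCK_SIZE = 32         # MXFP8 block size
FP8_E4M3_MAX = 448.0    # largest FP8E4M3 magnitude


def e8m0_scale(block):
    """Smallest power-of-two scale that fits the block (assumed policy)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0                           # any scale works for an all-zero block
    m, e = math.frexp(amax / FP8_E4M3_MAX)   # ratio = m * 2**e with 0.5 <= m < 1
    k = e - 1 if m == 0.5 else e             # smallest k with 2**k >= ratio
    k = max(-127, min(127, k))               # exponent range representable in E8M0
    return 2.0 ** k


def mxfp8_block_scales(values):
    """Return one power-of-two scale per consecutive block of 32 values."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of 32")
    return [
        e8m0_scale(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]
```

Each value in a block is then divided by its block's scale and quantized with the FP8 formula. Since the scales depend on the data in each block, they have to be computed at run time for activations.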

INT4 quantization and dequantization operations are defined as follows:

\(x_{q}=quantize\left(x, s \right)=roundWithTiesToEven(clip(\frac{x}{s}, -8,7))\)

\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)

Where:

  • \({x}\) is a high-precision floating point value to be quantized.

  • \(x_{q}\) is a quantized INT4 value in the range [-8, 7].

  • \({s}\) is the block’s quantization scale expressed using a 16-bit or 32-bit floating point scalar.

  • \({roundWithTiesToEven}\) rounds to the nearest integer; ties are rounded to the nearest even integer. Refer to Rounding.

INT4 quantization requires per-block scales. The supported block sizes are {64, 128}. The block dimension should be one of the last two dimensions.

TensorRT supports INT4 only for weight quantization (refer to Q/DQ Layer-Placement Recommendations).
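Per-block INT4 quantization can be sketched for a one-dimensional list of weights as follows. This is an illustration only: the function names are hypothetical, the scales are taken as given inputs because the text does not describe how they are chosen, and a real weight tensor would be blocked along one of its last two dimensions rather than flattened.

```python
SUPPORTED_BLOCK_SIZES = (64, 128)


def quantize_int4_blocks(values, scales, block_size):
    """Quantize values to INT4, using scales[i] for the i-th block."""
    if block_size not in SUPPORTED_BLOCK_SIZES:
        raise ValueError("block size must be 64 or 128")
    if len(values) != block_size * len(scales):
        raise ValueError("need exactly one scale per block")
    quantized = []
    for i, x in enumerate(values):
        s = scales[i // block_size]  # scale of the block containing x
        # Clip to [-8, 7], then round with ties to even (Python's round()).
        quantized.append(round(max(-8.0, min(7.0, x / s))))
    return quantized


def dequantize_int4_blocks(quantized, scales, block_size):
    """Multiply each INT4 value by the scale of its block."""
    return [x_q * scales[i // block_size] for i, x_q in enumerate(quantized)]
```

For instance, with a block size of 64 and scales `[1.0, 0.25]`, a weight of 3.0 in the second block becomes 3.0 / 0.25 = 12, which saturates at 7 and dequantizes to 1.75.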

NVFP4 quantization requires per-block scales. The only supported block size is 16. The block dimension should be one of the last two dimensions. The quantization and dequantization operations are defined as follows:

\(x_{q}=quantize\left(x, s \right)=castToFp4(clip(\frac{x}{s}, -6,6))\)

\({x}=dequantize\left(x_{q}, s\right)=x_{q}\ast s\)

Where:

  • \({x}\) is a high-precision floating point value to be quantized.

  • \(x_{q}\) is a quantized FP4 value in the range [-6, 6].

  • \({s}\) is the block’s quantization scale expressed using a 16-bit or 32-bit floating point scalar.

  • \({castToFp4}\) rounds to the nearest value representable in FP4E2M1; ties are rounded to the value whose mantissa is even. Refer to Rounding.

When quantizing activations, Dynamic Quantization is required.
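FP4E2M1 has only eight non-negative magnitudes, so the cast in the NVFP4 formula can be illustrated with a lookup table. The sketch below is an emulation in Python floats, not TensorRT code. The helper names are hypothetical, the scale is taken as a given input, and the tie-breaking rule assumes that the table index equals the 3-bit exponent/mantissa code, so an even index means an even mantissa bit.

```python
import math

# Magnitudes representable in FP4E2M1, indexed by exponent/mantissa code.
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # the only NVFP4 block size


def cast_to_fp4(v: float) -> float:
    """Round v to the nearest FP4E2M1 value, ties to the even code."""
    mag = abs(v)
    # Order candidates by distance, then by code parity so ties pick the even code.
    code = min(
        range(len(FP4_E2M1_MAGNITUDES)),
        key=lambda i: (abs(FP4_E2M1_MAGNITUDES[i] - mag), i % 2),
    )
    return math.copysign(FP4_E2M1_MAGNITUDES[code], v)


def quantize_nvfp4_block(block, s):
    """Quantize one block of 16 values with the block scale s."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("an NVFP4 block holds exactly 16 values")
    return [cast_to_fp4(max(-6.0, min(6.0, x / s))) for x in block]


def dequantize_nvfp4_block(quantized, s):
    """Multiply each FP4 value by the block scale."""
    return [x_q * s for x_q in quantized]
```

As an example of the tie rule, 5.0 lies halfway between 4.0 and 6.0 and resolves to 4.0, and 2.5 lies halfway between 2.0 and 3.0 and resolves to 2.0.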

Table 11 Quantization Schemes and Precision

| Quantization Schemes | INT8 | FP8 | MXFP8 | INT4 | NVFP4 |
|---|---|---|---|---|---|
| Representation | 8-bit signed 2’s complement | S1E4M3 floating point | S1E4M3 floating point | 4-bit signed 2’s complement | S1E2M1 floating point |
| Weight quantization | Per-tensor/per-axis | Per-tensor/per-axis | Per-block (block size = 32) | Per-block (block sizes = {64, 128}) | Per-block |
| Activation quantization | Per-tensor | Per-tensor | Dynamic, per-block (block size = 32) | No | Dynamic, per-block |
| Explicit quantization | Yes | Yes | Yes | Yes | Yes |
| Scale data type | FP32, FP16, BF16 | FP32, FP16, BF16 | E8M0 | FP32, FP16, BF16 | FP32, FP16, BF16 |