Quantization Workflows
TensorRT supports both post-training quantization (PTQ) and quantization-aware training (QAT) workflows. These workflows enable you to optimize models for low-precision data types.
The quantization process uses per-tensor, per-channel, or block-wise scaling. The scaling method depends on the layer and data type, which helps preserve model accuracy during conversion.
Post-Training Quantization (PTQ):
- Quantizes a pre-trained model without retraining.
- Requires a representative calibration dataset to compute quantization parameters.
- Practical when retraining is infeasible due to resource limitations or data privacy concerns.
- Can lead to accuracy degradation for complex models or sensitive layers.
Quantization-Aware Training (QAT):
- Simulates quantization during training by quantizing weights and activations.
- Training actively compensates for quantization and dequantization effects.
- Generally recovers more accuracy than PTQ.
- More time-consuming, and requires access to the entire labeled training dataset.
The TensorRT Model Optimizer is a Python toolkit designed to help you create QAT models. These models are fully compatible with TensorRT’s optimization and deployment workflows.
The toolkit also provides a PTQ recipe that lets you perform PTQ on models developed in both PyTorch and ONNX formats, streamlining the quantization process across different frameworks.
Explicit Quantization Overview
A TensorRT network is explicitly quantized when it contains Quantize and Dequantize layers (Q/DQ for short), which mark exactly where conversions to and from a quantized type occur. The optimizer performs only the conversions dictated by the semantics of the model.
In explicitly quantized networks, the quantization and dequantization operations are represented by IQuantizeLayer (C++, Python) and IDequantizeLayer (C++, Python) nodes in the graph. These will henceforth be referred to as Q/DQ nodes. IDynamicQuantizeLayer extends this API for Dynamic Quantization workflows where scales are computed at inference time.
ONNX uses an explicitly quantized representation: when a model in PyTorch or TensorFlow is exported to ONNX, each fake-quantization operation in the framework’s graph is exported as a Quantize node followed by a Dequantize node. Since TensorRT preserves the semantics of these layers, users can expect accuracy that is very close to that seen in the deep learning framework. While optimizations preserve the arithmetic semantics of quantization and dequantization operators, they can change the order of floating-point operations in the model, so results will not be bitwise identical.
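As a rough illustration of what a Q/DQ pair computes, the following sketch applies a quantize step followed by a dequantize step to a few values in plain Python. The function name `fake_quantize`, the INT8 range, and the scale value are illustrative assumptions and are not part of the TensorRT API.

```python
def fake_quantize(values, scale, qmin=-128, qmax=127):
    """Apply a Quantize step followed by a Dequantize step (a Q/DQ pair)."""
    # Q: divide by the scale, round (Python's round() rounds ties to even), clamp.
    quantized = [max(qmin, min(qmax, round(v / scale))) for v in values]
    # DQ: map the integers back to floating point with the same scale.
    return [q * scale for q in quantized]


# Hypothetical per-tensor scale; a real scale comes from calibration or QAT.
scale = 2.0 ** -7
activations = [0.1, -0.26, 0.5, 3.0]
result = fake_quantize(activations, scale)
print(result)
```

Values that fall on the quantization grid (0.5 here) survive unchanged, values between grid points move to the nearest one, and values beyond the representable range (3.0 here) saturate at the largest representable value.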
Key Capabilities:
Broad Data Type Support: Supports INT8, FP8, INT4, and FP4 quantized data types.
Precise Placement Control: Q/DQ nodes specify exactly where conversions to and from quantized types occur, so the optimizer applies only the conversions dictated by the model’s semantics.
Compatibility with ONNX Export: Performing QAT or PTQ in a deep learning framework and exporting to ONNX naturally produces an explicitly quantized model, because ONNX represents fake-quantization operations as Q/DQ pairs.
Calibration is performed before ONNX export (for example, during PTQ or QAT with Model Optimizer), and the resulting scales are embedded directly into the Q/DQ nodes.
Quantization Granularities
Quantization granularity refers to how quantization scale factors are applied across a model’s tensors. Selecting the appropriate granularity is a direct lever for balancing the benefits of quantization (such as memory reduction) against its potential drawbacks (such as accuracy loss). A more granular approach increases potential accuracy but also increases the computational and memory overhead of managing multiple scaling factors. TensorRT supports three quantization scale granularities:
Per-tensor quantization: a single scale value (scalar) is used to scale the entire tensor.
Per-channel quantization: a scale vector is broadcast along the given axis. For convolutional neural networks, this is typically the channel axis.
Block quantization: the tensor is divided into fixed-size blocks, and a scale factor is defined for each block.
IQuantizeLayer and IDequantizeLayer support 1D block shapes (along a single dimension). For 2D block shapes (along the last two dimensions), use IDynamicQuantizeLayer.
When using per-channel quantization with Convolutions, the quantization axis must be the output-channel axis. For example, when the weights of 2D convolution are described using KCRS notation, K is the output-channel axis, and the weights quantization can be described as:
For each k in K:
    For each c in C:
        For each r in R:
            For each s in S:
                output[k,c,r,s] := clamp(round(input[k,c,r,s] / scale[k]))
The scale is a vector of coefficients and must have the same size as the quantization axis.
Dequantization is performed similarly except for the pointwise operation that is defined as:
output[k,c,r,s] := input[k,c,r,s] * scale[k]
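The per-channel loops above can be sketched in plain Python using nested lists in KCRS order. The function names, the INT8 range, and the rule used to pick the scales (largest magnitude per channel mapped to 127) are assumptions made for this illustration only.

```python
def quantize_per_channel(weights, scales, qmin=-128, qmax=127):
    """Quantize KCRS weights (nested lists) with one scale per output channel k."""
    if len(scales) != len(weights):
        raise ValueError("scale must have one entry per output channel")
    return [
        [[[max(qmin, min(qmax, round(w / scales[k]))) for w in row]
          for row in plane]
         for plane in kernel]
        for k, kernel in enumerate(weights)
    ]


def dequantize_per_channel(quantized, scales):
    """Multiply every element of output channel k by scales[k]."""
    return [
        [[[q * scales[k] for q in row] for row in plane] for plane in kernel]
        for k, kernel in enumerate(quantized)
    ]


# K=2 output channels, C=1, R=1, S=3.
weights = [[[[0.5, -1.0, 0.25]]],
           [[[4.0, -8.0, 2.0]]]]
# Assumed calibration rule: map the largest magnitude in each channel to 127.
scales = [max(abs(w) for plane in kernel for row in plane for w in row) / 127
          for kernel in weights]
quantized = quantize_per_channel(weights, scales)
restored = dequantize_per_channel(quantized, scales)
```

Because each output channel has its own scale, the small-valued first channel and the large-valued second channel both use the full integer range, and the round-trip error of every element stays within half of its channel's scale.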
Block Quantization
In block quantization, elements are grouped into blocks, with all elements in a block sharing a common scale factor. Block quantization is supported for inputs of up to 3 dimensions.
INT4 block quantization supports weight-only quantization (WoQ).
FP4 block quantization supports both weights and activations. To minimize quantization error, use Dynamic Quantization for activations.
When using block quantization, the scale tensor dimensions equal the data tensor dimensions except for one dimension over which blocking is performed (the blocking axis). For example, given a 2-D RS weights input, R (dimension 0) as the blocking axis and B as the block size, the scale in the blocking axis is repeated according to the block size and can be described like this:
For each r in R:
    For each s in S:
        output[r,s] = clamp(round(input[r,s] / scale[r//B, s]))
The scale is a 2D array of coefficients with dimensions (ceil(R/B), S). Here // denotes integer (floor) division, so rows 0 through B-1 share scale[0, s], rows B through 2B-1 share scale[1, s], and so on.
Dequantization is performed similarly, except for the pointwise operation that is defined as:
output[r,s] = input[r,s] * scale[r//B, s]
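The block indexing can be sketched in plain Python for a 2-D input with blocking along dimension 0. The function names, the INT4 range of [-8, 7], and the scale values are illustrative assumptions; row r uses the scale at block index r // B.

```python
import math


def block_quantize(data, scale, block_size, qmin=-8, qmax=7):
    """Quantize a 2-D list; element (r, s) uses scale[r // block_size][s]."""
    return [
        [max(qmin, min(qmax, round(v / scale[r // block_size][s])))
         for s, v in enumerate(row)]
        for r, row in enumerate(data)
    ]


def block_dequantize(quantized, scale, block_size):
    """Multiply element (r, s) by the scale of the block that contains row r."""
    return [
        [q * scale[r // block_size][s] for s, q in enumerate(row)]
        for r, row in enumerate(quantized)
    ]


B = 2                          # block size along dimension 0
data = [[1.2, -7.0],           # R=4, S=2
        [-3.5, 3.0],
        [14.0, 0.5],
        [-6.0, -1.75]]
# Hypothetical scales with shape (ceil(R / B), S) = (2, 2).
scale = [[0.5, 1.0],
         [2.0, 0.25]]
assert len(scale) == math.ceil(len(data) / B)

quantized = block_quantize(data, scale, B)
restored = block_dequantize(quantized, scale, B)
```

In this example only 1.2 is not a multiple of its block scale, so it is the only element that changes in the round trip (it comes back as 1.0).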
Quantized Types Rounding Modes
TensorRT primarily uses the round-to-nearest-even method (also known as “banker’s rounding”), which rounds ties to the nearest even value. For example, 2.5 rounds to 2 and 3.5 rounds to 4. This method helps reduce systematic bias in the quantization process, preventing the consistent upward or downward drift that can occur with other rounding strategies.
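The bias argument can be checked with a small experiment. Python's built-in `round()` also rounds ties to even, so it serves as a stand-in here; `round_half_up` is a helper written only for comparison and does not correspond to any TensorRT mode.

```python
import math


def round_half_up(x):
    """Round ties toward +infinity (for comparison only)."""
    return math.floor(x + 0.5)


ties = [k + 0.5 for k in range(8)]        # 0.5, 1.5, ..., 7.5
half_even = [round(t) for t in ties]      # built-in round() rounds ties to even
half_up = [round_half_up(t) for t in ties]

print(half_even)
print(sum(ties), sum(half_even), sum(half_up))
```

Over these eight ties, round-to-nearest-even preserves the sum of the inputs, while rounding every tie upward inflates it by 0.5 per value.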
For more information about rounding modes, refer to Rounding.
Dynamic Quantization
Dynamic Quantization is a form of quantization in which the scales are computed during inference according to the input data. It produces two outputs: quantized data and scales. TensorRT supports Dynamic Quantization only with block quantization granularity.
Dynamic Quantization has two main benefits:
Accuracy: With Dynamic Quantization, TensorRT selects a scale that maps only the dynamic range of a single block to the quantized type. Because the dynamic range of a single block is often much smaller than the dynamic range of the entire tensor, the quantization error is reduced. This is most significant for sub-byte quantized types because of the small range of values representable in these data types.
Reduced PTQ overhead: Because TensorRT computes the scales automatically during inference, you don’t need to calibrate them using sample data.
For each block, the scale is computed by:
\({scale}=max_{i\in \left\{ 0...blockSize-1 \right\}}\left( \frac{abs\left( x_{i} \right)}{qTypeMax} \right)\)
Where:
\(x_{i}\) is the i-th element of the block being quantized.
\({qTypeMax}\) is the maximum value in the quantized type (such as 6 for FP4E2M1).
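A minimal sketch of the per-block scale computation, assuming a flat list of values and FP4E2M1 as the quantized type. The function name and the block size of 4 are chosen only to keep the example short and do not reflect the block sizes TensorRT supports.

```python
FP4_E2M1_MAX = 6.0  # qTypeMax for FP4E2M1


def dynamic_block_scales(x, block_size, q_type_max=FP4_E2M1_MAX):
    """Return one scale per block: max(|x_i|) / qTypeMax."""
    if len(x) % block_size != 0:
        raise ValueError("input length must be a multiple of block_size")
    # Note: an all-zero block yields a scale of 0, which this sketch does not handle.
    return [max(abs(v) for v in x[i:i + block_size]) / q_type_max
            for i in range(0, len(x), block_size)]


x = [0.3, -1.5, 0.75, 3.0, 12.0, -24.0, 6.0, 0.0]
block_scales = dynamic_block_scales(x, block_size=4)
per_tensor_scale = max(abs(v) for v in x) / FP4_E2M1_MAX
print(block_scales, per_tensor_scale)
```

The first block gets a scale eight times smaller than a per-tensor scale would give it, which is the accuracy benefit described above: its small values are spread over the full range of the quantized type instead of being crushed toward zero.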
MX-Compliant Dynamic Quantization
MX-Compliant Dynamic Quantization follows the OCP Microscaling Formats (MX) Specification v1.0. The MX-Compliant recipe performs block quantization: each block of 32 high-precision elements produces 32 quantized output values and one shared E8M0 scaling factor.
TensorRT currently supports MX-Compliant Dynamic Quantization only for the FP8E4M3 vector format, referred to as MXFP8.
The scale computation for a single block is defined as:
\(scale_{E8M0}=round\_up\_to\_e8m0\left( max_{i\in \left\{ 0...blockSize-1 \right\}}\left( \frac{abs\left( x_{i} \right)}{qTypeMax} \right) \right)\)
Where:
\(E8M0\) is an 8-bit exponent-only floating point type, as described in Supported Types.
\(round\_up\_to\_e8m0\) rounds the computed scale up to the smallest power of two that is greater than or equal to it.
\(qTypeMax\) is the maximum value representable in the quantized type used for data.
The scale computation is repeated for each block, computing a total of \(\frac{inputVolume}{blockSize}\) block scales.
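The scale formula can be sketched in plain Python as follows. The function names are illustrative, the value 448 is the FP8E4M3 maximum, and the limits of the E8M0 exponent range are not modeled, so treat this as a sketch of the rounding step rather than a full implementation of the specification.

```python
import math

FP8_E4M3_MAX = 448.0  # qTypeMax for FP8E4M3 data


def round_up_to_e8m0(scale):
    """Smallest power of two >= scale (E8M0 range limits are not modeled)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    mantissa, exponent = math.frexp(scale)  # scale == mantissa * 2**exponent
    if mantissa == 0.5:                     # scale is already a power of two
        exponent -= 1
    return math.ldexp(1.0, exponent)


def mx_block_scales(x, block_size=32):
    """Return one power-of-two scale per block of block_size elements."""
    if len(x) % block_size != 0:
        raise ValueError("input length must be a multiple of block_size")
    return [round_up_to_e8m0(max(abs(v) for v in x[i:i + block_size]) / FP8_E4M3_MAX)
            for i in range(0, len(x), block_size)]


# Two blocks of 32 elements: one with max |x| = 100, one with max |x| = 448.
x = [100.0 if i % 2 == 0 else -3.0 for i in range(32)] + [448.0] * 32
scales = mx_block_scales(x)
print(scales)
```

For the first block, 100 / 448 is roughly 0.223, which rounds up to 0.25. Rounding up rather than to the nearest power of two guarantees that the largest element of the block, divided by the scale, still fits within the FP8E4M3 range.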
Dynamic Double Quantization
Dynamic Double Quantization is a variant of Dynamic Quantization in which the computed scales are also quantized. Putting together the scale computation and scale quantization for a single block:
\(scale_{quantized}=quantize\left( max_{i\in \left\{ 0...blockSize-1 \right\}}\left( \frac{abs\left( x_{i} \right)}{qTypeMax} \right), scale=globalSf \right)\)
Where:
\(globalSf\) is an offline-calibrated per-tensor quantization scale (scalar).
TensorRT currently supports Dynamic Double Quantization only for the NVFP4 vector format (FP4E2M1 data, FP8E4M3 scales, block size of 16).
Using \(qTypeMax=6\) and the FP8 range of [-448,448], the quantized scale can be written as:
\(scale_{fp8}=castToFp8\left( \frac{max_{i\in \left\{ 0...blockSize-1 \right\}}\left( abs\left( x_{i} \right) \right)}{6 \cdot globalSf} \right)\)
The quantized data is computed using block quantization and the computed scales.
To dequantize data that was quantized using Dynamic Double Quantization, two consecutive Dequantize operations are required (which is why it’s called double quantization): the first dequantizes the scales using per-tensor quantization, and the second dequantizes the data.
\(data_{DQ}=dequantize\left( data_{Q},dequantize\left( scale_{Q},scale=globalSf \right) \right)\)
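The full NVFP4-style round trip can be sketched in plain Python. Everything here is a simplified simulation under stated assumptions: the two cast helpers only approximate FP8E4M3 and FP4E2M1 rounding (saturating, no NaN handling, simplified tie rule for FP4), the data is assumed to be quantized with the dequantized FP8 scale, and `global_sf` is a hypothetical calibrated value.

```python
import math

FP4_MAX = 6.0      # largest FP4E2M1 magnitude (qTypeMax)
FP8_MAX = 448.0    # largest FP8E4M3 magnitude
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def cast_to_fp8_e4m3(x):
    """Simplified FP8E4M3 cast: saturate to +-448 and keep 3 mantissa bits."""
    x = max(-FP8_MAX, min(FP8_MAX, x))
    if x == 0.0:
        return 0.0
    _, e = math.frexp(abs(x))              # abs(x) = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)              # -6 is the smallest normal exponent
    step = math.ldexp(1.0, exponent - 3)   # spacing of representable values
    return round(x / step) * step


def cast_to_fp4_e2m1(x):
    """Simplified FP4E2M1 cast: nearest grid magnitude (ties pick the smaller)."""
    magnitude = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(magnitude, x)


def double_quantize(x, global_sf, block_size=16):
    """Return (FP4 data, FP8 block scales) for a flat list x."""
    data_q, scales_q = [], []
    for i in range(0, len(x), block_size):
        block = x[i:i + block_size]
        amax = max(abs(v) for v in block)
        scale_q = cast_to_fp8_e4m3(amax / (FP4_MAX * global_sf))
        # Assumption: data is quantized with the dequantized FP8 scale.
        # An all-zero block gives a zero scale, which this sketch does not handle.
        scale = scale_q * global_sf
        scales_q.append(scale_q)
        data_q.extend(cast_to_fp4_e2m1(v / scale) for v in block)
    return data_q, scales_q


def double_dequantize(data_q, scales_q, global_sf, block_size=16):
    """First dequantize the scales (per-tensor), then the data (per-block)."""
    scales = [s * global_sf for s in scales_q]
    return [q * scales[i // block_size] for i, q in enumerate(data_q)]


global_sf = 0.125  # hypothetical offline-calibrated per-tensor scale
x = [3.0, -1.5, 0.5, 0.25] * 4 + [48.0, -24.0, 8.0, 4.0] * 4  # two blocks of 16
data_q, scales_q = double_quantize(x, global_sf)
restored = double_dequantize(data_q, scales_q, global_sf)
```

The example values were picked so that every intermediate result is exactly representable, which makes the round trip lossless; with arbitrary inputs, both the FP8 cast of the scales and the FP4 cast of the data introduce rounding error.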