Working with Quantized Types

TensorRT-RTX supports the reduced-precision data types INT4, INT8, FP4, and FP8 for improved performance at the cost of accuracy. FP8 is only supported for matrix multiplications (GEMM) on Ada and later architectures (compute capability 8.9 or later), while FP4 is only supported for matrix multiplications on Blackwell or later architectures (compute capability 10.0 or later).
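The architecture limits above can be sketched as a small lookup. This is a minimal illustration, not part of the TensorRT-RTX API: the function name `supported_quantized_types` is hypothetical, and it assumes INT4 and INT8 carry no compute-capability restriction because this section states none.

```python
from __future__ import annotations


# Hypothetical helper for illustration only; not a TensorRT-RTX function.
def supported_quantized_types(major: int, minor: int) -> set[str]:
    """Return the quantized types usable at a given CUDA compute capability."""
    capability = (major, minor)
    # Assumption: no architecture restriction is stated for INT4 and INT8.
    types = {"INT4", "INT8"}
    if capability >= (8, 9):
        # Ada or later: FP8 matrix multiplications.
        types.add("FP8")
    if capability >= (10, 0):
        # Blackwell or later: FP4 matrix multiplications.
        types.add("FP4")
    return types


print(sorted(supported_quantized_types(8, 6)))
print(sorted(supported_quantized_types(8, 9)))
print(sorted(supported_quantized_types(10, 0)))
```

Comparing `(major, minor)` as a tuple avoids the rounding problems that can occur when a compute capability such as 8.9 is stored as a float.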

You must explicitly select reduced precision (quantization) for each layer that should use it. To do this, insert IQuantizeLayer and IDequantizeLayer (Q/DQ) nodes in the graph. You can perform quantization during training (Quantization-Aware Training, QAT) or in a separate postprocessing step (Post-Training Quantization, PTQ).
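To make the effect of a Q/DQ pair concrete, the following pure-Python sketch reproduces the arithmetic such a pair performs on a tensor. It is not TensorRT-RTX code. It assumes per-tensor symmetric INT8 quantization (zero point of 0) with round-to-nearest-even and saturation, and the function names, the sample values, and the scale are all illustrative.

```python
def quantize(values, scale, qmin=-128, qmax=127):
    """Q step: divide by the scale, round (ties to even), saturate to INT8."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(qvalues, scale):
    """DQ step: map the stored integers back to floating point."""
    return [q * scale for q in qvalues]


# Illustrative inputs; a power-of-two scale keeps the arithmetic exact.
activations = [0.02, -0.5, 1.0, 3.0]
scale = 2.0 ** -6  # 0.015625, an assumed value; real scales come from QAT or PTQ

q = quantize(activations, scale)
dq = dequantize(q, scale)

print(q)   # [1, -32, 64, 127]
print(dq)  # [0.015625, -0.5, 1.0, 1.984375]
```

The first value shows rounding error (0.02 becomes 0.015625) and the last shows saturation (3.0 is clamped to 127 and comes back as 1.984375). These two effects are the accuracy cost that QAT or PTQ calibration tries to minimize by choosing a good scale.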

Several popular deep learning frameworks support model quantization using either QAT or PTQ.

You can encode information about which layers to quantize in an ONNX model file using the ONNX QuantizeLinear and DequantizeLinear operators, and import it using the TensorRT-RTX ONNX parser.

Finally, the NVIDIA TensorRT Model Optimizer (see examples/windows in the NVIDIA/TensorRT-Model-Optimizer repository on GitHub) is an open-source tool developed specifically to help TensorRT and TensorRT-RTX users add quantization to a pretrained model to achieve higher performance. Refer to the Strongly Typed Networks and Explicit Quantization section if you are porting an existing FP32 model.