Working with Quantized Types
TensorRT enables high-performance inference by supporting quantization, a technique that reduces model size and accelerates computation by representing floating-point values with lower-precision data types.
Key Benefits:
- Reduces memory footprint
- Improves energy efficiency
- Enables deployment on resource-constrained edge devices
- Achieves greater cost-efficiency in large-scale data center deployments
Quantization Approach: TensorRT uses a symmetric quantization scheme, in which both activations and weights are mapped to quantized values centered on zero. Because the zero point is fixed at zero, converting between quantized and floating-point representations typically requires only a scaling factor.
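The symmetric scheme described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not TensorRT's implementation: the helper names (`compute_scale`, `quantize`, `dequantize`), the per-tensor max-based choice of scale, and the sample values are all assumptions made for the example. It targets the signed INT8 range and shows that zero maps exactly to zero and that only a scale is needed to convert in either direction.

```python
def compute_scale(values, qmax=127):
    """Hypothetical helper: pick a per-tensor scale so the largest magnitude maps to qmax."""
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0


def quantize(values, scale, qmin=-128, qmax=127):
    """Symmetric quantization: divide by the scale, round, clamp. No zero-point offset."""
    # Python's round() rounds halves to the nearest even integer.
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Recover approximate floating-point values by multiplying by the same scale."""
    return [q * scale for q in q_values]


# Sample data chosen so the scale is exactly 0.5 (63.5 / 127).
values = [-63.5, -1.0, 0.0, 0.3, 63.5]
scale = compute_scale(values)
q_values = quantize(values, scale)
restored = dequantize(q_values, scale)

print(scale)     # 0.5
print(q_values)  # [-127, -2, 0, 1, 127]
print(restored)  # [-63.5, -1.0, 0.0, 0.5, 63.5]
```

In this example the input 0.3 comes back as 0.5: the round trip is lossy, and for values inside the representable range the error is bounded by half the scale.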
Supported Data Types:
INT8 (signed 8-bit integer)
INT4 (signed 4-bit integer, weight-only quantization)
FP8E4M3 (FP8, 8-bit floating point with 4 exponent and 3 mantissa bits)
FP4E2M1 (FP4, 4-bit floating point with 2 exponent and 1 mantissa bit)
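To make the bit layouts in the list above concrete, the following sketch decodes every 4-bit code of an E2M1 format into its floating-point value. The source states only the bit counts (2 exponent bits, 1 mantissa bit). The exponent bias of 1, the subnormal handling, and the absence of infinity and NaN encodings are assumptions taken from the common E2M1 definition, and the function name `decode_fp4_e2m1` is hypothetical.

```python
def decode_fp4_e2m1(code):
    """Decode a 4-bit code (0..15) laid out as: 1 sign bit | 2 exponent bits | 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    bias = 1  # assumed exponent bias, not stated in the text above

    if exponent == 0:
        # Subnormal: no implicit leading 1, so the only values are 0 and 0.5.
        magnitude = (mantissa / 2) * 2.0 ** (1 - bias)
    else:
        # Normal: implicit leading 1 plus a half-step from the single mantissa bit.
        magnitude = (1 + mantissa / 2) * 2.0 ** (exponent - bias)
    return sign * magnitude


# Codes 0..7 have the sign bit clear, so they give the non-negative values.
positive_values = [decode_fp4_e2m1(code) for code in range(8)]
print(positive_values)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Under these assumptions the format has only eight distinct magnitudes, with a largest value of 6.0, so choosing the scale carefully matters even more for 4-bit types than for INT8.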
These low-precision formats let TensorRT run inference more efficiently while largely preserving accuracy, which suits both resource-constrained environments and high-throughput applications.
For builder-level precision flags, scaling modes, and strongly typed networks, refer to Precision Control.