Working with Quantized Types

TensorRT enables high-performance inference by supporting quantization, a technique that reduces model size and accelerates computation by representing floating-point values with lower-precision data types.

Key Benefits:

Reduces memory footprint
Improves energy efficiency
Enables deployment on resource-constrained edge devices
Achieves greater cost-efficiency in large-scale data center deployments

Quantization Approach: TensorRT uses a symmetric quantization scheme, in which both activations and weights map to quantized values centered on zero. Because no zero-point offset is needed, converting between quantized and floating-point representations typically requires only a scaling factor.
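As a minimal sketch of what "only a scaling factor" means for INT8, the plain-Python example below quantizes and dequantizes a few values. It does not call TensorRT. The helper names (`quantize_symmetric`, `dequantize_symmetric`, `scale_from_amax`) and the choice of deriving the scale from the largest absolute value are illustrative assumptions, not TensorRT APIs.

```python
def scale_from_amax(values, qmax=127):
    """Hypothetical helper: pick a scale so the largest magnitude maps to qmax."""
    amax = max(abs(v) for v in values)
    return amax / qmax


def quantize_symmetric(values, scale, qmin=-128, qmax=127):
    """Map floats to signed integers using one scale and no zero-point offset.

    Python's built-in round() rounds ties to even; whether that matches a
    given TensorRT version should be checked against its documentation.
    """
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize_symmetric(qvalues, scale):
    """Recover approximate floats by multiplying by the same scale."""
    return [q * scale for q in qvalues]


activations = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale = scale_from_amax(activations)
quantized = quantize_symmetric(activations, scale)
restored = dequantize_symmetric(quantized, scale)

for original, q, approx in zip(activations, quantized, restored):
    print(f"{original:+.4f} -> {q:+4d} -> {approx:+.4f}")
```

Zero always maps to the integer 0, and values inside the clamping range come back within half a scale step of the original. Values outside the range saturate at the integer limits.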

Supported Data Types:

INT8 (signed 8-bit integer)
INT4 (signed 4-bit integer, weight-only quantization)
FP8E4M3 (FP8, 8-bit floating point with 4 exponent and 3 mantissa bits)
FP4E2M1 (FP4, 4-bit floating point with 2 exponent and 1 mantissa bit)
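To make the bit layouts above concrete, the following sketch computes the signed-integer ranges and enumerates every magnitude that a 4-bit E2M1 pattern can encode. It is plain Python and does not use TensorRT. The decoder `decode_fp4_e2m1` is a hypothetical helper, and it assumes an exponent bias of 1, subnormal handling when the exponent field is zero, and no Inf or NaN encodings, following the common E2M1 convention.

```python
def decode_fp4_e2m1(code):
    """Decode a 4-bit pattern laid out as sign(1) | exponent(2) | mantissa(1).

    Assumes exponent bias 1 and no Inf/NaN encodings (illustrative assumption).
    """
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the value is 0 or 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 plus a half-step from the mantissa bit.
        magnitude = (1 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Two's-complement ranges for the signed integer types.
int8_range = (-(2 ** 7), 2 ** 7 - 1)   # (-128, 127)
int4_range = (-(2 ** 3), 2 ** 3 - 1)   # (-8, 7)

# All distinct magnitudes across the 16 possible FP4 bit patterns.
fp4_magnitudes = sorted({abs(decode_fp4_e2m1(c)) for c in range(16)})

print(int8_range, int4_range)
print(fp4_magnitudes)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Under these assumptions FP4E2M1 has only eight distinct magnitudes, which is why a scaling factor is needed to fit tensor values into such a narrow set.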

These low-precision formats let TensorRT run inference efficiently with limited accuracy loss, which suits both resource-constrained environments and high-throughput applications.

For builder-level precision flags, scaling modes, and strongly typed networks, refer to Precision Control.