Introduction to Quantization¶
What is Quantization?¶
Quantization is the process of converting continuous values to a discrete set of values using linear or non-linear scaling techniques.
High precision is necessary during training for fine-grained weight updates.
High precision is usually not necessary during inference and may hinder the deployment of AI models in real-time applications and/or on resource-limited devices.
INT8 is computationally less expensive and has a lower memory footprint.
INT8 precision results in faster inference with similar model accuracy.
See the whitepaper for more detailed explanations.
Let [β, α] be the range of representable real values chosen for quantization and b be the bit-width of the signed integer representation.
The goal of uniform quantization is to map real values in the range [β, α] to lie within $[-2^{b-1}, 2^{b-1} - 1]$. Real values that lie outside [β, α] are clipped to the nearest bound.
Considering 8 bit quantization (b=8), a real value x within range [β, α] is quantized to lie within the quantized range [-128, 127] (see source):
$$x_q = \mathrm{clamp}(\mathrm{round}(x / \mathrm{scale}) + \mathrm{zeroPt})$$
where scale $= (\alpha - \beta) / (2^b - 1)$ and
zeroPt $= -\mathrm{round}(\beta / \mathrm{scale}) - 2^{b-1}$.
round is a function that rounds a value to the nearest integer. The quantized value is then clamped to the range [-128, 127].
DeQuantization is the reverse process of quantization (see source):
$$x = (x_q - \mathrm{zeroPt}) \cdot \mathrm{scale}$$
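As a rough illustration of the affine scheme described above, the following NumPy sketch quantizes and dequantizes a few values. The function names `affine_quantize` and `affine_dequantize` are hypothetical and not part of any toolkit API. The sketch assumes the convention x_q = clamp(round(x / scale) + zeroPt), with zeroPt chosen so that β maps to the lowest integer. Note that `np.round` rounds ties to even.

```python
import numpy as np

def affine_quantize(x, beta, alpha, bits=8):
    """Quantize reals in [beta, alpha] to signed integers (illustrative sketch)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (alpha - beta) / (2 ** bits - 1)
    # Assumed convention: beta should land on qmin, so divide beta by scale.
    zero_pt = -int(np.round(beta / scale)) - 2 ** (bits - 1)
    # Real values outside [beta, alpha] are clipped to the nearest bound.
    x = np.clip(np.asarray(x, dtype=np.float64), beta, alpha)
    x_q = np.clip(np.round(x / scale) + zero_pt, qmin, qmax).astype(np.int32)
    return x_q, scale, zero_pt

def affine_dequantize(x_q, scale, zero_pt):
    """Map quantized integers back to approximate real values."""
    return (np.asarray(x_q, dtype=np.float64) - zero_pt) * scale

x_q, scale, zero_pt = affine_quantize([-1.0, 0.0, 3.0, 5.0], beta=-1.0, alpha=3.0)
print(x_q, zero_pt)
print(affine_dequantize(x_q, scale, zero_pt))
```

In this example the real value 0.0 maps to the integer zeroPt and dequantizes back to exactly 0.0, while 5.0 is clipped to α before quantization.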
Quantization in TensorRT¶
TensorRT™ only supports symmetric uniform quantization, meaning that zeroPt=0 (i.e. the quantized value of 0.0 is always 0).
Considering 8 bit quantization (b=8), a real value within range [min_float, max_float] is quantized to lie within the quantized range [-127, 127], opting not to use -128 in favor of symmetry. It is important to note that we lose 1 value in the symmetric quantization representation; however, losing 1 out of 256 representable values for 8 bit quantization is insignificant.
The mathematical representation for symmetric quantization (zeroPt = 0) is:
$$x_q = \mathrm{clamp}(\mathrm{round}(x / \mathrm{scale}))$$
Since TensorRT supports only a symmetric range, the scale is calculated using the max absolute value:
Let α = max(abs(min_float), abs(max_float)). Then
scale $= \alpha / (2^{b-1} - 1)$
Rounding type is rounding-to-nearest ties-to-even.
The quantized value is then clamped between -127 and 127.
DeQuantization: symmetric dequantization is the reverse process of symmetric quantization:
$$x = x_q \cdot \mathrm{scale}$$
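The symmetric scheme can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic described above, not TensorRT's implementation. The function names `symmetric_quantize` and `symmetric_dequantize` are hypothetical, and the sketch assumes α is the larger of the absolute values of the range bounds. `np.rint` rounds to nearest with ties to even, matching the rounding type stated in the text.

```python
import numpy as np

def symmetric_quantize(x, min_float, max_float, bits=8):
    """Symmetric quantization with zeroPt = 0 (illustrative sketch)."""
    alpha = max(abs(min_float), abs(max_float))   # max absolute real value
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits
    scale = alpha / qmax
    # Round to nearest (ties to even), then clamp to [-qmax, qmax].
    x_q = np.clip(np.rint(np.asarray(x, dtype=np.float64) / scale), -qmax, qmax)
    return x_q.astype(np.int32), scale

def symmetric_dequantize(x_q, scale):
    """Reverse of symmetric quantization: multiply by the scale."""
    return np.asarray(x_q, dtype=np.float64) * scale

# With alpha = 127 the scale is exactly 1.0, which makes the tie-breaking visible.
x_q, scale = symmetric_quantize([0.5, 1.5, 2.5, 200.0, -200.0], -100.0, 127.0)
print(x_q, scale)
print(symmetric_dequantize(x_q, scale))
```

The value -128 is never produced, so the quantized range stays symmetric around 0.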
The scaling factor divides a given range of real values into a number of partitions.
Let's understand the intuition behind the scaling factor formula by taking 3 bit quantization as an example.
Real values range: [β, α]
Quantized values range: $[-2^{3-1}, 2^{3-1} - 1]$
i.e. [-4, -3, -2, -1, 0, 1, 2, 3]
As expected, there are 8 ($2^3$) quantized values for 3 bit quantization.
Scale divides the range into partitions. There are 7 ($2^3 - 1$) partitions for 3 bit quantization.
scale $= (\alpha - \beta) / (2^3 - 1)$
Symmetric quantization brings in two changes:
Real values are no longer free but are restricted to [-α, α],
where α = max(abs(min_float), abs(max_float)) is the maximum absolute real value.
One value from the quantization range is dropped in favor of symmetry, leading to a new range [-3, -2, -1, 0, 1, 2, 3].
There are now 6 ($2^3 - 2$) partitions (unlike 7 for asymmetric quantization).
Scale divides the range into partitions.
scale $= 2\alpha / (2^3 - 2) = \alpha / (2^{3-1} - 1)$
A similar intuition holds true for b bit quantization.
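The partition counting above can be checked with a small Python sketch. The helper `scale_and_partitions` is a hypothetical name introduced only for illustration. It lists the integer levels for a given bit-width, counts the gaps between them, and divides the real range by that count.

```python
def scale_and_partitions(beta, alpha, bits, symmetric=False):
    """Return (quantized levels, number of partitions, scale)."""
    if symmetric:
        # Real range becomes [-a, a]; the lowest integer level is dropped.
        a = max(abs(beta), abs(alpha))
        levels = list(range(-(2 ** (bits - 1) - 1), 2 ** (bits - 1)))
        real_span = 2 * a
    else:
        levels = list(range(-(2 ** (bits - 1)), 2 ** (bits - 1)))
        real_span = alpha - beta
    partitions = len(levels) - 1   # gaps between adjacent levels
    return levels, partitions, real_span / partitions

# 3 bit example: 8 levels and 7 partitions (asymmetric),
# 7 levels and 6 partitions (symmetric).
print(scale_and_partitions(-4.0, 3.0, bits=3))
print(scale_and_partitions(-4.0, 3.0, bits=3, symmetric=True))
```

For the symmetric case the returned scale equals α divided by 3, which agrees with the closed form for b = 3.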
Quantization Zero Point¶
zeroPt is of the same type as the quantized values $x_q$, and is in fact the quantized value $x_q$ corresponding to the real value 0. This allows us to automatically meet the requirement that the real value r = 0 be exactly representable by a quantized value. The motivation for this requirement is that efficient implementation of neural network operators often requires zero-padding of arrays around boundaries.
If the data contains negative values, the zero point can offset the range. For example, with a zero point of 128, unscaled negative values -127 to -1 would be represented by 1 to 127, and non-negative values 0 to 127 would be represented by 128 to 255.
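The offset example above can be reproduced with a short NumPy sketch. The helper names `to_unsigned` and `to_signed` are hypothetical. The sketch only shows the shift by the zero point and ignores scaling.

```python
import numpy as np

ZERO_PT = 128  # zero point taken from the example in the text

def to_unsigned(values, zero_pt=ZERO_PT):
    """Shift signed values by the zero point so they can be stored as uint8."""
    return (np.asarray(values, dtype=np.int16) + zero_pt).astype(np.uint8)

def to_signed(stored, zero_pt=ZERO_PT):
    """Undo the shift to recover the original signed values."""
    return np.asarray(stored).astype(np.int16) - zero_pt

stored = to_unsigned([-127, -1, 0, 127])
print(stored)             # values 1, 127, 128, 255
print(to_signed(stored))  # back to -127, -1, 0, 127
```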