Quantize#

Quantize a float input tensor into an integer output tensor. The quantization computation is as follows: \(output_{i_0,..,i_n} = \text{clamp}(\text{round}(\frac{input_{i_0,..,i_n}}{scale}) + \text{zero\_point})\), where the result is clamped to the representable range of the output type (for example, \([-128, 127]\) for int8).
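
As a point of reference, the same computation can be sketched in NumPy (a minimal sketch, not the TensorRT implementation; the clamp bounds assume the default toType of int8):

import numpy as np

def quantize_ref(x, scale, zero_point=0, qmin=-128, qmax=127):
    # round(x / scale) + zero_point, clamped to the output type's representable range
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

print(quantize_ref(np.array([0.56, 0.89, 1.4]), scale=1 / 127))  # [71 113 127]; 1.4 clamps to 127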

Attributes#

axis: The axis along which quantization is performed.

toType: The DataType of the output tensor. Defaults to int8.

Inputs#

input: tensor of type T1.

scale: tensor of type T1 that provides the quantization scale. The scale tensor must be a build-time constant. It must be a scalar for per-tensor quantization, a 1-D tensor for per-channel quantization, or a tensor of the same rank as the input for block quantization (supported for DataType::kINT4 only); see the shape sketch after this list.

zero_point: tensor of type T2 that provides the quantization zero-point. The zero_point tensor is optional and is assumed to be zero if not set. If set, it must contain only zero-valued coefficients and must have the same shape as scale.
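
To make the three scale layouts concrete, here is an illustrative sketch (the input shape is hypothetical, chosen only to show the expected scale shapes):

import numpy as np

# For a hypothetical input of shape (4, 8), quantized along axis 0:
per_tensor_scale = np.array(0.1, dtype=np.float32)    # scalar: one scale for the whole tensor
per_channel_scale = np.ones((4,), dtype=np.float32)   # 1-D: one scale per slice along the axis
block_scale = np.ones((2, 8), dtype=np.float32)       # same rank as input: one scale per block (int4 only)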

Outputs#

output: tensor of type T3.

Data Types#

T1: float16, bfloat16, float32

T2: float32

T3: int4, int8, float8

Shape Information#

input and output are tensors with a shape of \([a_0,...,a_n]\).

scale and zero_point must have the same shape, if zero_point is defined.

Volume Limits#

input, scale, and zero_point can have up to \(2^{31}-1\) elements.

Examples#

Quantize
in1 = network.add_input("input1", dtype=trt.float32, shape=(1, 1, 3, 3))
# Per-tensor scale: 1/127 maps the float range [-1.0, 1.0] onto the int8 range.
scale = network.add_constant(shape=(1,), weights=np.array([1 / 127], dtype=np.float32))
quantize = network.add_quantize(in1, scale.get_output(0))
quantize.axis = 3
dequantize = network.add_dequantize(quantize.get_output(0), scale.get_output(0))
dequantize.axis = 3
network.mark_output(dequantize.get_output(0))

inputs[in1.name] = np.array(
    [
        [
            [
                [0.56, 0.89, 1.4],
                [-0.56, 0.39, 6.0],
                [0.67, 0.11, -3.6],
            ]
        ]
    ]
)

outputs[dequantize.get_output(0).name] = dequantize.get_output(0).shape
expected[dequantize.get_output(0).name] = np.array(
    [
        [
            [
                [0.56, 0.89, 1.0],
                [-0.56, 0.39, 1.0],
                [0.67, 0.11, -1.0],
            ]
        ]
    ]
)
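
Per-Channel Quantization

A per-channel variant in the same style as the examples here (a sketch only, not part of the original example set; the shapes and scale values are illustrative). Each of the two channels along axis 1 gets its own scale coefficient.

in1 = network.add_input("input1", dtype=trt.float32, shape=(1, 2, 3, 3))
# Two channels along axis 1, so scale carries two coefficients.
scale = network.add_constant(shape=(2,), weights=np.array([1 / 127, 2 / 127], dtype=np.float32))
quantize = network.add_quantize(in1, scale.get_output(0))
quantize.axis = 1
dequantize = network.add_dequantize(quantize.get_output(0), scale.get_output(0))
dequantize.axis = 1
network.mark_output(dequantize.get_output(0))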

Block Quantization
in1 = network.add_input("input1", dtype=trt.float32, shape=(1, 8))  # unused by the computation; the data to quantize is the constant below
weights = network.add_constant(shape=(4, 8), weights=np.array([
    [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0],
    [1.1, 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2],
    [4.0, 4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 7.0],
    [4.1, 4.2, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2],
], dtype=np.float32))
# One scale row per block of two `weights` rows (block size 2 along axis 0).
scale = network.add_constant(shape=(2, 8), weights=np.array([
    [1, 1, 2, 2, 3, 3, 4, 4],
    [4, 4, 5, 5, 6, 6, 7, 7],
], dtype=np.float32))
# Block quantization requires the int4 output type, passed here as the third argument.
quantize = network.add_quantize(weights.get_output(0), scale.get_output(0), trt.int4)
dequantize = network.add_dequantize(quantize.get_output(0), scale.get_output(0), trt.float32)
network.mark_output(dequantize.get_output(0))

inputs[in1.name] = np.array(
    [
        [2, 2, 2, 2, 2, 2, 2, 2],
    ]
)

outputs[dequantize.get_output(0).name] = dequantize.get_output(0).shape
expected[dequantize.get_output(0).name] = np.array(
    [
        [1, 1, 2, 2, 3, 3, 4, 4],
        [1, 1, 2, 2, 3, 3, 4, 4],
        [4, 4, 5, 5, 6, 6, 7, 7],
        [4, 4, 5, 5, 6, 6, 7, 7],
    ]
)
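
The block arithmetic can be checked with a short NumPy sketch (a reference computation only; the block size of 2 along axis 0 follows from the weights shape (4, 8) and scale shape (2, 8) above):

import numpy as np

w = np.array([
    [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0],
    [1.1, 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2],
    [4.0, 4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 7.0],
    [4.1, 4.2, 5.1, 5.2, 6.1, 6.2, 7.1, 7.2],
])
s = np.array([
    [1, 1, 2, 2, 3, 3, 4, 4],
    [4, 4, 5, 5, 6, 6, 7, 7],
], dtype=np.float64)

s_full = np.repeat(s, 2, axis=0)          # expand each scale row over its block of two rows
q = np.clip(np.round(w / s_full), -8, 7)  # quantize to the int4 range [-8, 7]
print(q * s_full)                         # dequantized values; matches the expected output above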

C++ API#

For more information about the C++ IQuantizeLayer operator, refer to the C++ IQuantizeLayer documentation.

Python API#

For more information about the Python IQuantizeLayer operator, refer to the Python IQuantizeLayer documentation.