DynamicQuantize¶
This layer performs dynamic block quantization of its input tensor and outputs both the quantized data and the computed block scale factors. The block size is currently limited to 16; the size of the blocked axis must be divisible by 16 and must be a compile-time constant.
The first input (index 0) is the tensor to be quantized. Its data type must be one of DataType::kFLOAT, DataType::kHALF, or DataType::kBF16.
The second input (index 1) is the double-quantization scale factor: a scalar scale factor used to quantize the computed block scale factors.
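The two-level scheme can be emulated with numpy to build intuition: each block of 16 elements gets a scale chosen so that the block's maximum magnitude maps to the largest FP4 (E2M1) value, 6.0, and the scales themselves are then quantized using the double-quantization scale factor. The sketch below is illustrative only: round_to_fp4 and dynamic_block_quantize are hypothetical helpers, rounding is simplified to round-to-nearest, and the FP8 encoding of the scales is not modeled.
import numpy as np

# Magnitudes representable in FP4 (E2M1).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0])
FP4_MAX = 6.0

def round_to_fp4(x):
    # Hypothetical helper: round each element to the nearest FP4 magnitude, keeping the sign.
    idx = np.abs(np.abs(x)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx]

def dynamic_block_quantize(x, block_size=16, dbl_quant_scale=1.0):
    # Hypothetical emulation; assumes blocking along the last axis and a size divisible by block_size.
    blocks = x.reshape(-1, block_size)
    # One scale per block: map the block's maximum magnitude to FP4_MAX.
    scales = np.abs(blocks).max(axis=1) / FP4_MAX
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid dividing by zero for all-zero blocks
    q_data = round_to_fp4(blocks / scales[:, None]).reshape(x.shape)
    # TensorRT additionally quantizes the scales to FP8 after dividing by the
    # double-quantization scale factor; that rounding step is not modeled here.
    q_scales = scales / dbl_quant_scale
    return q_data, q_scales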
Attributes¶
axis
The axis that is sliced into blocks. It must be the last or second-to-last dimension.
block_size
The number of elements that are quantized using a shared scale factor. Currently only blocks of 16 elements are supported.
output_type
The data type of the quantized output tensor. It must be DataType::kFP4.
scale_type
The data type of the scale factors used for quantizing the input data. It must be DataType::kFP8.
Inputs¶
input: tensor of type T1.
double_quant_scale: tensor of type T1 or float32. Provides a scalar double-quantization scale factor and must be a build-time constant.
Outputs¶
output_data: tensor of type T2.
output_scale: tensor of type T3.
Data Types¶
T1: float16, bfloat16, float32
T2: float4
T3: float8
Shape Information¶
input and output_data are 2D or 3D tensors of the same shape.
output_scale has the same shape as input, except for the dimension specified by axis, in which blocking occurs. E.g. for input shape (D0, D1, D2) and axis=2, the output_scale shape is (D0, D1, D2/block_size).
double_quant_scale is a tensor containing a single scalar.
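For instance, the example below quantizes an input of shape (2, 32) along axis=1 with block_size=16, so output_data has shape (2, 32) and output_scale has shape (2, 2).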
DLA Support¶
Not supported.
Examples¶
DynamicQuantize
import numpy as np
import tensorrt as trt

# `network`, `inputs`, `outputs`, and `expected` are assumed to come from the
# sample harness shared by these operator examples.
in1 = network.add_input("input", trt.float32, (2, 32))
# Define double quant scale factor
dbl_quant_scale_data = np.asarray([1], dtype=np.float32)
dbl_quant_scale = network.add_constant(shape=(1,), weights=dbl_quant_scale_data).get_output(0)
# Add the DynamicQuantizeLayer
block_size = 16
axis = 1
dynq = network.add_dynamic_quantize(in1, axis, block_size, trt.fp4, trt.fp8)
dynq.set_input(1, dbl_quant_scale)
data_f4 = dynq.get_output(0)
scale_f8 = dynq.get_output(1)
# Dequantize the scale (per-tensor)
dequantize_scale = network.add_dequantize(scale_f8, dbl_quant_scale, trt.float32)
scale_f32 = dequantize_scale.get_output(0)
# Dequantize the data (per-block)
dequantize_data = network.add_dequantize(data_f4, scale_f32, trt.float32)
data_dq = dequantize_data.get_output(0)
# Mark dequantized data as network output
network.mark_output(data_dq)
inputs[in1.name] = np.array(
[
[0.0, 0.3, 0.6, 1.0, 1.3, 1.6, 1.9, 2.3, 2.6, 2.9, 3.2, 3.5, 3.9, 4.2, 4.5, 4.8,
5.2, 5.5, 5.8, 6.1, 6.5, 6.8, 7.1, 7.4, 7.7, 8.1, 8.4, 8.7, 9., 9.4, 9.7, 10.0],
[3.0, 3.3, 3.6, 4.0, 3.3, 3.6, 3.9, 3.3, 2.6, 2.9, 3.2, 3.5, 3.9, 4.2, 4.5, 4.8,
-5.2, -5.5, -5.8, -6.1, -5.5, -5.8, 5.1, 5.4, 5.7, 5.1, 5.4, 5.7, 6., 6.4, 4.7, 6.0],
]
)
outputs[dequantize_data.get_output(0).name] = dequantize_data.get_output(0).shape
expected[dequantize_data.get_output(0).name] = np.array(
[
[0.00, 0.41, 0.41, 0.81, 1.22, 1.62, 1.62, 2.44, 2.44, 3.25, 3.25, 3.25, 3.25, 4.87, 4.87, 4.87,
4.87, 4.87, 6.50, 6.50, 6.50, 6.50, 6.50, 6.50, 6.50, 6.50, 9.75, 9.75, 9.75, 9.75, 9.75, 9.75],
[3.25, 3.25, 3.25, 3.25, 3.25, 3.25, 3.25, 3.25, 2.44, 3.25, 3.25, 3.25, 3.25, 4.87, 4.87, 4.87,
-4.50, -4.5, -6.75, -6.75, -4.5, -6.75, 4.5, 4.5, 6.75, 4.5, 4.5, 6.75, 6.75, 6.75, 4.5, 6.75]
]
)
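The expected values land on the FP4 grid scaled by each block's FP8-quantized scale. For example, the first block of row 0 has a maximum magnitude of 4.8, so its block scale 4.8/6 = 0.8 rounds to 0.8125 in FP8 (E4M3); after dequantization the block's values are products such as 4 * 0.8125 = 3.25 and 6 * 0.8125 = 4.87.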
C++ API¶
For more information about the C++ IDynamicQuantizeLayer operator, refer to the C++ IDynamicQuantizeLayer documentation.
Python API¶
For more information about the Python IDynamicQuantizeLayer operator, refer to the Python IDynamicQuantizeLayer documentation.