DynamicQuantize¶
This layer performs dynamic block quantization of its input tensor and outputs the quantized data and the computed block scale factors.
The layer can be used in two modes:
- 1D block quantization: The block size is specified by the block_size attribute. The axis that is sliced into blocks is specified by the axis attribute.
- 2D block quantization: The block shape is specified by the block_shape attribute.
For 1D block quantization, the block size is currently limited to 16 for FP4 (NVFP4) or 32 for FP8 (MXFP8), and the extent of the blocked axis must be a compile-time constant that is divisible by block_size.
The first input (index 0) is the tensor to be quantized.
The second input (index 1) is the double quantization scale factor. It is a scalar scale factor used to quantize the computed block scale factors.
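Conceptually, each block is scaled so that its largest magnitude maps onto the maximum of the quantized type, and the resulting block scale factors are in turn quantized using the double quantization scale. The following NumPy sketch illustrates the 1D scheme; the amax-based scale selection and the FP4 maximum of 6.0 (E2M1) are assumptions for illustration, not a description of TensorRT's exact kernel.

import numpy as np

FP4_MAX = 6.0  # assumed largest magnitude representable in FP4 (E2M1)

def dynamic_block_quantize_1d(x, block_size, dbl_quant_scale):
    blocks = x.reshape(-1, block_size)
    # One scale per block, chosen so the block amax maps to FP4_MAX.
    block_scale = np.abs(blocks).max(axis=1) / FP4_MAX
    block_scale = np.where(block_scale == 0.0, 1.0, block_scale)  # avoid /0
    # Double quantization: the block scales are themselves scaled by the
    # scalar double_quant_scale (TensorRT stores them as FP8).
    output_scale = block_scale / dbl_quant_scale
    # Quantized data (TensorRT casts this to FP4).
    output_data = (blocks / block_scale[:, None]).reshape(x.shape)
    return output_data, output_scale

With dbl_quant_scale = 1.0 this reduces to plain per-block quantization; a non-unit value trades scale-factor precision for range when the block scales are stored in FP8.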
Attributes¶
axis The axis that is sliced into blocks. The axis must be the last or the second-to-last dimension.
block_size The number of elements that are quantized using a shared scale factor. Currently blocks of 16 elements (FP4) or 32 elements (FP8) are supported.
block_shape The shape of the block to be quantized. The shape must be a compile-time constant, and its rank must be the same as the rank of the input tensor. A value of -1 means that the block extent in that dimension is set to the same extent as the input tensor.
output_type The data type of the quantized output tensor, must be DataType::kFP4 or DataType::kFP8.
scale_type The data type of the scale factor used for quantizing the input data, must be DataType::kFP8 or DataType::kFLOAT.
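For instance, a -1 entry in block_shape resolves to the full extent of the corresponding input dimension. A minimal sketch of the resolution rule (the helper is hypothetical, shown only to illustrate the semantics):

def resolve_block_shape(input_shape, block_shape):
    # A -1 block extent spans the entire corresponding input dimension.
    return tuple(d if b == -1 else b for d, b in zip(input_shape, block_shape))

resolve_block_shape((8, 64), (-1, 32))  # -> (8, 32): each block spans all rows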
Inputs¶
input: tensor of type T1.
double_quant_scale: tensor of type T1 or float32. Provides a scalar double-quantization scale. The double_quant_scale tensor must be a build-time constant.
Outputs¶
output_data: tensor of type T2.
output_scale: tensor of type T3.
Data Types¶
T1: float16, bfloat16, float32
T2: float4, float8
T3: float8, float32
Shape Information¶
input and output_data are 2D or 3D tensors of the same shape. For 2D block quantization, 4D I/O tensors are also supported.
output_scale has the same rank as input. Its shape is obtained from the input shape by dividing each blocked dimension by the block size, as defined either by block_size combined with axis or by the corresponding extent of block_shape.
For example, for input shape (D0, D1, D2) and axis=2, the output_scale shape is (D0, D1, D2/block_size).
For 2D block quantization with input shape (D0, ..., Dn) and block_shape=(B0, B1, ..., Bn), the output_scale shape is (D0/B0, D1/B1, ..., Dn/Bn).
double_quant_scale is a tensor containing a single scalar.
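The hypothetical helpers below illustrate these shape rules. Ceiling division is used in the 2D case to cover a possible ragged final block, as in the 2D example below where block_shape=(4, 3) tiles an 8x8 input; this rounding behavior is an assumption, not a documented guarantee.

def scale_shape_1d(input_shape, axis, block_size):
    shape = list(input_shape)
    shape[axis] //= block_size  # the blocked axis must divide evenly in 1D mode
    return tuple(shape)

def scale_shape_2d(input_shape, block_shape):
    # -1 resolves to the full dimension extent; -(-d // b) is ceil(d / b).
    return tuple(-(-d // (d if b == -1 else b))
                 for d, b in zip(input_shape, block_shape))

scale_shape_1d((2, 32), axis=1, block_size=16)  # -> (2, 2)
scale_shape_2d((8, 8), (4, 3))                  # -> (2, 3)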
DLA Support¶
Not supported.
Examples¶
DynamicQuantize
in1 = network.add_input("input", trt.float32, (2, 32))
# Define double quant scale factor
dbl_quant_scale_data = np.asarray([1], dtype=np.float32)
dbl_quant_scale = network.add_constant(shape=(1,), weights=dbl_quant_scale_data).get_output(0)
# Add the DynamicQuantizeLayer
block_size = 16
axis = 1
dynq = network.add_dynamic_quantize(in1, axis, block_size, trt.fp4, trt.fp8)
dynq.set_input(1, dbl_quant_scale)
data_f4 = dynq.get_output(0)
scale_f8 = dynq.get_output(1)
# Dequantize the scale (per-tensor)
dequantize_scale = network.add_dequantize(scale_f8, dbl_quant_scale, trt.float32)
scale_f32 = dequantize_scale.get_output(0)
# Dequantize the data (per-block)
dequantize_data = network.add_dequantize(data_f4, scale_f32, trt.float32)
data_dq = dequantize_data.get_output(0)
# Mark dequantized data as network output
network.mark_output(data_dq)
inputs[in1.name] = np.array(
[
[0.0, 0.3, 0.6, 1.0, 1.3, 1.6, 1.9, 2.3, 2.6, 2.9, 3.2, 3.5, 3.9, 4.2, 4.5, 4.8,
5.2, 5.5, 5.8, 6.1, 6.5, 6.8, 7.1, 7.4, 7.7, 8.1, 8.4, 8.7, 9., 9.4, 9.7, 10.0],
[3.0, 3.3, 3.6, 4.0, 3.3, 3.6, 3.9, 3.3, 2.6, 2.9, 3.2, 3.5, 3.9, 4.2, 4.5, 4.8,
-5.2, -5.5, -5.8, -6.1, -5.5, -5.8, 5.1, 5.4, 5.7, 5.1, 5.4, 5.7, 6., 6.4, 4.7, 6.0],
]
)
outputs[dequantize_data.get_output(0).name] = dequantize_data.get_output(0).shape
expected[dequantize_data.get_output(0).name] = np.array(
[
[0.00, 0.41, 0.41, 0.81, 1.22, 1.62, 1.62, 2.44, 2.44, 3.25, 3.25, 3.25, 3.25, 4.87, 4.87, 4.87,
4.87, 4.87, 6.50, 6.50, 6.50, 6.50, 6.50, 6.50, 6.50, 6.50, 9.75, 9.75, 9.75, 9.75, 9.75, 9.75],
[3.25, 3.25, 3.25, 3.25, 3.25, 3.25, 3.25, 3.25, 2.44, 3.25, 3.25, 3.25, 3.25, 4.87, 4.87, 4.87,
-4.50, -4.5, -6.75, -6.75, -4.5, -6.75, 4.5, 4.5, 6.75, 4.5, 4.5, 6.75, 6.75, 6.75, 4.5, 6.75]
]
)
2D Block DynamicQuantize
in1 = network.add_input("input", trt.float32, (8, 8))
# Add the DynamicQuantizeLayer
block_shape = trt.Dims([4, 3])
dynq = network.add_dynamic_quantize_v2(in1, block_shape, trt.fp8, trt.float32)
data_f8 = dynq.get_output(0)
scale_f32 = dynq.get_output(1)
# Dequantize the data (per-block)
dequantize_data = network.add_dequantize(data_f8, scale_f32, trt.float32)
dequantize_data.block_shape = block_shape
data_dq = dequantize_data.get_output(0)
# Mark dequantized data as network output
network.mark_output(data_dq)
inputs[in1.name] = np.array(
[
[0.0, 0.3, 0.6, 1.0, 1.3, 1.6, 1.9, 2.3, 2.6, 2.9, 3.2, 3.5, 3.9, 4.2, 4.5, 4.8,
5.2, 5.5, 5.8, 6.1, 6.5, 6.8, 7.1, 7.4, 7.7, 8.1, 8.4, 8.7, 9., 9.4, 9.7, 10.0,
3.0, 3.3, 3.6, 4.0, 3.3, 3.6, 3.9, 3.3, 2.6, 2.9, 3.2, 3.5, 3.9, 4.2, 4.5, 4.8,
-5.2, -5.5, -5.8, -6.1, -5.5, -5.8, 5.1, 5.4, 5.7, 5.1, 5.4, 5.7, 6., 6.4, 4.7, 6.0],
]
).reshape(8, 8)
outputs[dequantize_data.get_output(0).name] = dequantize_data.get_output(0).shape
expected[dequantize_data.get_output(0).name] = np.array(
[
[ 0. , 0.3 , 0.6 , 1.01, 1.26, 1.68, 1.96, 2.32],
[ 2.7 , 3. , 3.3 , 3.36, 4.03, 4.36, 4.64, 4.64],
[ 5.4 , 5.4 , 6. , 6.04, 6.71, 6.71, 7.14, 7.14],
[ 7.8 , 8.4 , 8.4 , 8.73, 8.73, 9.4 , 10. , 10. ],
[ 2.9 , 3.31, 3.73, 4.11, 3.2 , 3.66, 3.86, 3.21],
[ 2.69, 2.9 , 3.11, 3.43, 4.11, 4.11, 4.29, 4.71],
[-5.39, -5.39, -5.8 , -5.94, -5.49, -5.94, 5.14, 5.57],
[ 5.8 , 4.97, 5.39, 5.49, 5.94, 6.4 , 4.71, 6. ]
])
C++ API¶
For more information about the C++ IDynamicQuantizeLayer operator, refer to the C++ IDynamicQuantizeLayer documentation.
Python API¶
For more information about the Python IDynamicQuantizeLayer operator, refer to the Python IDynamicQuantizeLayer documentation.