Basic tensor quantization functions



pytorch_quantization.tensor_quant.QuantDescriptor

alias of pytorch_quantization.tensor_quant.ScaledQuantDescriptor


class pytorch_quantization.tensor_quant.ScaledQuantDescriptor(num_bits=8, name=None, **kwargs)[source]

Descriptor of quantization

Describes how a tensor should be quantized. A QuantDescriptor together with a tensor defines a quantized tensor.

Parameters
  • num_bits – An integer. Number of bits of quantization, used to calculate the scaling factor. Default 8.

  • name – A string. Optional name for the descriptor.

Keyword Arguments
  • fake_quant – A boolean. If True, use fake quantization mode. Default True.

  • axis – None, an int, or a tuple of ints. The axes which will have their own max for computing the scaling factor. If None (the default), use a per-tensor scale. Must be in the range [-rank(input_tensor), rank(input_tensor)). e.g., for a KCRS weight tensor, axis=(0,) yields per-channel scaling. Default None.

  • amax – A float or a list/ndarray of floats giving a user-specified absolute max range. If supplied, axis is ignored and this value is used to quantize. If learn_amax is True, it is used to initialize the learnable amax. Default None.

  • learn_amax – A boolean. If True, learn amax. Default False.

  • scale_amax – A float. If supplied, amax is multiplied by scale_amax. Useful for quick experiments. Default None.

  • calib_method – A string, one of [“max”, “histogram”], indicating which calibration to use. Apart from the simple max calibration, all other methods are histogram based. Default “max”.

  • unsigned – A boolean. If True, use the unsigned integer range. Default False.


Raises
  • TypeError – If an unsupported type is passed in.
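To make the `axis` keyword concrete, here is a minimal plain-Python sketch (not the library's implementation; names and shapes are illustrative only) of how per-tensor versus per-channel amax differ for a weight laid out with the output channel first:

```python
# Illustrative sketch of amax granularity, per the `axis` description above.
# axis=None -> one absolute-max for the whole tensor (per-tensor scale);
# axis=(0,) -> one absolute-max per output channel (per-channel scale).

def amax_per_tensor(weight):
    """Single absolute-max over the whole (nested-list) tensor."""
    return max(abs(v) for row in weight for v in row)

def amax_per_channel(weight):
    """One absolute-max per leading-axis slice (e.g. per K in a KCRS layout)."""
    return [max(abs(v) for v in row) for row in weight]

# A toy 3x4 "weight": 3 output channels, 4 flattened inputs each.
w = [[0.1, -0.5,  0.2, 0.4],
     [1.0, -2.0,  0.3, 0.1],
     [0.2,  0.2, -0.1, 0.05]]

print(amax_per_tensor(w))   # 2.0 -> one scaling factor shared by all channels
print(amax_per_channel(w))  # [0.5, 2.0, 0.2] -> one scaling factor per channel
```

Per-channel scaling avoids the small-magnitude channels (here the first and third) being crushed by the largest channel's range.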

Read-only properties:
  • fake_quant:

  • name:

  • learn_amax:

  • scale_amax:

  • axis:

  • calib_method:

  • num_bits:

  • amax:

  • unsigned:


dict()[source]

Serialize to dict

The built-in __dict__ attribute returns all attributes, including those holding default values and those with the protected prefix “_”. This method returns only attributes whose values differ from the defaults and whose keys do not carry the “_” prefix. Constructing an instance from the dict returned by this method should yield exactly the same instance.
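The serialization rule above can be sketched in plain Python. `Desc` below is a hypothetical stand-in for the descriptor, not the library's ScaledQuantDescriptor; it only demonstrates the "non-default values, no underscore prefix" contract:

```python
# Sketch of the serialize-to-dict rule: emit only attributes that differ
# from their defaults, with the protected "_" prefix stripped, so that
# Desc(**d.dict()) reconstructs an identical descriptor.

DEFAULTS = {"num_bits": 8, "fake_quant": True, "axis": None, "unsigned": False}

class Desc:
    def __init__(self, **kwargs):
        for key, default in DEFAULTS.items():
            setattr(self, "_" + key, kwargs.get(key, default))

    def dict(self):
        return {k: getattr(self, "_" + k)
                for k, default in DEFAULTS.items()
                if getattr(self, "_" + k) != default}

d = Desc(num_bits=4, axis=(0,))
print(d.dict())  # {'num_bits': 4, 'axis': (0,)} -- defaults are omitted
assert Desc(**d.dict()).dict() == d.dict()  # round-trip is lossless
```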

classmethod from_yaml(yaml_str)[source]

Create a descriptor from a YAML string


Create yaml serialization Some attributes need special treatment to have human readable form, including amax, axis.


class pytorch_quantization.tensor_quant.TensorQuantFunction(*args, **kwargs)[source]

A universal tensor quantization function

Takes an input tensor and outputs a quantized tensor. The granularity of the scale can be interpreted from the shape of amax. output_dtype indicates whether the quantized value is stored as an integer or a float. The reason to store it in float is that the PyTorch functions which consume the quantized value, e.g. Conv2d, may not accept integer input.

It uses 2^num_bits - 1 values instead of 2^num_bits, e.g., for num_bits=8, it uses [-127, 127] instead of [-128, 127].
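The narrow-range scheme above can be sketched in plain Python. This mirrors the documented math for a single scalar (the real function operates element-wise on torch tensors); it is not the library's kernel:

```python
# Sketch of narrow-range symmetric quantization: 2^num_bits - 1 values,
# i.e. [-127, 127] for num_bits=8, with scale = (2^(num_bits-1) - 1) / amax.

def quantize(x, amax, num_bits=8):
    bound = 2 ** (num_bits - 1) - 1               # 127 for num_bits=8
    scale = bound / amax                          # documented scaling factor
    q = max(-bound, min(bound, round(x * scale))) # round, then clip to range
    return q, scale

q, scale = quantize(0.5, amax=1.0)
print(q)          # 64
print(q / scale)  # ~0.504 -- dividing by scale dequantizes
print(quantize(2.0, amax=1.0)[0])  # 127 -- values beyond amax saturate
```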

static backward(ctx, grad_outputs, grad_scale)[source]

Implements straight through estimation with clipping. For -amax <= input <= amax the gradient passes straight through, otherwise the gradient is zero.

Parameters
  • ctx – A Context object with saved tensors from forward.

  • grad_outputs – A tensor of gradient of outputs.

  • grad_scale – A tensor of gradient of scale.


Returns
  grad_inputs – A tensor of gradient.
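The clipped straight-through estimator described above can be sketched in plain Python over flat lists (the real implementation works on torch tensors inside autograd; this is illustrative only):

```python
# Sketch of straight-through estimation with clipping: the gradient passes
# through unchanged where -amax <= input <= amax, and is zeroed elsewhere.

def ste_backward(inputs, grad_outputs, amax):
    return [g if -amax <= x <= amax else 0.0
            for x, g in zip(inputs, grad_outputs)]

inputs = [-2.0, -0.5, 0.0, 0.9, 3.0]
grads  = [ 1.0,  1.0, 1.0, 1.0, 1.0]
print(ste_backward(inputs, grads, amax=1.0))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

Zeroing the gradient outside [-amax, amax] matches the forward clipping: once an input saturates, changing it slightly no longer changes the output.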

static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True)[source]

Following the TensorFlow convention, the max value is passed in and used to decide the scale, instead of inputting the scale directly, though inputting the scale directly may be more natural to use.

Parameters
  • ctx – A Context object to store tensors for backward.

  • inputs – A Tensor of type float32.

  • amax – A Tensor of type float32. Inputs will be quantized within the range [-amax, amax]; amax will be broadcast to the inputs tensor.

  • num_bits – An integer used to calculate the scaling factor, scale = (2^(num_bits-1) - 1) / amax. Effectively, it indicates how many integer bits are used to represent the value. Default 8.

  • output_dtype – A type of Tensor. torch.int32 or torch.float32.

  • unsigned – A boolean. Use unsigned integer range. E.g. [0, 255] for num_bits=8. Default False.

  • narrow_range – A boolean. Use symmetric integer range for signed quantization E.g. [-127,127] instead of [-128,127] for num_bits=8. Default True.


Returns
  • outputs – A Tensor of type output_dtype.

  • scale – A Tensor of type float32. outputs / scale will dequantize the outputs tensor.



tensor_quant is an alias of TensorQuantFunction.apply

fake_tensor_quant is an alias of FakeTensorQuantFunction.apply
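The difference between the two aliases can be sketched in plain Python for a single scalar. Per the descriptions above, tensor_quant returns (quantized values, scale), while fake quantization returns the quantize-dequantize result directly in float; the sketches below are illustrative stand-ins, not the library functions, which operate on torch tensors:

```python
# Sketch: true quantization returns integers plus a scale; fake quantization
# folds the round trip into one float result (handy inside float-only ops).

def tensor_quant_sketch(x, amax, num_bits=8):
    bound = 2 ** (num_bits - 1) - 1
    scale = bound / amax
    q = max(-bound, min(bound, round(x * scale)))
    return q, scale

def fake_tensor_quant_sketch(x, amax, num_bits=8):
    q, scale = tensor_quant_sketch(x, amax, num_bits)
    return q / scale  # quantize then dequantize in one step

print(tensor_quant_sketch(0.3, 1.0))       # (38, 127.0)
print(fake_tensor_quant_sketch(0.3, 1.0))  # ~0.2992 -- float close to input
```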