Quantization Aware Training

TAO supports Quantization-Aware Training (QAT) for the following TensorFlow2 networks: EfficientDet-Tf2 (object detection) and Classification.

Note

The TensorFlow2 finetuning pipelines (EfficientDet-Tf2 and Classification) referenced here are deprecated as of TAO 6.25.09. See the Deprecations section for details and fallback guidance. The QAT workflow described below is retained for reference for these pipelines.

Quantization-Aware Training emulates inference-time quantization while training a model, which downstream inference platforms can then use to generate an actual quantized model. The error from quantizing weights and tensors to INT8 is modeled during training, allowing the model to adapt to and mitigate that error. During QAT, the model constructed in the training graph is modified to:

  1. Replace existing nodes with nodes that support fake quantization of their weights.

  2. Convert existing activations to ReLU-6 (except the output nodes).

  3. Add Quantize and De-Quantize (QDQ) nodes to compute the dynamic ranges of the intermediate tensors.
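The effect of these modifications on a single tensor can be illustrated with a minimal sketch. It is not TAO's implementation: the function names `fake_quantize` and `relu6` are hypothetical, and symmetric per-tensor quantization to the range [-127, 127] is an assumption made for illustration. The sketch clips activations with ReLU-6, derives a dynamic range from them, and then quantizes and immediately de-quantizes the values so that the rounding error becomes visible in floating point, which is what allows training to adapt to it.

```python
def relu6(values):
    """Clip activations to [0, 6], as ReLU-6 does."""
    return [min(max(v, 0.0), 6.0) for v in values]


def fake_quantize(values, dynamic_range, num_bits=8):
    """Quantize to signed integer levels and immediately de-quantize.

    Assumes symmetric, per-tensor quantization (an illustrative choice).
    """
    qmax = 2 ** (num_bits - 1) - 1      # 127 for INT8
    scale = dynamic_range / qmax        # floating-point step between levels
    result = []
    for v in values:
        q = round(v / scale)            # nearest integer level
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        result.append(q * scale)        # map back to floating point
    return result


activations = relu6([-1.5, 0.02, 3.3, 7.9])
dynamic_range = max(abs(v) for v in activations)   # largest magnitude seen
scale = dynamic_range / 127
quantized = fake_quantize(activations, dynamic_range)

# Quantization error per element; bounded by half a quantization step.
errors = [abs(a - q) for a, q in zip(activations, quantized)]
print(quantized)
print(max(errors) <= scale / 2 + 1e-12)
```

In this toy setting the dynamic range is taken from a single batch of values. During actual training the QDQ nodes track the ranges across the training run, and those ranges are what get exported.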

The dynamic ranges computed during training are serialized to a cache file at export, which NVIDIA® TensorRT can then parse to create an optimized inference engine. To enable QAT during training, set the enable_qat parameter to true in the training_config field of the specification file for the supported network. QAT usually yields better accuracy for INT8 inference with TensorRT than traditional calibration-based INT8 inference.

Note

The cache file contains fewer scales than the one generated by Post Training Quantization (PTQ) with TensorRT. This is because QDQ nodes are added only after operations that TensorRT fuses on the GPU, e.g., operation sequences such as Conv2d -> Bias -> Relu or Conv2d -> Bias -> BatchNormalization -> Activation, whereas PTQ applies scales to all intermediate tensors in the model. Also, the final output regression nodes are not quantized in the current training graphs, so these layers currently run in FP32.
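The difference in scale counts can be pictured with a toy model. This is a sketch only: the flat list of operation names, the `FUSED_PATTERNS` table, and both counting functions are hypothetical and do not reflect how TAO or TensorRT represent a graph. It assumes one scale per operation output for PTQ and one scale per fused sequence for QAT, and (as a further assumption) treats any operation outside a listed pattern as its own group.

```python
# Fused sequences named in the note above; the longer pattern is listed first.
FUSED_PATTERNS = [
    ("Conv2d", "Bias", "BatchNormalization", "Activation"),
    ("Conv2d", "Bias", "Relu"),
]


def count_scales_ptq(ops):
    """PTQ-style count: one scale for every intermediate tensor."""
    return len(ops)


def count_scales_qat(ops):
    """QAT-style count: one scale after each fused sequence."""
    count = 0
    i = 0
    while i < len(ops):
        for pattern in FUSED_PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                i += len(pattern)   # skip the whole fused sequence
                break
        else:
            i += 1                  # assumption: an unfused op forms its own group
        count += 1
    return count


ops = [
    "Conv2d", "Bias", "Relu",
    "Conv2d", "Bias", "BatchNormalization", "Activation",
]
print(count_scales_ptq(ops), count_scales_qat(ops))
```

For this seven-operation list the PTQ-style count is 7 and the QAT-style count is 2, which mirrors why a QAT cache file is shorter than a PTQ one for the same model.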

Note

When deploying a model on platforms that have a DLA, note that quantization cache files generated by peeling the scales from the model are currently not supported, because the DLA requires a scale factor for every layer. To use a QAT-trained model with the DLA, we recommend using Post Training Quantization at export. This method takes the QAT-trained model and generates scale factors for all intermediate tensors in the model, because the DLA does not fuse operations the way the GPU does. For more information, see the Exporting the Model section of each app.

The recommended workflow for training a Quantization Aware model is depicted in the diagram below.

../_images/tao_cv_qat_workflow_all_networks.png