Step #2: Optimize Model With TAO – Prune, Quantize, TensorRT Conversion
Throughout this lab, you will use the links in the left-hand navigation pane.
The key objective of this lab is to familiarize yourself with optimizations available in TAO, such as model pruning and quantization. You will compare the inference performance of the optimized model against the unoptimized model.
More information on model pruning can be found at the following link: Technical Blog: Pruning Models
More information on model quantization can be found at the following link: Technical Blog: Improving INT8 Accuracy Using Quantization
Open and run through the YOLO optimization notebook by clicking the Jupyter Notebook link in the left-hand navigation pane, then selecting and running lab2-yolo_optimization.ipynb in the tutorial folder.
Import the saved data, including the trained model, from Step 1, and run inference on the model.
Export the trained model in FP32 format. The model can be exported in FP32, FP16, or INT8 precision. The result is a model in .etlt format, which must then be built into a TensorRT engine file before inference can be run.
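TAO and TensorRT handle INT8 calibration internally, but the core idea behind INT8 quantization can be sketched in a few lines. The snippet below is a conceptual illustration only, not TAO's implementation; the helper names `quantize_int8` and `dequantize` are hypothetical:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map FP32 values onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0                      # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate FP32 tensor from its INT8 representation."""
    return q.astype(np.float32) * scale

# A stand-in weight tensor; real calibration uses activation statistics too.
weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# INT8 storage is 4x smaller than FP32, at the cost of a bounded rounding error.
print("max quantization error:", np.abs(weights - recovered).max())
```

This illustrates why INT8 engines are faster and smaller but can lose a little accuracy: every weight is rounded to one of 255 levels, which is why the calibration techniques described in the blog post linked above matter.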
Run inference to get baseline numbers. Compare the inference time of the unoptimized model with that of the exported FP32 model.
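The notebook reports these timings for you, but the measurement pattern is worth understanding. Below is a hypothetical timing harness, not the notebook's actual code; the lambdas stand in for whatever inference call the engine wrapper exposes, and the warm-up runs exclude one-time costs such as lazy initialization:

```python
import time
import numpy as np

def mean_latency_ms(infer_fn, batch, warmup=5, runs=50):
    """Average per-batch inference latency in milliseconds over timed runs."""
    for _ in range(warmup):                  # warm-up iterations are not timed
        infer_fn(batch)
    start = time.perf_counter()
    for _ in range(runs):
        infer_fn(batch)
    return (time.perf_counter() - start) / runs * 1000.0

# Stand-in "models" that simulate different compute costs.
batch = np.random.rand(1, 3, 384, 1248).astype(np.float32)
baseline = lambda x: np.tanh(x).sum()        # pretend unoptimized model
optimized = lambda x: x.sum()                # pretend optimized model
print("baseline  ms:", mean_latency_ms(baseline, batch))
print("optimized ms:", mean_latency_ms(optimized, batch))
```

Averaging over many runs after a warm-up is what makes the baseline-versus-optimized comparison in the notebook meaningful; a single timed call can be dominated by startup overhead.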
Prune the model to reduce its size and accelerate inference. Pruning removes parameters from the model, shrinking it without compromising its integrity.
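The threshold idea behind pruning can be sketched as follows. Note this is a simplified, unstructured illustration: TAO's pruning is structured (it removes whole channels whose norm falls below a threshold), whereas this sketch zeroes individual weights. The function name `prune_by_magnitude` is hypothetical:

```python
import numpy as np

def prune_by_magnitude(weights: np.ndarray, pth: float) -> np.ndarray:
    """Zero out weights whose absolute value falls below the threshold pth."""
    mask = np.abs(weights) >= pth
    return weights * mask

# A stand-in weight matrix from a hypothetical layer.
w = np.random.randn(64, 64).astype(np.float32)
pruned = prune_by_magnitude(w, pth=0.5)
sparsity = 1.0 - np.count_nonzero(pruned) / pruned.size
print(f"fraction of weights removed: {sparsity:.2%}")
```

A higher threshold removes more parameters, giving a smaller and faster model but a larger initial accuracy drop, which is why the retraining step that follows is needed.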
Retrain the pruned model to recover lost accuracy. Once the model has been pruned, there may be a slight decrease in accuracy because some previously useful weights have been removed. To regain accuracy, retrain the pruned model on the same dataset.
Run inference on the pruned model and compare the numbers with the baseline. You should observe the same accuracy with a significantly smaller model.