Improving Model Accuracy
Depending on the builder configuration, TensorRT can execute a layer in FP32, FP16, BF16, FP8, or INT8 precision. By default, TensorRT chooses to run each layer in the precision that results in the best performance. Sometimes, this can result in poor accuracy. Generally, running a layer in higher precision helps improve accuracy at the cost of some performance.
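Much of the accuracy impact of reduced precision comes from its narrow dynamic range. As a minimal NumPy illustration (independent of TensorRT), values outside FP16's representable range overflow to Inf or underflow to zero:

```python
import numpy as np

# FP16 has a narrow dynamic range: the largest normal value is 65504 and
# the smallest subnormal is about 6e-8. Intermediate activations outside
# this range overflow to Inf or underflow to zero, which is one way
# reduced precision degrades accuracy.
with np.errstate(all="ignore"):
    overflowed = np.float16(1e5)    # > 65504, becomes inf
    underflowed = np.float16(1e-8)  # below the subnormal range, becomes 0.0

print(overflowed)    # inf
print(underflowed)   # 0.0
```

This is why the FP16/BF16 guidance below may require retraining: the model must keep its intermediate values inside the representable range.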
There are several steps that we can take to improve model accuracy:
Validate layer outputs:
Use Polygraphy to dump layer outputs and verify that there are no NaNs or Infs. The `--validate` option can check for NaNs and Infs. You can also compare layer outputs against golden values from, for example, ONNX Runtime.
For FP16 and BF16, a model might require retraining to ensure that intermediate layer outputs can be represented in FP16/BF16 precision without overflow or underflow.
For INT8, consider recalibrating with a more representative calibration data set. If your model comes from PyTorch, the TensorRT Model Optimizer also provides quantization-aware training (QAT) in the framework, in addition to post-training quantization (PTQ) in TensorRT. You can try both approaches and choose the one with better accuracy.
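The NaN/Inf and golden-value checks above can be sketched with a small helper. This is a hypothetical function (not part of Polygraphy or TensorRT) that inspects one dumped layer output at a time:

```python
import numpy as np

def validate_layer_output(name, output, golden=None, rtol=1e-3, atol=1e-5):
    """Report NaNs/Infs in a dumped layer output and, optionally, its
    deviation from a golden reference (e.g., from ONNX Runtime)."""
    if np.isnan(output).any():
        return f"{name}: contains NaN"
    if np.isinf(output).any():
        return f"{name}: contains Inf"
    if golden is not None and not np.allclose(output, golden, rtol=rtol, atol=atol):
        return f"{name}: diverges from golden values"
    return f"{name}: OK"

# Example: a clean output passes, an overflowed one is flagged.
print(validate_layer_output("conv1", np.array([0.5, 1.5])))     # conv1: OK
print(validate_layer_output("conv2", np.array([np.inf, 1.0])))  # conv2: contains Inf
```

In practice, you would run this over every layer output dumped by Polygraphy and start your investigation at the first layer that fails.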
Manipulate layer precision:
Sometimes, running a layer in a certain precision results in incorrect output. This can be due to inherent layer constraints (for example, `LayerNorm` output should not be INT8) or model constraints (the output diverges, resulting in poor accuracy).
You can control both layer execution precision and layer output precision.
An experimental debug precision tool can help automatically find layers to run with high precision.
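The idea behind such a tool can be sketched as a binary search over layers, assuming accuracy recovers monotonically as more layers are promoted to high precision. This is a simplified model, not the actual tool's algorithm, and `accuracy_ok` is a hypothetical stand-in for an expensive build-and-evaluate step:

```python
def first_good_prefix(num_layers, accuracy_ok):
    """Binary-search for the smallest k such that forcing layers [0, k)
    to FP32 restores acceptable accuracy. accuracy_ok(k) is assumed to
    rebuild the engine with that override and report pass/fail."""
    lo, hi = 0, num_layers
    while lo < hi:
        mid = (lo + hi) // 2
        if accuracy_ok(mid):
            hi = mid        # k = mid suffices; try promoting fewer layers
        else:
            lo = mid + 1    # need more layers in FP32
    return lo

# Example: suppose accuracy recovers once the first 7 layers run in FP32.
print(first_good_prefix(12, lambda k: k >= 7))  # 7
```

Bisection keeps the number of engine rebuilds logarithmic in the layer count, which matters because each probe requires a full build and accuracy evaluation.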
Use the Editable Timing Cache to select a proper tactic.
When accuracy changes between two built engines for the same model, it might be due to a bad tactic being selected for a layer.
Use the Editable Timing Cache to dump the available tactics, then update the cache with a proper one.
Accuracy should not change from run to run; once an engine has been built for a specific GPU, it should produce bit-accurate outputs across multiple runs. If it does not, file a TensorRT bug.
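One way to verify bit-accuracy across runs is to hash the raw output bytes rather than compare with a tolerance. The helper below is an illustrative sketch, not a TensorRT utility:

```python
import hashlib
import numpy as np

def output_digest(output):
    """SHA-256 over the raw bytes of an output tensor, so two engine runs
    can be compared bit-for-bit rather than within a tolerance."""
    return hashlib.sha256(np.ascontiguousarray(output).tobytes()).hexdigest()

# Two bit-identical runs produce identical digests.
run1 = np.array([1.0, 2.0, 3.0], dtype=np.float32)
run2 = np.array([1.0, 2.0, 3.0], dtype=np.float32)
print(output_digest(run1) == output_digest(run2))  # True
```

Any digest mismatch between runs of the same built engine on the same GPU is worth reporting as a bug.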