These are the TensorRT 8.2.1 release notes. They apply to x86 Linux and Windows users, as well as to Server Base System Architecture (SBSA) users with ARM® based CPU cores, on Linux only.
These release notes apply to workstation, server, and NVIDIA JetPack™ users unless an item is specifically marked with (not applicable for
Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well
as the following additional changes. For previous TensorRT documentation, see the
NVIDIA TensorRT Archived
Documentation.
Key Features And Enhancements
This TensorRT release includes the following key features and enhancements.
- WSL (Windows Subsystem for Linux) 2 is released as a preview feature in this
TensorRT 8.2.1 GA release.
Deprecated API Lifetime
- APIs deprecated prior to TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least
11/2022.
Refer to the API documentation (C++, Python) for how to update your code
to remove the use of deprecated features.
Compatibility
- TensorRT 8.2.1 has been tested with the following:
- This TensorRT release supports NVIDIA CUDA®:
- It is suggested that you use TensorRT with a software stack that has been
tested, including the cuDNN and cuBLAS versions documented in the Features For Platforms And
Software section. Other semantically compatible releases of cuDNN
and cuBLAS can be used; however, other versions may bring performance
improvements as well as regressions. In rare cases, functional regressions might
also be observed.
Limitations
- DLA does not support hybrid precision for the pooling layer: the data type of the input
and output tensors must match the layer precision, that is, either all
INT8 or all FP16.
Deprecated And Removed Features
The following features are deprecated in TensorRT 8.2.1:
- BuilderFlag::kSTRICT_TYPES is deprecated. Its functionality
has been split into separate controls for precision constraints,
reformat-free I/O, and failure of
IAlgorithmSelector::selectAlgorithms. This change
enables users who need only one of the subfeatures to build engines without
encumbering the optimizer with the others. In particular,
precision constraints are sometimes necessary for engine accuracy; however,
reformat-free I/O risks slowing down an engine with no benefit. For more
information, refer to BuilderFlags (C++, Python).
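For example, a minimal sketch of the separated controls, assuming the TensorRT 8.2 flag names kOBEY_PRECISION_CONSTRAINTS, kPREFER_PRECISION_CONSTRAINTS, kDIRECT_IO, and kREJECT_EMPTY_ALGORITHMS:
// Enable only the subfeature you need instead of the former kSTRICT_TYPES.
config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);      // enforce precision constraints
// config->setFlag(nvinfer1::BuilderFlag::kPREFER_PRECISION_CONSTRAINTS); // best-effort variant
// config->setFlag(nvinfer1::BuilderFlag::kDIRECT_IO);                    // reformat-free I/O
// config->setFlag(nvinfer1::BuilderFlag::kREJECT_EMPTY_ALGORITHMS);      // fail when selectAlgorithms returns no tactics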
- The LSTM plugin has been removed. In addition, the Persistent LSTM Plugin
section has also been removed from the TensorRT Developer Guide.
Announcements
- The sample sample_reformat_free_io has been renamed to sample_io_formats,
and revised to remove the deprecated flag
BuilderFlag::kSTRICT_TYPES. Reformat-free I/O is still
available with BuilderFlag::kDIRECT_IO, but should generally
be avoided since it can result in a slower-than-necessary engine, and
can cause a build to fail if the target platform lacks the kernels needed to
build an engine with reformat-free I/O.
- The NVIDIA TensorRT Release Notes PDF will no longer be available in the
product package after this release. The release notes will remain
available online here.
Known Issues
Functional
- TensorRT attempts to catch GPU memory allocation failures and avoid profiling
tactics whose memory requirements would trigger an out-of-memory condition. However,
on some platforms CUDA cannot handle GPU memory allocation failures gracefully,
which leads to an unrecoverable application state. If this
happens, consider lowering the specified workspace size if a large size is
set, or using the IAlgorithmSelector interface to avoid
tactics that require a lot of GPU memory.
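As a rough illustration, a minimal IAlgorithmSelector sketch that filters out memory-hungry tactics; the class name and the 256 MiB budget are hypothetical:
struct WorkspaceCapSelector : public nvinfer1::IAlgorithmSelector
{
    // Keep only tactics whose reported workspace fits a hypothetical 256 MiB budget.
    int32_t selectAlgorithms(nvinfer1::IAlgorithmContext const& context,
        nvinfer1::IAlgorithm const* const* choices, int32_t nbChoices,
        int32_t* selection) noexcept override
    {
        int32_t nbSelected = 0;
        for (int32_t i = 0; i < nbChoices; ++i)
        {
            if (choices[i]->getWorkspaceSize() <= (256ULL << 20))
            {
                selection[nbSelected++] = i;
            }
        }
        return nbSelected; // returning 0 lets TensorRT fall back to its default choice
    }
    void reportAlgorithms(nvinfer1::IAlgorithmContext const* const*,
        nvinfer1::IAlgorithm const* const*, int32_t) noexcept override {}
};
// Attach the selector to the builder configuration:
// WorkspaceCapSelector selector;
// config->setAlgorithmSelector(&selector);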
- TensorRT may experience some instabilities when running networks containing
TopK layers on T4 under Azure VM. To work around this issue, disable
CUBLAS_LT kernels with
--tacticSources=-CUBLAS_LT (setTacticSources).
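The setTacticSources equivalent in C++ is sketched below, assuming config is an existing IBuilderConfig*:
// Disable the cuBLASLt tactic source while keeping the others enabled.
auto sources = config->getTacticSources();
sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));
config->setTacticSources(sources);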
- Under certain conditions on WSL2, an INetwork with
Convolution layers that can be horizontally fused before a Concat layer may
trigger an internal error that crashes the application while building the
engine. As a workaround, build your network on Linux instead of WSL2.
- When running ONNX models with dynamic shapes, there is a potential accuracy
issue if the dimension names of inputs that are expected to be the same
do not match. For example, if a model has two 2D inputs whose dimension
semantics are both batch and seqlen, but
the two inputs have different dimension names in the ONNX model, there
is a potential accuracy issue when running with dynamic shapes. Ensure
that the dimension names match when exporting ONNX models from
frameworks.
- There is a known functional issue (fails with a CUDA error during
compilation) with networks using ILoop layers on the WSL
platform.
- The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA
10.x. If selected, it will fall back to using cuBLAS. (not applicable for
Jetson platforms)
- For some networks with large amounts of weights and activation data, DLA may
fail to compile a subgraph, and that subgraph will fall back to the GPU.
- Under some conditions, RNNv2Layer can require a larger
workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all
supported tactics. Consider increasing the workspace size to work around
this issue.
- CUDA graph capture will capture inputConsumed and profiler
events only when using the CUDA 11.x build of TensorRT together with a
CUDA 11.1 or later driver (r455 or above).
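For reference, a typical capture sequence looks like the following sketch; stream, bindings, and inputConsumedEvent are assumed to exist, and error checking is omitted:
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
// The inputConsumed event below is only captured on 11.x builds with an r455+ driver.
context->enqueueV2(bindings, stream, &inputConsumedEvent);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(graphExec, stream);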
- On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was
artificially restricting the amount of available memory has been fixed. A
side effect is that the TensorRT optimizer is able to choose layer
implementations that use more memory, which can cause the OOM Killer to
trigger for networks where it previously didn't. To work around this
problem, use the IAlgorithmSelector interface to avoid
layer implementations that require a lot of memory, or use the layer
precision API to reduce precision of large tensors and use
STRICT_TYPES, or reduce the size of the input tensors
to the builder by reducing batch or other higher dimensions.
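A sketch of the layer-precision approach mentioned above, assuming layer is the ILayer* producing the large tensors and config is an IBuilderConfig*:
layer->setPrecision(nvinfer1::DataType::kHALF);     // compute the layer in FP16
layer->setOutputType(0, nvinfer1::DataType::kHALF); // keep its output in FP16 as well
config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS); // or the deprecated kSTRICT_TYPES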
- For some transformer based networks built with PyTorch Multi-head Attention
API, the performance may be up to 45% slower than similar
networks built with other APIs due to different graph patterns.
- TensorRT bundles a version of libnvptxcompiler_static.a
inside libnvinfer_static.a. If an application links with a
different version of PTXJIT than the version used to build TensorRT, it may
lead to symbol conflicts or undesired behavior.
- Installing the cuda-compat-11-4 package may interfere with
CUDA enhanced compatibility and cause TensorRT to fail even when the driver
is r465. The workaround is to remove the cuda-compat-11-4
package or upgrade the driver to r470. (not applicable for Jetson
platforms)
- TensorFlow 1.x is not supported for Python 3.9. Any Python samples that
depend on TensorFlow 1.x cannot be run with Python 3.9.
- TensorRT has limited support for fusing IConstantLayer and
IShuffleLayer. In explicit-quantization mode, the
weights of Convolution and Fully-Connected layers must be fused. Therefore,
if a shuffle applied to the weights is not supported, it may not be possible
to quantize the layer.
- For DLA networks where a convolution layer consumes an NHWC network input,
the compute precision of the convolution layer must match the data type of
the input tensor.
- Hybrid precision is not supported with the Pooling layer. Data type of input
and output tensors should be the same as the layer precision.
- When installing PyCUDA, NumPy must be installed first and as a separate
step:
python3 -m pip install numpy
python3 -m pip install pycuda
For more information, refer to the NVIDIA TensorRT Installation Guide.
- When running the Python engine_refit_mnist, network_api_pytorch_mnist, or
onnx_packnet samples, you may encounter Illegal instruction (core
dumped) when using the CPU version of PyTorch on Jetson TX2.
The workaround is to install a GPU enabled version of PyTorch as per the
instructions in the sample READMEs.
- If an IPluginV2 layer produces kINT8 outputs that are
output tensors of an INetworkDefinition that have
floating-point type, an explicit cast is required to convert the network
outputs back to a floating point format. For
example:
// out_tensor is of type nvinfer1::DataType::kINT8
auto cast_input = network->addIdentity(*out_tensor);
cast_input->setOutputType(0, nvinfer1::DataType::kFLOAT);
auto new_out_tensor = cast_input->getOutput(0);
- Intermittent accuracy issues are observed in sample_mnist with INT8 precision
on WSL2.
- The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and
ONNX-GraphSurgeon wheels do not install their dependencies automatically.
When installing them, ensure that you install the dependencies manually using
pip, or install the wheels instead.
- You may see the following error:
Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
object file: No such file or directory
after installing TensorRT from the network repo. cuDNN depends on the RPM
dependency libcublas.so.11()(64bit), however, this
dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest
CUDA release. The library search path will not be set up correctly and cuDNN
will be unable to find the cuBLAS libraries. The workaround is to install
the latest libcublas-11-x package manually.
- There is a known issue on Windows with the Python sample uff_ssd when
converting the frozen TensorFlow graph into UFF. You can generate the UFF
model on Linux or in a container and copy it over to work around this issue.
Once generated, copy the UFF file to
\path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
- ONNX models with MatMul operations that use QuantizeLinear/DequantizeLinear
operations to quantize the weights, and pre-transpose the weights (i.e. do
not use a separate Transpose operation) will suffer from accuracy errors due
to a bug in the quantization process.
Performance
- There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6
on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16
mode.
- There is an up to 15% performance regression for networks with a Pooling
layer located before or after a Concatenate layer.
- There is a performance regression compared to TensorRT 7.1 for some networks
dominated by FullyConnected with activation and bias operations:
- up to 12% in FP32 mode. This will be fixed in a future release.
- up to 10% in FP16 mode on NVIDIA Maxwell® and
Pascal GPUs.
- There is an up to 8% performance regression compared to TensorRT 7.1 for
some networks with heavy FullyConnected operation like VGG16 on NVIDIA
Jetson Nano™.
- There is an up to 10-11% performance regression on Xavier:
- compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in
FP16.
- compared to TensorRT 6 for ResNeXt networks with small batch (1 or
2) in FP32.
- For networks that use deconvolution with a large kernel size, the engine build time
can increase significantly for this layer on Xavier. It can also lead to the
launch timed out and was terminated error message on Jetson
Nano/TX1.
- There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet
with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA
11.0. (not applicable for Jetson platforms)
- There is an up to 10% performance regression compared to TensorRT 7.2.3 in
JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges
of the inputs of the ElementWise ADD layers are different.
This is due to a fix for a bug in DLA where it ignored the dynamic range of
the second input of the ElementWise ADD layers and caused
some accuracy issues.
- DLA automatically upgrades INT8 LeakyReLU layers to FP16 to preserve
accuracy. Thus, latency may be worse compared to an equivalent network using
a different activation, like ReLU. To mitigate this, you can prevent
LeakyReLU layers from running on DLA.
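A minimal sketch, assuming config is an IBuilderConfig* and layer is the LeakyReLU ILayer* in question:
// Pin the LeakyReLU layer to the GPU so DLA never upgrades it to FP16.
config->setDeviceType(layer, nvinfer1::DeviceType::kGPU);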
- The builder may require up to 60% more memory to build an engine.
- There is an up to 126% performance drop when running some ConvNets on DLA in
parallel to the other DLA and the iGPU on Xavier platforms, compared to
running on DLA alone.
- There is an up to 21% performance drop compared to TensorRT 8.0 for
SSD-Inception2 networks on NVIDIA Volta GPUs.
- There is an up to 5% performance drop for networks using sparsity in FP16
precision.
- There is an up to 25% performance drop for networks using the
InstanceNorm plugin. This issue is being
investigated.
- The engine build time for networks using 3D convolution, like 3d_unet,
is up to 500% longer compared to TensorRT 8.0 because many fast kernels
were added, which increases the profiling time.