These are the TensorRT 8.2.1 release notes. They apply to x86 Linux and Windows users, as well as to Server Base System Architecture (SBSA) users with ARM® based CPU cores, on Linux only.
These release notes apply to workstation, server, and NVIDIA JetPack™ users unless an item is specifically marked with (not applicable for
Jetson platforms).
This release includes several fixes from the previous TensorRT 8.x.x release as well
as the following additional changes. For previous TensorRT documentation, see the
NVIDIA TensorRT Archived
Documentation.
Key Features And Enhancements
This TensorRT release includes the following key features and enhancements.
- WSL (Windows Subsystem for Linux) 2 is released as a preview feature in this
TensorRT 8.2.1 GA release.
Deprecated API Lifetime
- APIs deprecated prior to TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least
11/2022.
Refer to the API documentation (C++, Python) for how to update your code
to remove the use of deprecated features.
Compatibility
- TensorRT 8.2.1 has been tested with the following:
- This TensorRT release supports NVIDIA CUDA®:
- It is suggested that you use TensorRT with a software stack that has been
tested, including the cuDNN and cuBLAS versions documented in the Features For Platforms And
Software section. Other semantically compatible releases of cuDNN
and cuBLAS can be used; however, other versions may bring performance
improvements as well as regressions. In rare cases, functional regressions might
also be observed.
Limitations
- DLA does not support hybrid precision for the pooling layer: the data type of the input
and output tensors must match the layer precision, that is, either all
INT8 or all FP16.
Deprecated And Removed Features
The following features are deprecated in TensorRT 8.2.1:
- BuilderFlag::kSTRICT_TYPES is deprecated. Its functionality
has been split into separate controls for precision constraints,
reformat-free I/O, and failure of
IAlgorithmSelector::selectAlgorithms. This change
enables users who need only one of the subfeatures to build engines without
encumbering the optimizer with the others. In particular,
precision constraints are sometimes necessary for engine accuracy; however,
reformat-free I/O risks slowing down an engine with no benefit. For more
information, refer to BuilderFlags (C++, Python).
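For example, a minimal sketch of the separated controls, assuming the TensorRT 8.2 flag names kOBEY_PRECISION_CONSTRAINTS, kPREFER_PRECISION_CONSTRAINTS, kDIRECT_IO, and kREJECT_EMPTY_ALGORITHMS:
// Enable only the subfeature you need instead of the former kSTRICT_TYPES.
config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);      // enforce precision constraints
// config->setFlag(nvinfer1::BuilderFlag::kPREFER_PRECISION_CONSTRAINTS); // best-effort variant
// config->setFlag(nvinfer1::BuilderFlag::kDIRECT_IO);                    // reformat-free I/O
// config->setFlag(nvinfer1::BuilderFlag::kREJECT_EMPTY_ALGORITHMS);      // fail when selectAlgorithms returns no tactics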
- The LSTM plugin has been removed. In addition, the Persistent LSTM Plugin
section has also been removed from the TensorRT Developer Guide.
Announcements
- The sample sample_reformat_free_io has been renamed to sample_io_formats,
and revised to remove the deprecated flag
BuilderFlag::kSTRICT_TYPES. Reformat-free I/O is still
available with BuilderFlag::kDIRECT_IO, but should generally
be avoided since it can result in a slower-than-necessary engine, and
can cause a build to fail if the target platform lacks the kernels needed to
build an engine with reformat-free I/O.
- The NVIDIA TensorRT Release Notes PDF will no longer be available in the
product package after this release. The release notes will remain
available online here.
Known Issues
Functional
- TensorRT attempts to catch GPU memory allocation failures and avoid profiling
tactics whose memory requirements would trigger an out-of-memory condition. However,
on some platforms CUDA cannot handle GPU memory allocation failures gracefully,
which leads to an unrecoverable application state. If this
happens, consider lowering the specified workspace size if a large size is
set, or using the IAlgorithmSelector interface to avoid
tactics that require a lot of GPU memory.
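As a rough illustration, a minimal IAlgorithmSelector sketch that filters out memory-hungry tactics; the class name and the 256 MiB budget are hypothetical:
struct WorkspaceCapSelector : public nvinfer1::IAlgorithmSelector
{
    // Keep only tactics whose reported workspace fits a hypothetical 256 MiB budget.
    int32_t selectAlgorithms(nvinfer1::IAlgorithmContext const& context,
        nvinfer1::IAlgorithm const* const* choices, int32_t nbChoices,
        int32_t* selection) noexcept override
    {
        int32_t nbSelected = 0;
        for (int32_t i = 0; i < nbChoices; ++i)
        {
            if (choices[i]->getWorkspaceSize() <= (256ULL << 20))
            {
                selection[nbSelected++] = i;
            }
        }
        return nbSelected; // returning 0 lets TensorRT fall back to its default choice
    }
    void reportAlgorithms(nvinfer1::IAlgorithmContext const* const*,
        nvinfer1::IAlgorithm const* const*, int32_t) noexcept override {}
};
// Attach the selector to the builder configuration:
// WorkspaceCapSelector selector;
// config->setAlgorithmSelector(&selector);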
- TensorRT may experience some instabilities when running networks containing
TopK layers on T4 under Azure VM. To work around this issue, disable
CUBLAS_LT kernels with
--tacticSources=-CUBLAS_LT (setTacticSources).
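The setTacticSources equivalent in C++ is sketched below, assuming config is an existing IBuilderConfig*:
// Disable the cuBLASLt tactic source while keeping the others enabled.
auto sources = config->getTacticSources();
sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS_LT));
config->setTacticSources(sources);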
- Under certain conditions on WSL2, an INetwork with
Convolution layers that can be horizontally fused before a Concat layer may
trigger an internal error that crashes the application while building the
engine. As a workaround, build your network on Linux instead of WSL2.
- When running ONNX models with dynamic shapes, there is a potential accuracy
issue if the dimension names of inputs that are expected to be the same
do not match. For example, if a model has two 2D inputs whose dimension
semantics are both batch and seqlen, but
the two inputs have different dimension names in the ONNX model, there
is a potential accuracy issue when running with dynamic shapes. Ensure
that the dimension names match when exporting ONNX models from
frameworks.
- There is a known functional issue (fails with a CUDA error during
compilation) with networks using ILoop layers on the WSL
platform.
- The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA
10.x. If selected, it will fall back to using cuBLAS. (not applicable for
Jetson platforms)
- For some networks with large amounts of weights and activation data, DLA may
fail to compile a subgraph, and that subgraph will fall back to the GPU.
- Under some conditions, RNNv2Layer can require a larger
workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all
supported tactics. Consider increasing the workspace size to work around
this issue.
- CUDA graph capture will capture inputConsumed and profiler
events only when using the CUDA 11.x build of TensorRT together with a
CUDA 11.1 or later driver (r455 or above).
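For reference, a typical capture sequence looks like the following sketch; stream, bindings, and inputConsumedEvent are assumed to exist, and error checking is omitted:
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
// The inputConsumed event below is only captured on 11.x builds with an r455+ driver.
context->enqueueV2(bindings, stream, &inputConsumedEvent);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(graphExec, stream);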
- On integrated GPUs, a memory tracking issue in TensorRT 8.0 that was
artificially restricting the amount of available memory has been fixed. A
side effect is that the TensorRT optimizer is able to choose layer
implementations that use more memory, which can cause the OOM Killer to
trigger for networks where it previously didn't. To work around this
problem, use the IAlgorithmSelector interface to avoid
layer implementations that require a lot of memory, or use the layer
precision API to reduce precision of large tensors and use
STRICT_TYPES, or reduce the size of the input tensors
to the builder by reducing batch or other higher dimensions.
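A sketch of the layer-precision approach mentioned above, assuming layer is the ILayer* producing the large tensors and config is an IBuilderConfig*:
layer->setPrecision(nvinfer1::DataType::kHALF);     // compute the layer in FP16
layer->setOutputType(0, nvinfer1::DataType::kHALF); // keep its output in FP16 as well
config->setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS); // or the deprecated kSTRICT_TYPES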
- For some transformer based networks built with PyTorch Multi-head Attention
API, the performance may be up to 45% slower than similar
networks built with other APIs due to different graph patterns.
- TensorRT bundles a version of libnvptxcompiler_static.a
inside libnvinfer_static.a. If an application links with a
different version of PTXJIT than the version used to build TensorRT, it may
lead to symbol conflicts or undesired behavior.
- Installing the cuda-compat-11-4 package may interfere with
CUDA enhanced compatibility and cause TensorRT to fail even when the driver
is r465. The workaround is to remove the cuda-compat-11-4
package or upgrade the driver to r470. (not applicable for Jetson
platforms)
- TensorFlow 1.x is not supported for Python 3.9. Any Python samples that
depend on TensorFlow 1.x cannot be run with Python 3.9.
- TensorRT has limited support for fusing IConstantLayer and
IShuffleLayer. In explicit-quantization mode, the
weights of Convolution and Fully-Connected layers must be fused. Therefore,
if a shuffle applied to the weights is not supported, it may not be possible
to quantize the layer.
- For DLA networks where a convolution layer consumes an NHWC network input,
the compute precision of the convolution layer must match the data type of
the input tensor.
- Hybrid precision is not supported with the Pooling layer. Data type of input
and output tensors should be the same as the layer precision.
- When installing PyCUDA, NumPy must be installed first and as a separate
step:
python3 -m pip install numpy
python3 -m pip install pycuda
For more information, refer to the NVIDIA TensorRT Installation Guide.
- When running the Python engine_refit_mnist, network_api_pytorch_mnist, or
onnx_packnet samples, you may encounter Illegal instruction (core
dumped) when using the CPU version of PyTorch on Jetson TX2.
The workaround is to install a GPU enabled version of PyTorch as per the
instructions in the sample READMEs.
- If an IPluginV2 layer produces kINT8 outputs that are
output tensors of an INetworkDefinition that have
floating-point type, an explicit cast is required to convert the network
outputs back to a floating point format. For
example:
// out_tensor is of type nvinfer1::DataType::kINT8
auto cast_input = network->addIdentity(*out_tensor);
cast_input->setOutputType(0, nvinfer1::DataType::kFLOAT);
auto new_out_tensor = cast_input->getOutput(0);
- Intermittent accuracy issues are observed in sample_mnist with INT8 precision
on WSL2.
- The Debian and RPM packages for the Python bindings, UFF, GraphSurgeon, and
ONNX-GraphSurgeon wheels do not install their dependencies automatically.
When installing them, ensure that you install the dependencies manually using
pip, or install the wheels instead.
- You may see the following error:
Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
object file: No such file or directory
after installing TensorRT from the network repo. cuDNN depends on the RPM
dependency libcublas.so.11()(64bit), however, this
dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest
CUDA release. The library search path will not be set up correctly and cuDNN
will be unable to find the cuBLAS libraries. The workaround is to install
the latest libcublas-11-x package manually.
- There is a known issue on Windows with the Python sample uff_ssd when
converting the frozen TensorFlow graph into UFF. You can generate the UFF
model on Linux or in a container and copy it over to work around this issue.
Once generated, copy the UFF file to
\path\to\samples\python\uff_ssd\models\ssd_inception_v2_coco_2017_11_17\frozen_inference_graph.uff.
- ONNX models with MatMul operations that use QuantizeLinear/DequantizeLinear
operations to quantize the weights, and pre-transpose the weights (i.e. do
not use a separate Transpose operation) will suffer from accuracy errors due
to a bug in the quantization process.
Performance
- There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6
on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16
mode.
- There is an up to 15% performance regression for networks with a Pooling
layer located before or after a Concatenate layer.
- There is a performance regression compared to TensorRT 7.1 for some networks
dominated by FullyConnected with activation and bias operations:
- up to 12% in FP32 mode. This will be fixed in a future release.
- up to 10% in FP16 mode on NVIDIA Maxwell® and
Pascal GPUs.
- There is an up to 8% performance regression compared to TensorRT 7.1 for
some networks with heavy FullyConnected operation like VGG16 on NVIDIA
Jetson Nano™.
- There is an up to 10-11% performance regression on Xavier:
- compared to TensorRT 7.2.3 for ResNet-152 with batch size 2 in
FP16.
- compared to TensorRT 6 for ResNeXt networks with small batch (1 or
2) in FP32.
- For networks that use deconvolution with a large kernel size, the engine build time
can increase significantly for this layer on Xavier. It can also lead to the
launch timed out and was terminated error message on Jetson
Nano/TX1.
- There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet
with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA
11.0. (not applicable for Jetson platforms)
- There is an up to 10% performance regression compared to TensorRT 7.2.3 in
JetPack 4.5 for ResNet-like networks on NVIDIA DLA when the dynamic ranges
of the inputs of the ElementWise ADD layers are different.
This is due to a fix for a bug in DLA where it ignored the dynamic range of
the second input of the ElementWise ADD layers and caused
some accuracy issues.
- DLA automatically upgrades INT8 LeakyReLU layers to FP16 to preserve
accuracy. Thus, latency may be worse compared to an equivalent network using
a different activation, like ReLU. To mitigate this, you can prevent
LeakyReLU layers from running on DLA.
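A minimal sketch, assuming config is an IBuilderConfig* and layer is the LeakyReLU ILayer* in question:
// Pin the LeakyReLU layer to the GPU so DLA never upgrades it to FP16.
config->setDeviceType(layer, nvinfer1::DeviceType::kGPU);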
- The builder may require up to 60% more memory to build an engine.
- There is an up to 126% performance drop when running some ConvNets on DLA in
parallel to the other DLA and the iGPU on Xavier platforms, compared to
running on DLA alone.
- There is an up to 21% performance drop compared to TensorRT 8.0 for
SSD-Inception2 networks on NVIDIA Volta GPUs.
- There is an up to 5% performance drop for networks using sparsity in FP16
precision.
- There is an up to 25% performance drop for networks using the
InstanceNorm plugin. This issue is being
investigated.
- The engine build time for networks using 3D convolution, like 3d_unet,
is up to 500% longer compared to TensorRT 8.0 because many fast kernels
were added, which increases the profiling time.