These are the TensorRT 8.5.1 Release Notes and are applicable to x86 Linux,
Windows, JetPack, and PowerPC Linux users. This release also supports Arm®-based CPU cores for Server Base System Architecture (SBSA) users on
Linux only. This release includes several fixes from the previous TensorRT releases as
well as the following additional changes.
These Release Notes are applicable to workstation, server, and NVIDIA JetPack™ users unless an item is specifically labeled (not applicable for
Jetson platforms).
For previously released TensorRT documentation, refer to the NVIDIA TensorRT Archived
Documentation.
Key Features and Enhancements
This TensorRT release includes the following key features and enhancements.
- Added support for NVIDIA Hopper™ (H100) architecture.
TensorRT now supports compute capability 9.0 deep learning kernels for FP32,
TF32, FP16, and INT8, using the H100 Tensor Cores and delivering increased
MMA throughput over A100. These kernels also benefit from new H100 features
such as Asynchronous Transaction Barriers, Tensor Memory Accelerator (TMA),
and Thread Block Clusters for increased efficiency.
- Added support for NVIDIA Ada Lovelace architecture. TensorRT now supports
compute capability 8.9 deep learning kernels for FP32, TF32, FP16, and
INT8.
- Starting with this release, Ubuntu 22.04 packages are provided in both the
  CUDA network repository and the local repository format.
- Shapes of tensors can now depend on values computed on the GPU. For example,
  the last dimension of the output tensor from INonZeroLayer
  depends on how many input values are non-zero. For more information, refer
  to the Dynamically Shaped Output
  section in the TensorRT Developer Guide. A usage sketch appears after this list.
- TensorRT supports named input dimensions. In an ONNX model, two dimensions
  with the same named dimension parameter are considered equal. For more
  information, refer to the Named Dimensions section in
  the TensorRT Developer Guide. A sketch using the equivalent C++ API appears after this list.
- TensorRT supports offloading the IShuffleLayer to DLA.
Refer to the Layer Support and
Restrictions section in the TensorRT Developer Guide
for details on the restrictions for running IShuffleLayer
on DLA.
- Added and updated the following layers:
  - IGatherLayer, ISliceLayer,
    IConstantLayer, and IConcatenationLayer
    have been updated to support boolean types.
  - INonZeroLayer, INMSLayer (non-maximum
    suppression), IOneHotLayer, and
    IGridSampleLayer have been added.
  For more information, refer to the TensorRT Operator’s
  Reference.
- TensorRT supports heuristic-based builder tactic selection. This is
  controlled by
  nvinfer1::BuilderFlag::kENABLE_TACTIC_HEURISTIC (see the sketch after this
  list). For more information, refer to the Tactic Selection Heuristic
  section in the TensorRT Developer Guide.
- The builder timing cache has been updated to support transformer-based
  networks such as BERT and GPT (see the sketch after this list). For more
  information, refer to the Timing Cache section in the
  TensorRT Developer Guide.
- TensorRT supports the RoiAlign ONNX operator through the
newly added RoiAlign plug-in. Both opset-10 and opset-16
versions of the operator are supported. For more information about the
supported ONNX operators, refer to GitHub.
- TensorRT supports disabling the use of external tactic sources, including
  cuDNN and cuBLAS, in the core library while still allowing plug-ins to use
  cuDNN and cuBLAS. This is controlled by setting the preview feature flag
  nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805
  using IBuilderConfig::setPreviewFeature.
- TensorRT supports a new preview feature,
  nvinfer1::PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805,
  which aims to reduce the build time, runtime, and memory requirements of
  dynamically shaped transformer-based networks.
- TensorRT supports lazy module loading, a CUDA feature that can
  significantly reduce the amount of GPU memory consumed. Refer to the CUDA Programming
  Guide and the Lazy Module Loading section
  in the TensorRT Developer Guide for more information.
- TensorRT supports the persistent cache, a CUDA feature that allows data to
  be cached persistently in L2 (see the sketch after this list). Refer to the
  Persistent Cache Management
  section in the TensorRT Developer Guide for more information.
- TensorRT supports a preview feature API, a mechanism that enables you to
  opt in to specific experimental features (see the sketch after this list).
  Refer to the Preview Features section in
  the TensorRT Developer Guide for more information.
- The following C++ API functions were added:
- ITensor::setDimensionName()
- ITensor::getDimensionName()
- IResizeLayer::setCubicCoeff()
- IResizeLayer::getCubicCoeff()
- IResizeLayer::setExcludeOutside()
- IResizeLayer::getExcludeOutside()
- IBuilderConfig::setPreviewFeature()
- IBuilderConfig::getPreviewFeature()
- ICudaEngine::getTensorShape()
- ICudaEngine::getTensorDataType()
- ICudaEngine::getTensorLocation()
- ICudaEngine::isShapeInferenceIO()
- ICudaEngine::getTensorIOMode()
- ICudaEngine::getTensorBytesPerComponent()
- ICudaEngine::getTensorComponentsPerElement()
- ICudaEngine::getTensorFormat()
- ICudaEngine::getTensorFormatDesc()
- ICudaEngine::getProfileShape()
- ICudaEngine::getNbIOTensors()
- ICudaEngine::getIOTensorName()
- IExecutionContext::getTensorStrides()
- IExecutionContext::setInputShape()
- IExecutionContext::getTensorShape()
- IExecutionContext::setTensorAddress()
- IExecutionContext::getTensorAddress()
- IExecutionContext::setInputTensorAddress()
- IExecutionContext::getOutputTensorAddress()
- IExecutionContext::inferShapes()
- IExecutionContext::setInputConsumedEvent()
- IExecutionContext::getInputConsumedEvent()
- IExecutionContext::setOutputAllocator()
- IExecutionContext::getOutputAllocator()
- IExecutionContext::getMaxOutputSize()
- IExecutionContext::setTemporaryStorageAllocator()
- IExecutionContext::getTemporaryStorageAllocator()
- IExecutionContext::enqueueV3()
- IExecutionContext::setPersistentCacheLimit()
- IExecutionContext::getPersistentCacheLimit()
- IExecutionContext::setNvtxVerbosity()
- IExecutionContext::getNvtxVerbosity()
- INetworkDefinition::addOneHot()
- INetworkDefinition::addNonZero()
- INetworkDefinition::addGridSample()
- INetworkDefinition::addNMS()
- The following C++ classes were added:
- IOneHotLayer
- IGridSampleLayer
- INonZeroLayer
- INMSLayer
- IOutputAllocator
- The following C++ enum values were added:
- InterpolationMode::kCUBIC
- FillOperation::kRANDOM_NORMAL
- BuilderFlag::kREJECT_EMPTY_ALGORITHMS
- BuilderFlag::kENABLE_TACTIC_HEURISTIC
- TacticSource::kJIT_CONVOLUTIONS
- DataType::kUINT8
- The following C++ enum classes were added:
- TensorIOMode
- PreviewFeature
- The following Python API functions/properties were added:
- ITensor.set_dimension_name()
- ITensor.get_dimension_name()
- IResizeLayer.cubic_coeff
- IResizeLayer.exclude_outside
- IBuilderConfig.set_preview_feature()
- IBuilderConfig.get_preview_feature()
- ICudaEngine.get_tensor_shape()
- ICudaEngine.get_tensor_dtype()
- ICudaEngine.get_tensor_location()
- ICudaEngine.is_shape_inference_io()
- ICudaEngine.get_tensor_mode()
- ICudaEngine.get_tensor_bytes_per_component()
- ICudaEngine.get_tensor_components_per_element()
- ICudaEngine.get_tensor_format()
- ICudaEngine.get_tensor_format_desc()
- ICudaEngine.get_tensor_profile_shape()
- ICudaEngine.num_io_tensors
- ICudaEngine.get_tensor_name()
- IExecutionContext.get_tensor_strides()
- IExecutionContext.set_input_shape()
- IExecutionContext.get_tensor_shape()
- IExecutionContext.set_tensor_address()
- IExecutionContext.get_tensor_address()
- IExecutionContext.infer_shapes()
- IExecutionContext.set_input_consumed_event()
- IExecutionContext.get_input_consumed_event()
- IExecutionContext.set_output_allocator()
- IExecutionContext.get_output_allocator()
- IExecutionContext.get_max_output_size()
- IExecutionContext.temporary_allocator
- IExecutionContext.execute_async_v3()
- IExecutionContext.persistent_cache_limit
- IExecutionContext.nvtx_verbosity
- INetworkDefinition.add_one_hot()
- INetworkDefinition.add_non_zero()
- INetworkDefinition.add_grid_sample()
- INetworkDefinition.add_nms()
- The following Python classes were added:
- IOneHotLayer
- IGridSampleLayer
- INonZeroLayer
- INMSLayer
- IOutputAllocator
- The following Python enum values were added:
- InterpolationMode.CUBIC
- FillOperation.RANDOM_NORMAL
- BuilderFlag.REJECT_EMPTY_ALGORITHMS
- BuilderFlag.ENABLE_TACTIC_HEURISTIC
- TacticSource.JIT_CONVOLUTIONS
- DataType.UINT8
- The following Python enum classes were added:
- TensorIOMode
- PreviewFeature
- Removed the TensorRT layers chapter from the TensorRT Developer Guide
appendix section and created a standalone TensorRT Operator’s
Reference document.
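The following is a minimal sketch of consuming a data-dependent output, such as the one produced by INonZeroLayer, through the new IOutputAllocator and enqueueV3() APIs. The class name NonZeroOutputAllocator and the tensor name "nz_out" are illustrative, and error handling is omitted; this is not a complete application.

#include <NvInfer.h>
#include <cuda_runtime_api.h>

// TensorRT calls back into this object during enqueueV3() once the
// actual size and shape of the data-dependent output are known.
class NonZeroOutputAllocator : public nvinfer1::IOutputAllocator
{
public:
    void* reallocateOutput(char const* /*tensorName*/, void* /*currentMemory*/,
                           uint64_t size, uint64_t /*alignment*/) noexcept override
    {
        cudaFree(devicePtr);          // Release the buffer from any prior run.
        cudaMalloc(&devicePtr, size); // Allocate the just-reported size.
        return devicePtr;
    }

    void notifyShape(char const* /*tensorName*/, nvinfer1::Dims const& dims) noexcept override
    {
        finalDims = dims; // The final, data-dependent output shape.
    }

    void* devicePtr{nullptr};
    nvinfer1::Dims finalDims{};
};

// Usage: attach the allocator before launching inference.
//   NonZeroOutputAllocator allocator;
//   context->setOutputAllocator("nz_out", &allocator);
//   context->enqueueV3(stream);
// After synchronizing the stream, allocator.finalDims holds the shape.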
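For networks built through the C++ API, the same dimension equality as named ONNX dimensions can be declared with the new ITensor::setDimensionName() API. A minimal sketch, assuming network is an existing INetworkDefinition* and the input names and extents are illustrative:

// Giving dimension 0 of both inputs the same name tells TensorRT to
// treat the two runtime values as equal.
nvinfer1::ITensor* ids  = network->addInput("ids",  nvinfer1::DataType::kINT32, nvinfer1::Dims2{-1, 128});
nvinfer1::ITensor* mask = network->addInput("mask", nvinfer1::DataType::kINT32, nvinfer1::Dims2{-1, 128});
ids->setDimensionName(0, "batch");
mask->setDimensionName(0, "batch");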
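Enabling the tactic selection heuristic is a one-line builder-configuration change. A sketch, assuming config is an existing IBuilderConfig*:

// Opt in to heuristic-based tactic selection, trading some potential
// runtime performance for a shorter engine build time.
config->setFlag(nvinfer1::BuilderFlag::kENABLE_TACTIC_HEURISTIC);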
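A timing cache can be serialized and reloaded to avoid re-profiling layer tactics across builds. A sketch, assuming config is an IBuilderConfig* and blob/blobSize hold a cache saved by a previous run (pass nullptr and 0 to start with an empty cache):

// Load (or create) a timing cache and attach it to this build.
nvinfer1::ITimingCache* timingCache = config->createTimingCache(blob, blobSize);
config->setTimingCache(*timingCache, /*ignoreMismatch=*/false);
// ... build the engine ...
// Serialize the possibly updated cache for reuse in the next build.
nvinfer1::IHostMemory* serializedCache = timingCache->serialize();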
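Both preview features described above are opted into through IBuilderConfig::setPreviewFeature(). A minimal sketch, assuming config is an existing IBuilderConfig*:

// Evaluate building without cuDNN/cuBLAS tactics in the core library.
config->setPreviewFeature(
    nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805, true);
// Opt in to the faster dynamic-shapes path for transformer-based networks.
config->setPreviewFeature(
    nvinfer1::PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805, true);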
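The persistent cache budget is set per execution context. A sketch, assuming context is an existing IExecutionContext*; the 1 MiB value is arbitrary and must not exceed the device's persistent L2 limit:

// Allow up to 1 MiB of L2 cache lines to be persisted for this context.
context->setPersistentCacheLimit(1U << 20);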
Deprecated API Lifetime
- APIs deprecated before TensorRT 8.0 will be removed in TensorRT 9.0.
- APIs deprecated in TensorRT 8.0 will be retained until at least 8/2022.
- APIs deprecated in TensorRT 8.2 will be retained until at least
11/2022.
- APIs deprecated in TensorRT 8.4 will be retained until at least 2/2023.
- APIs deprecated in TensorRT 8.5 will be retained until at least 9/2023.
Refer to the API documentation (C++, Python) for how to update your code
to remove the use of deprecated features.
Compatibility
- TensorRT 8.5.1 has been tested with the CUDA®, cuDNN, and cuBLAS versions
  documented in the Features For Platforms And Software section.
- It is suggested that you use TensorRT with a software stack that has been
  tested, including the cuDNN and cuBLAS versions documented in the Features For Platforms And
  Software section. Other semantically compatible releases of cuDNN
  and cuBLAS can be used; however, other versions may bring performance
  improvements as well as regressions. In rare cases, functional regressions might
  also be observed.
Limitations
- There are two modes of DLA softmax, chosen automatically based on the shape
  of the input tensor:
  - the first mode triggers when all nonbatch, non-axis dimensions are
    1, and
  - the second mode triggers in other cases if valid.
  The second mode is supported only for DLA 3.9.0 and later; it involves
  approximations that may introduce small errors. Also, batch size greater
  than 1 is supported only for DLA 3.9.0 and later. Refer to the DLA Supported
  Layers section in the TensorRT Developer Guide
  for details.
- On QNX, networks that are segmented into a large number of DLA loadables may
fail during inference.
- You may encounter an error such as “Unable to load library:
  nvinfer_builder_resource.dll” when using Python 3.9.10
  on Windows. You can work around this issue by downgrading to an earlier
  version of Python 3.9.
- Under some conditions, RNNv2Layer can require a larger
workspace size in TensorRT 8.0 than TensorRT 7.2 in order to run all
supported tactics. Consider increasing the workspace size to work around
this issue.
- CUDA graph capture will capture inputConsumed and profiler
  events only when using the CUDA 11.x build with CUDA 11.1 or later and an
  r455 or later driver.
- The DLA compiler is capable of removing identity transposes, but it cannot
  fuse multiple adjacent transpose layers into a single transpose layer
  (likewise for reshape). For example, a TensorRT IShuffleLayer
  consisting of two non-trivial transposes with an identity reshape
  in between is translated into two consecutive DLA
  transpose layers, unless you merge the transposes together manually in the
  model definition in advance.
- In QAT networks, a group convolution that has a Q/DQ pair on its input but
  no Q/DQ pair on its output can normally run in INT8-in/FP32-out mixed
  precision. However, on NVIDIA Hopper™ architecture GPUs, the
  required GPU kernels may be missing when the input channel count is small,
  causing a fallback to FP32-in/FP32-out. This will be fixed in a future
  release.
- On PowerPC platforms, samples that depend on TensorFlow, ONNX Runtime, and
PyTorch are unable to run due to missing Python module dependencies. These
frameworks have not been built for PowerPC and/or published to standard
repositories.
Deprecated and Removed Features
The following features are deprecated in TensorRT 8.5.1:
- TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x)
devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT
9.0.
Fixed Issues
- TensorRT’s optimizer would sometimes incorrectly retarget Concat layer
inputs produced by Cast layers, resulting in data corruption. This has been
fixed in this release.
- There was an up to 5% performance drop for the ShuffleNet network compared
to TensorRT 8.2 when running in INT8 precision on NVIDIA Ampere architecture
GPUs. This has been fixed in this release.
- There was an up to 10% performance difference for the WaveRNN network
between different OS when running in FP16 precision on NVIDIA Ampere
architecture GPUs. This issue has been fixed in this release.
- When the TensorRT static library was used to build engines and the
NVPTXCompiler static library was used outside of the TensorRT core library
at the same time, it was possible to trigger a crash of the process in rare
cases. This issue has been fixed in this release.
- There was a known issue where, when ProfilingVerbosity was set to
  kDETAILED, the enqueueV2() call could
  take up to 2 ms longer than with ProfilingVerbosity=kNONE or
  kLAYER_NAMES_ONLY. Now, you can use the
  setNvtxVerbosity() API to disable the costly detailed
  NVTX generation at runtime without the need to rebuild the engine.
- There was a performance regression compared to TensorRT 7.1 for some
networks dominated by FullyConnected with activation and bias operations:
- up to 12% in FP32 mode
- up to 10% in FP16 mode on NVIDIA Maxwell® and
NVIDIA Pascal® GPUs
This issue has been fixed in this release.
- When performing PTQ with tensors of rank > 4, some layers could trigger an
  assertion about invalid Region Dims. Empty tensors
  in the network could also cause a segmentation fault during INT8
  calibration. These issues have been fixed in this release.
- When performing an L2_Normalization in FP16 precision, a fusion could
  produce undefined behavior. The fusion could be disabled by
  marking the input to the L2_Normalization as a network output. This issue
  has been fixed in this release.
- TensorRT incorrectly allowed subnetworks of the input network with more
  than 16 I/O tensors to be offloaded to DLA, because intermediate tensors of
  the subnetwork that were also full-network outputs were not counted
  as I/O tensors. This has been fixed. You may infrequently experience
  slightly increased fragmentation.
- There was a ~19% performance drop on NVIDIA Turing GPUs for the DRIVENet
network in FP16 precision compared to TensorRT 8.4. This regression has been
fixed in this release.
- On NVIDIA Hopper GPUs, the QKV plug-in, which is used by the open-source
BERT demo, would fail for fixed sequence lengths of 128 and 384 with FP16.
This issue has been fixed in this release.
- The DLA samples can now be compiled using the commands listed in the
  README.
- In convolution or GEMM layers, some extra tactic information would be
printed out. This issue has been fixed in this release.
- When using QAT, horizontal fusion of convolutions at the end of the network
  would incorrectly propagate the quantization scales of the weights,
  resulting in incorrect outputs. This issue has been fixed in this
  release.
- With the TensorRT ONNX runtime, building an engine for a large graph would
  fail because the implementation of a foreign node could not be found. This
  issue has been fixed in this release.
- The BatchedNMSDynamicPlugin supports dynamic batch only.
The behavior was undefined if other dimensions were set to be dynamic. This
issue has been fixed in this release.
- On NVIDIA Hopper GPUs, the QKV plug-in, which is used by the open-source
BERT demo, would produce inaccurate results for sequence lengths other than
128 and 384. This issue has been fixed in this release.
- A new tactic source, kJIT_CONVOLUTIONS, was added; however,
  enabling or disabling it had no effect because it was still in development.
  This issue has been fixed in this release.
- There was a known issue when using INT8 calibration for networks with
ILoop or IIfConditional layers. This
issue has been fixed in this release.
- There was a ~17% performance drop on NVIDIA Ampere GPUs for the inflated 3D
video classification network in TF32 precision compared to TensorRT
8.4.
- There were up to 25% performance drops for various networks on SBSA systems
compared to TensorRT 8.4.
- There was a ~15% performance drop on NVIDIA Ampere GPUs for the
ResNet_v2_152 network in TF32 precision compared to TensorRT 8.4.
- There was a 17% performance drop for networks containing Deconv+Concat or
Slice+Deconv patterns.
- There was a ~9% performance drop on Volta and Turing GPUs for the WaveRNN
network.
- Some networks would see a small increase in deserialization time. This issue
has been fixed in this release.
- When using QAT, horizontal fusion of two or more convolutions that have
quantized inputs and non-quantized outputs would result in incorrect weights
quantization. This issue has been fixed in this release.
- When invoking IQuantizeLayer::setAxis with the axis set to
-1, the graph optimization process triggered an
assertion. This issue has been fixed in this release.
- If the values (not just dimensions) of an output tensor from a plugin
were used to compute a shape, the engine would fail to build. This issue has
been fixed in this release.
Announcements
- In the next TensorRT release, CUDA toolkit 10.2 support will be
dropped.
- TensorRT 8.5 will be the last release supporting NVIDIA Kepler (SM 3.x)
devices. Support for Maxwell (SM 5.x) devices will be dropped in TensorRT
9.0.
- In the next TensorRT release, cuDNN, cuBLAS, and cuBLASLt tactic sources
will be turned off by default in builder profiling. TensorRT plans to remove
the cuDNN, cuBLAS, and cuBLASLt dependency in future releases. Use the
PreviewFeature flag
kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 to
evaluate the functional and performance impact of disabling cuBLAS and cuDNN
and report back to TensorRT if there are critical regressions in your use
cases.
- TensorRT Python wheel files before TensorRT 8.5, such as TensorRT 8.4, were
published to the NGC PyPI repo. Starting with TensorRT 8.5, Python wheels
will instead be published to upstream PyPI. This will make it easier to
install TensorRT because it requires no prerequisite steps. Also, the name
of the Python package for installation has changed from
nvidia-tensorrt to just tensorrt.
- The C++ and Python API documentation in previous releases was included
  inside the tar file packaging. This release no longer bundles the
  documentation inside the tar file because the online documentation can be
  updated after the release, which avoids mistakes found in stale
  documentation inside the packages.
Known Issues
Functional
- There are known issues reported by the Valgrind memory leak check tool when
  checking TensorRT applications for potential memory leaks. To suppress
  them, provide a Valgrind suppression file with the following contents when
  running the Valgrind memory leak check tool, and add the option
  --keep-debuginfo=yes to the Valgrind command
  line.
{
   Memory leak errors with dlopen
   Memcheck:Leak
   match-leak-kinds: definite
   ...
   fun:*dlopen*
   ...
}
{
   Memory leak errors with nvrtc
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:*libnvrtc.so*
   ...
}
- The Python sample yolov3_onnx has a known issue when installing the
requirements with Python 3.10. The recommendation is to use a Python version
< 3.10 when running the sample.
- The auto-tuner assumes that the number of indices returned by
INonZeroLayer is half of the number of input elements.
Thus, networks that depend on tighter assumptions for correctness may fail
to build.
- SM 7.5 and earlier devices may not have INT8 implementations for all layers
  with Q/DQ nodes. In this case, you will encounter a “could not find
  any implementation” error while building your engine. To resolve
  this, remove the Q/DQ nodes that quantize the failing layers.
- One of the deconvolution algorithms sourced from cuDNN exhibits
  non-deterministic execution. Disabling cuDNN tactics will prevent this
  algorithm from being chosen (refer to
  IBuilderConfig::setTacticSources; a sketch appears after this list).
- TensorRT in FP16 mode does not perform cast operations correctly when only
the output types are set, but not the layer precisions.
- TensorRT does not preserve precision for operations that are imported from
ONNX models in FP16 mode.
- There is a known functional issue (fails with a CUDA error during
compilation) with networks using ILoop layers on the WSL
platform.
- The tactic source cuBLASLt cannot be selected on SM 3.x devices for CUDA
10.x. If selected, it will fall back to using cuBLAS. (not applicable for
Jetson platforms)
- Installing the cuda-compat-11-4 package may interfere with
CUDA enhanced compatibility and cause TensorRT to fail even when the driver
is r465. The workaround is to remove the cuda-compat-11-4
package or upgrade the driver to r470. (not applicable for Jetson
platforms)
- TensorFlow 1.x is not supported for Python 3.9 or newer. Any Python samples
that depend on TensorFlow 1.x cannot be run with Python 3.9 or newer.
- You may see the following error:
"Could not load library libcudnn_ops_infer.so.8. Error: libcublas.so.11: cannot open shared
object file: No such file or directory"
after installing TensorRT from the network repo. cuDNN depends on the RPM
dependency libcublas.so.11()(64bit), however, this
dependency installs cuBLAS from CUDA 11.0 rather than cuBLAS from the latest
CUDA release. The library search path will not be set up correctly and cuDNN
will be unable to find the cuBLAS libraries. The workaround is to install
the latest libcublas-11-x package manually.
- For some networks, using a batch size of 4096 may cause accuracy degradation
on DLA.
- When using DLA, an elementwise, unary, or activation layer immediately
followed by a scale layer may lead to accuracy degradation in INT8 mode.
Note that this is a pre-existing issue also found in previous releases
rather than a regression.
- When using DLA, INT8 convolutions followed by FP16 layers may cause accuracy
degradation. In such cases, either change the convolution to FP16 or the
subsequent layer to INT8.
- When using the algorithm selector API, the HWC1 and HWC4 DLA formats are
both reported as TensorFormat::kDLA_HWC4.
- For transformer decoder-based models (such as GPT2) with dynamic sequence
  lengths, TensorRT 8.5 requires additional workspace (up to 2x) compared
  to previous releases.
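As a reference for the cuDNN tactic workaround mentioned above, the following sketch masks cuDNN out of the allowed tactic sources, assuming config is an existing IBuilderConfig*:

// Remove cuDNN from the allowed tactic sources before building so the
// non-deterministic deconvolution algorithm cannot be chosen.
nvinfer1::TacticSources sources = config->getTacticSources();
sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUDNN));
config->setTacticSources(sources);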
Performance
- There is a ~12% performance drop on NVIDIA Ampere architecture GPUs for the
BERT network on Windows systems.
- There is a known performance issue when running instance normalization
layers on Arm Server Base System Architecture (SBSA).
- There is an up to 22% performance drop for Jasper networks compared to
TensorRT 8.2 when running in FP32 precision on NVIDIA Volta or NVIDIA Turing
GPUs with CUDA 10.2. This performance drop can be avoided if CUDA 11.x is
used instead.
- There is an up to 5% performance drop for the InceptionV4 network compared
to TensorRT 8.2 when running in FP32 precision on NVIDIA Volta GPUs with
CUDA 10.2. This performance drop can be avoided if CUDA 11.x is used
instead.
- There is an up to 27% performance drop for BART compared to TensorRT 8.2
when running with both FP16 and INT8 precisions enabled on T4. This
performance drop can be fixed by disabling the INT8 precision flag.
- There is an up to 10% performance drop for the SegResNet network compared to
TensorRT 8.2 when running in FP16 precision on NVIDIA Ampere architecture
GPUs due to a cuDNN regression in the InstanceNormalization
plug-in. This will be fixed in a future TensorRT release. You can work
around the regression by reverting the cuDNN version to cuDNN 8.2.1.
- There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA
  Orin compared to running the layer on a GPU, with a larger drop for
  larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a
  32% drop when the network runs on DLA compared to when the last SoftMax
  layer runs on a GPU.
- There is an up to 20% performance variation between different engines built
from the same network for some LSTM networks due to unstable tactic
selections.
- There is an up to 11% performance variation for some LSTM networks during
inference depending on the order of CUDA stream creation on NVIDIA Turing
GPUs. This will be fixed in r525 drivers.
- Due to the difference in DLA hardware specification between NVIDIA Orin and
Xavier, a relative increase in latency is expected when running DLA FP16
operations involving convolution (which includes deconvolution,
fully-connected, and concat) on NVIDIA Orin as compared to running on
Xavier. At the same DLA clocks and memory bandwidth, INT8 convolution
operations on NVIDIA Orin are expected to be about 4x faster than on Xavier,
whereas FP16 convolution operations on NVIDIA Orin are expected to be about
40% slower than on Xavier.
- There is a known issue with DLA clocks that requires users to reboot the
system after changing the nvpmodel power mode or otherwise experience a
performance drop. Refer to the L4T board support package Release Notes for
details.
- For transformer-based networks such as BERT and GPT, TensorRT can consume
CPU memory up to 10 times the model size during compilation.
- There is an up to 17% performance regression for DeepASR networks at BS=1 on
NVIDIA Turing GPUs.
- There is an up to 7.5% performance regression compared to TensorRT 8.0.1.6
on NVIDIA Jetson AGX Xavier™ for ResNeXt networks in FP16
mode.
- There is an up to 10-11% performance regression on Xavier compared to
TensorRT 7.2.3 for ResNet-152 with batch size 2 in FP16.
- There is an up to 40% regression compared to TensorRT 7.2.3 for DenseNet
with CUDA 11.3 on P100 and V100. The regression does not exist with CUDA
11.0. (not applicable for Jetson platforms)
- On Xavier, DLA automatically upgrades INT8 LeakyRelu layers to FP16 to
preserve accuracy. Thus, latency may be worse compared to an equivalent
network using a different activation like ReLU. To mitigate this, you can
disable LeakyReLU layers from running on DLA.
- There is an up to 126% performance drop when running some ConvNets on DLA in
parallel to the other DLA and the iGPU on Xavier platforms, compared to
running on DLA alone.
- There is an up to 5% performance drop for networks using sparsity in FP16
precision.
- There is an up to 5% performance drop for Megatron networks in FP32
precision at batch-size = 1 between CUDA 11.8 and CUDA 10.2
on NVIDIA Volta GPUs. This performance drop does not happen on NVIDIA Turing
or later GPUs.
- There is an up to 23% performance drop between H100 and A100 for some
  ConvNets in TF32 precision when running at the same SM clock frequency. This
  will be improved in future TensorRT versions.
- There is an up to 8% performance drop between H100 and A100 for some
transformers, including BERT, BART, T5, and GPT2, in FP16 precision at BS=1
when running at the same SM clock frequency. This will be improved in future
TensorRT versions.
- H100 performance for some ConvNets in TF32 precision is not fully optimized.
This will be improved in future TensorRT versions.
- There is an up to 6% performance drop for ResNeXt-50 QAT networks in INT8,
FP16, and FP32 precision at batch-size = 1 compared to
TensorRT 8.4 on NVIDIA Volta GPUs.
- H100 performance for some Transformers in FP16 precision is not fully
optimized. This will be improved in future TensorRT versions.
- H100 performance for some ConvNets containing depthwise convolutions (like
  QuartzNets and EfficientDet-D0) in INT8 precision is not fully optimized.
  This will be improved in future TensorRT versions.
- H100 performance for some LSTMs in FP16 precision is not fully optimized.
This will be improved in future TensorRT versions.
- H100 performance for some 3DUnets is not fully optimized. This will be
improved in future TensorRT versions.
- There is an up to 6% performance drop for OpenRoadNet networks in TF32
precision compared to TensorRT 8.4 on NVIDIA Ampere architecture GPUs.
- There is an up to 6% performance drop for T5 networks in FP32 precision
compared to TensorRT 8.4 on NVIDIA Volta GPUs due to a functionality
fix.
- There is an up to 5% performance drop for UNet networks in INT8 precision
with explicit quantization on CUDA 11.x compared to CUDA 10.2 on NVIDIA
Turing GPUs.
- There is an up to 6% performance drop for WaveRNN networks in FP16 precision
compared to TensorRT 8.4 on CUDA 11.8 on NVIDIA Volta GPUs. Downgrading CUDA
to CUDA 11.6 fixes the issue.
- There is an up to 13% performance drop for Megatron networks in FP16
  precision on Tesla T4 GPUs when the
  kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 preview feature is
  enabled.
- There is an up to 16% performance drop for LSTM networks in FP32 precision
compared to TensorRT 8.4 on NVIDIA Pascal GPUs.
- There is an up to 17% performance drop for LSTM on Windows in FP16 precision
compared to TensorRT 8.4 on NVIDIA Volta GPUs.
- There is an up to 7% performance drop for Artifact Reduction networks
involving Deconvolution ops in INT8 precision compared to TensorRT 8.4 on
NVIDIA Volta GPUs.
- With the kFASTER_DYNAMIC_SHAPES_0805 preview feature
  enabled on GPT-style decoder models, there can be an up to 20%
  performance regression, for odd sequence lengths only, compared to running
  without the preview feature.