TensorRT 10.7.0 Release Notes#
These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.
Announcements#
Platform updates: Starting with TensorRT 10.8 (the next TensorRT release), the minimum glibc version for the Linux x86 build will be 2.28. Currently (TensorRT 10.7), the minimum glibc version for the Linux x86 build is 2.17. TensorRT 10.8 is expected to be compatible with RedHat 8.x (and derivatives) and newer RedHat distributions. It is also expected to be compatible with Ubuntu 20.04 and newer Ubuntu distributions.
Key Features and Enhancements#
Developer Tools
Nsight Deep Learning Designer Support: Performance profiling and engine building with TensorRT 10.7 are now supported by Nsight Deep Learning Designer 2024.1.
Performance Optimizations
GroupNormalization Performance: Improved GroupNormalization performance on Hopper GPUs when the GroupNormalization is right after a Convolution operation.
Vision Transformer Improvements: Improved Vision Transformer performance when FP16 Multi-Head Attention (MHA) is used together with FP8 GEMMs.
Multi-Head Attention (MHA) Enhancements: Enabled FP8 Multi-Head Attention (MHA) fusions for Ada GPUs for larger sequence lengths and head sizes.
FP8 GEMM Scaling: Enabled per-channel scaling for FP8 GEMMs on Ada GPUs.
API Enhancements
Engine Deserialization API: Added a new API for deserializing CUDA engines: ICudaEngine* deserializeCudaEngine(IStreamReaderV2& streamReader). This new method introduces the IStreamReaderV2 interface, which supports reading to both host and device pointers. It enables potential performance improvements, especially when used with NVIDIA GPUDirect or weight streaming. This new API is designed to replace the older IStreamReader-based deserialization method (see the sketch after this list).
Squeeze and Unsqueeze Layers: Added two new network layers, ISqueezeLayer and IUnsqueezeLayer, to represent squeeze and unsqueeze operations with shape tensor axes, as an alternative to using IShuffleLayer.
API Change Tracking: To view API changes between releases, refer to the TensorRT GitHub repository and use the compare tool.
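As a minimal sketch of the new deserialization path, the example below implements a simple file-backed IStreamReaderV2 and passes it to IRuntime::deserializeCudaEngine. The read(destination, nbBytes, stream) and seek(offset, where) overrides, the SeekPosition enumerators, and the model.engine file name are assumptions based on a reading of the TensorRT 10.7 headers and should be verified against NvInferRuntime.h; the logger is a placeholder.

// Sketch only: a file-backed IStreamReaderV2 used to deserialize an engine.
// The virtual-method signatures below are assumptions taken from a reading of
// the TensorRT 10.7 NvInferRuntime.h header; verify them before relying on this.
#include <NvInferRuntime.h>
#include <cuda_runtime_api.h>
#include <cstdio>
#include <iostream>
#include <memory>

class FileStreamReaderV2 : public nvinfer1::IStreamReaderV2
{
public:
    explicit FileStreamReaderV2(char const* path) : mFile(std::fopen(path, "rb")) {}
    ~FileStreamReaderV2() override
    {
        if (mFile) std::fclose(mFile);
    }

    // Reposition the read cursor within the serialized engine.
    bool seek(int64_t offset, nvinfer1::SeekPosition where) noexcept override
    {
        int const whence = (where == nvinfer1::SeekPosition::kSET) ? SEEK_SET
            : (where == nvinfer1::SeekPosition::kCUR) ? SEEK_CUR
                                                      : SEEK_END;
        return std::fseek(mFile, static_cast<long>(offset), whence) == 0;
    }

    // Copy up to nbBytes into destination. TensorRT may pass a device pointer here;
    // this simple sketch only handles host destinations and ignores the stream.
    int64_t read(void* destination, int64_t nbBytes, cudaStream_t /*stream*/) noexcept override
    {
        return static_cast<int64_t>(std::fread(destination, 1, static_cast<size_t>(nbBytes), mFile));
    }

private:
    std::FILE* mFile{nullptr};
};

class StderrLogger : public nvinfer1::ILogger // placeholder logger
{
    void log(Severity severity, char const* msg) noexcept override
    {
        if (severity <= Severity::kWARNING) std::cerr << msg << "\n";
    }
};

int main()
{
    StderrLogger logger;
    std::unique_ptr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(logger)};
    FileStreamReaderV2 reader{"model.engine"}; // hypothetical engine file name
    std::unique_ptr<nvinfer1::ICudaEngine> engine{runtime->deserializeCudaEngine(reader)};
    return engine != nullptr ? 0 : 1;
}

Compared with the IStreamReader path, the V2 interface lets TensorRT request reads into device memory as well as host memory, which is what enables the GPUDirect and weight-streaming improvements mentioned above.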
Bug Fixes and Performance
Multi-Stream Mode Fix: Fixed an issue where TensorRT could hang in multi-stream mode when certain kernels were selected and fused on Hopper GPUs due to a warp synchronization bug.
BERT Plugin Fix: Fixed an issue where CustomQKVToContextPluginDynamic version 4 (bertQKVToContextPlugin version 4) raised an internal error if either the batch or sequence dimension differed at runtime from the ones used to serialize the engine.
OneHot Layer Enhancement: IOneHotLayer can now compute a shape tensor. Previously, the builder would fail with a message that an IOneHotLayer cannot be used to compute a shape tensor.
Debug Tensor Fix: Fixed a failure when multiple graph input tensors were set as debug tensors by checking whether the tensor has already been decorated as a debug tensor and avoiding performing insertProcessDebugTensorOps in IConfig twice on the same input tensor.
Compatibility#
TensorRT 10.7.0 has been tested with the following:
PyTorch >= 2.0 (refer to the requirements.txt file for each sample)
This TensorRT release supports NVIDIA CUDA:
This TensorRT release requires at least NVIDIA driver r450 on Linux or r452 on Windows as required by CUDA 11.0, the minimum CUDA version supported by this TensorRT release.
Limitations#
There is a known issue with using the markDebug API to mark multiple graph input tensors as debug tensors.
There are no optimized FP8 Convolutions for Group Convolutions and Depthwise Convolutions; therefore, INT8 is still recommended for ConvNets containing these convolution ops.
The FP8 Convolutions only support input/output channels that are multiples of 16; otherwise, TensorRT will fall back to non-FP8 convolutions.
The FP8 Convolutions do not support kernel sizes larger than 32 (for example, 7x7 convolutions); FP16 or FP32 fallback kernels will be used with suboptimal performance. Therefore, do not add FP8 Q/DQ ops before Convolutions with large kernel sizes for better performance.
The accumulation dtype for the batched GEMMs in the FP8 Multi-Head Attention (MHA) must be FP32. This can be achieved by adding Cast (to FP32) ops before the batched GEMM and Cast (to FP16) ops after the batched GEMM.
Alternatively, you can convert your ONNX model using TensorRT Model Optimizer, which adds the Cast ops automatically.
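As an illustrative sketch only (not taken from the TensorRT samples), the same cast pattern expressed directly with the network definition API would look roughly like the following. The function name, tensor shapes, and surrounding MHA construction are hypothetical; addCast and addMatrixMultiply are existing INetworkDefinition calls.

// Sketch: give the batched GEMM (Q x K^T) FP32 accumulation by casting its
// FP16 inputs to FP32 and casting the product back to FP16, as described above.
// Shapes, names, and the surrounding builder/network setup are illustrative only.
#include <NvInfer.h>

void addCastWrappedBatchedGemm(nvinfer1::INetworkDefinition& network,
                               nvinfer1::ITensor& queryFp16, // e.g. [B, H, S, D] in FP16
                               nvinfer1::ITensor& keyFp16)   // e.g. [B, H, S, D] in FP16
{
    using namespace nvinfer1;

    // Cast both GEMM inputs to FP32 so the batched GEMM accumulates in FP32.
    ITensor* q32 = network.addCast(queryFp16, DataType::kFLOAT)->getOutput(0);
    ITensor* k32 = network.addCast(keyFp16, DataType::kFLOAT)->getOutput(0);

    // Batched GEMM: Q x K^T.
    IMatrixMultiplyLayer* qk = network.addMatrixMultiply(
        *q32, MatrixOperation::kNONE, *k32, MatrixOperation::kTRANSPOSE);

    // Cast the scores back to FP16 before the softmax and the rest of the MHA block.
    ITensor* scores16 = network.addCast(*qk->getOutput(0), DataType::kHALF)->getOutput(0);
    (void) scores16; // feed this tensor into the remaining attention layers
}

Exporting the equivalent Cast nodes in the ONNX model, or letting TensorRT Model Optimizer insert them, achieves the same effect.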
There cannot be any pointwise operations between the first batched GEMM and the softmax inside FP8 Multi-Head Attention (MHA) (for example, having an attention mask). This will be improved in future TensorRT releases.
The FP8 Multi-Head Attention (MHA) fusions only support head sizes being multiples of 16. If the Multi-Head Attention (MHA) has a head size that is not a multiple of 16, do not add Q/DQ ops in the Multi-Head Attention (MHA) to fall back to the FP16 Multi-Head Attention (MHA) for better performance.
On QNX, networks that are segmented into a large number of DLA loadables can fail during inference.
The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you manually merge the transposes in the model definition in advance.
nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in implicit batch mode.
For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops (for example, opset 17 for LayerNormalization or opset 18 for GroupNormalization). Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.
Weight streaming mainly supports GEMM-based networks like Transformers for now. Convolution-based networks may have only a few weights that can be streamed.
When two convolutions with INT8-QDQ and residual add share the same weight, constant weight fusion does not occur. Make a copy of the shared weight for better performance.
When building the nonZeroPlugin sample on Windows, you might need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.
Deprecated API Lifetime#
APIs deprecated in TensorRT 10.7 will be retained until 12/2025.
APIs deprecated in TensorRT 10.6 will be retained until 11/2025.
APIs deprecated in TensorRT 10.5 will be retained until 10/2025.
APIs deprecated in TensorRT 10.4 will be retained until 9/2025.
APIs deprecated in TensorRT 10.3 will be retained until 8/2025.
APIs deprecated in TensorRT 10.2 will be retained until 7/2025.
APIs deprecated in TensorRT 10.1 will be retained until 5/2025.
APIs deprecated in TensorRT 10.0 will be retained until 3/2025.
Deprecated and Removed Features#
For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.
Deprecated kDIRECT_IO. For DLA, use the kGPU_FALLBACK flag to control the IO tensor reformat.
Deprecated IRuntime::deserializeCudaEngine(IStreamReader&). Superseded by IRuntime::deserializeCudaEngine(IStreamReaderV2&), which offers the ability to load host and device memory separately, enabling integration of engine loading code with technology like GDS (GPU Direct Storage).
On Blackwell and later platforms, TensorRT will drop cuDNN support for the following categories of plugins:
User-written IPluginV2Ext, IPluginV2DynamicExt, and IPluginV2IOExt plugins that are dependent on cuDNN handles provided by TensorRT (via the attachToContext() API).
TensorRT standard plugins that use cuDNN, specifically: InstanceNormalization_TRT (versions 1, 2, and 3) and GroupNormalizationPlugin (version 1).
Note
These normalization plugins are superseded by TensorRT’s native INormalizationLayer. TensorRT support for cuDNN-dependent plugins remains unchanged on pre-Blackwell platforms.
Fixed Issues#
Fixed an issue where TensorRT could hang in multi-stream mode when certain kernels were selected and fused on Hopper GPUs due to a warp synchronization bug.
Fixed an issue where CustomQKVToContextPluginDynamic version 4 (bertQKVToContextPlugin version 4) raised an internal error if either the batch or sequence dimension differed at runtime from the ones used to serialize the engine.
Fixed an issue where IOneHotLayer could not compute a shape tensor. Previously, the builder would fail with a message that an IOneHotLayer cannot be used to compute a shape tensor.
Fixed a failure when multiple graph input tensors were set as debug tensors by checking whether the tensor has already been decorated as a debug tensor and avoiding performing insertProcessDebugTensorOps in IConfig twice on the same input tensor.
Known Issues#
Functional
Sometimes IConstantLayer followed by ICastLayer can cause errors.
When running OSS demoBERT FP16 inference on H20 GPUs, different batch sizes may generate different outputs given the same input values. This can be worked around by using a fixed batch size.
Some TF32 convolution tactics may cause a CUDA illegal memory access error if the input or output tensor has more than 2^30 elements. The workaround is to disable TF32 and use a different precision like FP32 or FP16 instead.
There is a known accuracy issue running certain networks on NVIDIA HGX H20.
Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shapes will be rejected.
If TensorRT 8.6 or 9.x was installed using the Python Package Index (PyPI), you cannot upgrade TensorRT to 10.x using PyPI. You must first uninstall TensorRT using pip uninstall tensorrt tensorrt-libs tensorrt-bindings, then reinstall TensorRT using pip install tensorrt. This will remove the previous TensorRT version and install the latest TensorRT 10.x. This step is required because the suffix -cuXX was added to the Python package names, which prevents the upgrade from working properly.
CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.
The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.
Multi-Head Attention (MHA) fusion might not happen and affect performance if the number of heads is small.
An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.

{
  Memory leak errors with dlopen.
  Memcheck:Leak
  match-leak-kinds: definite
  ...
  fun:*dlopen*
  ...
}
{
  Memory leak errors with nvrtc
  Memcheck:Leak
  match-leak-kinds: definite
  fun:malloc
  obj:*libnvrtc.so*
  ...
}
SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes, which quantize the failing layers.
Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.
For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.
Exclusive padding with kAVERAGE pooling is not supported.
Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.
The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows. This will be fixed in a future release.
Performance
Up to 30% performance gaps between fused Multi-Head Attention (MHA) kernels built with dynamic sequence lengths versus fused Multi-Head Attention (MHA) kernels built with static sequence lengths when the maximum sequence length is much greater than the optimal sequence length in optimization profiles on Hopper GPUs.
Up to 9% inference performance regression for StableDiffusion v2.0/2.1 VAE network in FP16 precision on Hopper GPUs compared to TensorRT 10.6 in a CUDA 11.8 environment. This issue can be fixed by upgrading CUDA to 12.6.
Up to 60% performance regression compared to TensorRT 8.6 on Ampere GPUs for group convolutions with N channels per group, where N is not a power of 2. This can be worked around by padding N to the next power of 2.
A performance regression is expected for TensorRT 10.x with respect to TensorRT 8.6 for networks with operations that involve data-dependent shapes, such as non-max suppression or non-zero operations. The amount of regression is roughly proportional to the number of such layers in the network.
Up to 22% context memory size regression for HiFi-GAN networks in INT8 precision compared to TensorRT 10.5 on Ampere GPUs.
Up to 10% performance regression for BERT networks exported from TensorFlow2 in FP16 precision compared to TensorRT 10.4 for BS1 and Seq128 on A16 GPUs.
Up to 16% regression in context memory usage for StableDiffusion XL VAE network in FP8 precision on H100 GPUs compared to TensorRT 10.3 due to a necessary functional fix.
Up to 15% regression in context memory usage for networks containing InstanceNorm and Activation ops compared to TensorRT 10.0.
Up to 15% CPU memory usage regression for mbart-cnn/mamba-370m in FP16 precision and OOTB mode on NVIDIA Ada Lovelace GPUs compared to TensorRT 10.2.
Up to 6% performance regression for BERT/Megatron networks in FP16 precision compared to TensorRT 10.2 for BS1 and Seq128 on H100 GPUs.
Up to 6% performance regression for Bidirectional LSTM in FP16 precision on H100 GPUs compared to TensorRT 10.2.
Up to 25% performance regression when running TensorRT-LLM without the attention plugin. The current recommendation is always to enable the attention plugin when using TensorRT-LLM.
Performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.
Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.
Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.
Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.
Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.
Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.
Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
Convolution on a tensor with an implicitly data-dependent shape may run slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
Up to 5% performance drop for networks using sparsity in FP16 precision.
Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this.
In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after runs with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.
The kREFIT and kREFIT_IDENTICAL builder flags have performance regressions compared with non-refit engines where convolution layers are present within a branch or loop, and the precision is FP16/INT8. This issue will be addressed in future releases.