TensorRT 10.4.0 Release Notes#
These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.
Announcements#
Platform Support Updates: Added support for Ubuntu 24.04 LTS on Linux x86_64 and SBSA platforms, expanding deployment options for the latest Ubuntu release.
NVIDIA Volta Deprecation: NVIDIA Volta support (GPUs with compute capability 7.0) is deprecated starting with TensorRT 10.0 and will be removed in TensorRT 10.5. Plan migration to supported GPU architectures as detailed in the TensorRT Support Matrix.
Plugin Migration: Several IPluginV2-descendent plugin versions have been deprecated in favor of IPluginV3 versions.
Key Features and Enhancements#
Performance Optimizations
LLM Build Time Improvements: Significantly improved engine building time for Large Language Model (LLM) models in FP16 and FP8 precisions, reducing deployment time for transformer-based architectures.
Vision Transformer FP8 Performance: Enhanced performance of various vision transformers (including ViT, Swin Transformers, DiT, and FasterViT) in FP8 precision on Hopper GPUs when using the Model Optimizer onnx_ptq quantization tool.
Stable Diffusion v3 Optimization: Improved performance of mmDiT (multimodal Diffusion Transformer) in Stable Diffusion v3 in FP16 precision on Hopper GPUs.
ONNX Operator Support
Window Functions: Added support for BlackmanWindow, HannWindow, and HammingWindow ONNX operators, expanding signal processing capabilities for audio and time-series models.
Samples and Tools
Aliased I/O Plugin Sample: Added a new Python sample, aliased_io_plugin, demonstrating how to use TensorRT plugins with aliased I/O to realize in-place updates to input tensors, enabling memory-efficient custom operations.
Bug Fixes and Performance
FP8 Convolution Support: Fixed an engine build failure when FP8 Q/DQ ops were added before convolution operations with input/output channels not multiples of 16.
GEMV Accuracy: Fixed an accuracy issue when networks contained two consecutive GEMV operations (MatrixMultiply with gemmM or gemmN == 1).
INT8 QDQ Performance: Fixed up to 6% performance regression for BERT/Megatron networks with INT8 QDQ on Ampere and Hopper GPUs.
Vision Model Performance: Fixed up to 8% performance regression for PilotNet networks in FP16 precision on Orin and H100 GPUs, and up to 560% regression in TF32/FP32 precisions.
Random Operations: Fixed up to 300% performance regression for networks containing RandomUniform operations.
Transformer INT8 Performance: Fixed performance drops in INT8 precision for Transformer models including ViT, Swin-Transformer, and DETR.
Python 3.11/3.12 Support: Fixed BERT demo support on Python 3.11 and enabled non_zero_plugin and python_plugin samples on Python 3.12.
AArch64 Static Library: Fixed the libnvonnxparser_static.a static library for AArch64 platforms, which previously contained mixed x86_64 and AArch64 object files.
PyPI Installation: Fixed an issue where the loader could not find the lean library when deserializing version-compatible engines after installing via PyPI.
INT4 Tensor Binding: Fixed data corruption when binding an INT4 tensor as a network output.
CUDA 12.6 Update: Resolved memory leak on L4T by updating to CUDA 12.6 from CUDA 12.4.
API Enhancements
API Change Tracking: To view API changes between releases, refer to the TensorRT GitHub repository and use the compare tool.
Compatibility#
TensorRT 10.4.0 has been tested with the following:
PyTorch >= 2.0 (refer to the requirements.txt file for each sample)
This TensorRT release supports NVIDIA CUDA:
This TensorRT release requires at least NVIDIA driver r450 on Linux or r452 on Windows as required by CUDA 11.0, the minimum CUDA version supported by this TensorRT release.
Limitations#
There are no optimized FP8 Convolutions for Group Convolutions and Depthwise Convolutions. Therefore, INT8 is still recommended for ConvNets containing these convolution ops.
The FP8 Convolutions only support input/output channels that are multiples of 16. Otherwise, TensorRT will fall back to non-FP8 convolutions.
The FP8 Convolutions do not support kernel sizes larger than 32 (for example, 7x7 convolutions); FP16 or FP32 fallback kernels will be used with suboptimal performance. Therefore, do not add FP8 Q/DQ ops before Convolutions with large kernel sizes for better performance.
The accumulation dtype for the batched GEMMs in the FP8 MHA must be FP32. This can be achieved by adding Cast (to FP32) ops before the batched GEMM and Cast (to FP16) ops after the batched GEMM; a Cast-insertion sketch appears at the end of this section.
Alternatively, you can convert your ONNX model using TensorRT Model Optimizer, which adds the Cast ops automatically.
There cannot be any pointwise operations between the first batched GEMM and the softmax inside FP8 MHAs (for example, having an attention mask). This will be improved in future TensorRT releases.
The FP8 MHA fusions only support head sizes being multiples of 16. If the MHA has a head size that is not a multiple of 16, do not add Q/DQ ops in the MHA to fall back to the FP16 MHA for better performance.
On QNX, networks that are segmented into a large number of DLA loadables can fail during inference.
The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you manually merge the transposes in the model definition in advance.
nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in implicit batch mode.
For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops (for example, opset 17 for LayerNormalization or opset 18 for GroupNormalization). Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers; an export sketch also appears at the end of this section.
Weight streaming mainly supports GEMM-based networks like Transformers for now. Convolution-based networks may have only a few weights that can be streamed.
Deprecated INT8 implicit quantization and calibrator APIs, including dynamicRangeIsSet, CalibrationAlgoType, IInt8Calibrator, IInt8EntropyCalibrator, IInt8EntropyCalibrator2, IInt8MinMaxCalibrator, setInt8Calibrator, getInt8Calibrator, setCalibrationProfile, getCalibrationProfile, setDynamicRange, getDynamicRangeMin, getDynamicRangeMax, and getTensorsWithDynamicRange. They may not give optimal performance and accuracy. As a workaround, use INT8 explicit quantization instead.
When two convolutions with INT8-QDQ and a residual add share the same weight, constant weight fusion does not occur. Make a copy of the shared weight for better performance.
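The following is a minimal sketch of the manual Cast workaround for FP32 accumulation in the FP8 MHA, using ONNX GraphSurgeon. The file names, and the assumption that the MHA batched GEMMs are MatMul nodes whose names contain "attn", are hypothetical placeholders; adapt the selection logic to your own graph.

    # Hedged sketch: wrap the MHA batched GEMMs (MatMul nodes) in Cast ops so that
    # accumulation happens in FP32. "model.onnx" and the "attn" name filter are
    # placeholders, not part of TensorRT itself.
    import numpy as np
    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("model.onnx"))

    for node in [n for n in graph.nodes if n.op == "MatMul" and "attn" in n.name]:
        # Cast every FP16 input up to FP32 before the batched GEMM.
        new_inputs = []
        for i, inp in enumerate(node.inputs):
            fp32_in = gs.Variable(f"{node.name}_in{i}_fp32", dtype=np.float32)
            graph.nodes.append(gs.Node("Cast", inputs=[inp], outputs=[fp32_in],
                                       attrs={"to": onnx.TensorProto.FLOAT}))
            new_inputs.append(fp32_in)
        node.inputs = new_inputs
        # Cast the FP32 result back down to FP16 after the batched GEMM.
        old_out = node.outputs[0]
        fp32_out = gs.Variable(f"{node.name}_out_fp32", dtype=np.float32)
        node.outputs = [fp32_out]
        graph.nodes.append(gs.Node("Cast", inputs=[fp32_out], outputs=[old_out],
                                   attrs={"to": onnx.TensorProto.FLOAT16}))

    graph.cleanup().toposort()
    onnx.save(gs.export_onnx(graph), "model_fp32_accum.onnx")

TensorRT Model Optimizer performs an equivalent transformation automatically, so this manual path is only needed when you cannot run the optimizer on your model.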
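And here is a minimal export sketch for the normalization-layer recommendation above, assuming a small hypothetical PyTorch module; with opset 17, nn.LayerNorm should be exported as the ONNX LayerNormalization function op rather than a chain of primitive ops.

    # Hedged sketch: export with opset 17 so LayerNormalization is kept as a
    # single function op. The module here is a stand-in for your own model.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(128, 128), nn.LayerNorm(128)).eval()
    dummy = torch.randn(1, 128)
    torch.onnx.export(model, dummy, "model_opset17.onnx", opset_version=17)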
Deprecated API Lifetime#
APIs deprecated in TensorRT 10.4 will be retained until 9/2025.
APIs deprecated in TensorRT 10.3 will be retained until 8/2025.
APIs deprecated in TensorRT 10.2 will be retained until 7/2025.
APIs deprecated in TensorRT 10.1 will be retained until 5/2025.
APIs deprecated in TensorRT 10.0 will be retained until 3/2025.
Deprecated and Removed Features#
For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.
Deprecated NVIDIA Volta support (GPUs with compute capability 7.0) starting with TensorRT 10.0. Volta support will be removed in TensorRT 10.5.
IPluginV2-descendent versions of the following plugins have been deprecated in favor of IPluginV3 versions, which preserve the attributes and I/O characteristics.

Plugin                           | Deprecated Version | Superseded with Version
CustomSkipLayerNormPluginDynamic | 1, 2, 3, 4         | 5, 6, 7, 8
CustomEmbLayerNormPluginDynamic  | 2, 3               | 4, 5
Fixed Issues#
There was a known engine build failure if FP8-Q/DQ ops were added before a convolution op whose input/output channels were not multiples of 16. This issue has been fixed.
A known accuracy issue existed when the network contained two consecutive GEMV operations (MatrixMultiply with gemmM or gemmN == 1). This issue has been fixed.
Fixed an up to 6% performance regression for BERT/Megatron networks with INT8 QDQ compared to TensorRT 10.2 on Ampere and Hopper GPUs.
Fixed an up to 8% performance regression for PilotNet networks in FP16 precision on Orin and H100 GPUs compared to TensorRT 10.2, and up to 560% regression in TF32/FP32 precisions.
Fixed an up to 300% performance regression for networks containing RandomUniform ops compared to TensorRT 8.2.
Fixed an up to 12% performance regression for deep recommender networks in TF32 precision on H100 GPUs compared to TensorRT 10.2.
For some Transformer models, including ViT, Swin-Transformer, and DETR, there was a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision. This issue has been fixed.
The BERT demo was unsupported on Python 3.11 environments. This issue has been fixed.
The Python samples non_zero_plugin and python_plugin are now supported in Python 3.12 environments.
A known accuracy issue existed in the network patterns fc-xelu-bias and conv-xelu-bias (when the bias operation comes after xelu). This issue has been fixed.
The libnvonnxparser_static.a static library included within AArch64 tar packages previously contained a mix of x86_64 and AArch64 object files, which prevented the ONNX parser static library from being used on non-x86 platforms. This issue has been fixed and only AArch64 object files are now included.
When installing TensorRT using the Python wheels hosted on PyPI, the loader could not find the lean library while deserializing version-compatible engines. This issue has been fixed and you no longer need to set LD_LIBRARY_PATH to the location of these libraries.
Fixed an issue where binding an INT4 tensor as a network output caused data corruption.
There was a memory leak on L4T with CUDA 12.4 due to a known driver issue. This issue has been resolved by updating to CUDA 12.6.
With CUDA 12.5 on Windows, fcPlugin (CustomFCPluginDynamic) resulted in CUDA errors on certain GPUs. This issue has been fixed.
Known Issues#
Functional
The size of the compilation cache may increase or decrease slightly for the same network.
The supported I/O formats for the getTensorFormatDesc API are incorrect. This means that the comments for the TensorFormat enum fields might be incorrect, and that getFormatDescStr might indicate Unknown format for a supported combination, or falsely indicate that a combination is valid.
Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shapes will be rejected.
If TensorRT 8.6 or 9.x was installed using the Python Package Index (PyPI), you cannot upgrade TensorRT to 10.x using PyPI. You must first uninstall TensorRT using pip uninstall tensorrt tensorrt-libs tensorrt-bindings, then reinstall TensorRT using pip install tensorrt. This will remove the previous TensorRT version and install the latest TensorRT 10.x. This step is required because the suffix -cuXX was added to the Python package names, which prevents the upgrade from working properly.
CUDA compute sanitizer may report racecheck hazards for some legacy kernels; however, the related kernels do not have functional issues at runtime.
The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.
Multi-Head Attention fusion might not happen, which can affect performance, if the number of heads is small.
An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.

{
   Memory leak errors with dlopen.
   Memcheck:Leak
   match-leak-kinds: definite
   ...
   fun:*dlopen*
   ...
}
{
   Memory leak errors with nvrtc
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:*libnvrtc.so*
   ...
}
SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers; a removal sketch appears at the end of this list.
Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.
For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.
Exclusive padding with kAVERAGE pooling is not supported.
Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.
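The Q/DQ removal mentioned above can be done offline on the ONNX model. Below is a minimal sketch using ONNX GraphSurgeon; the file names and the set of DequantizeLinear output-tensor names used to select the pairs to drop are hypothetical placeholders for your own model.

    # Hedged sketch: bypass selected QuantizeLinear/DequantizeLinear pairs so the
    # affected layers no longer run in INT8. Names below are placeholders.
    import onnx
    import onnx_graphsurgeon as gs

    TENSORS_TO_BYPASS = {"/block42/conv_input_dq"}  # outputs of the DQ nodes to drop

    graph = gs.import_onnx(onnx.load("model_qdq.onnx"))

    for dq in [n for n in graph.nodes if n.op == "DequantizeLinear"]:
        dq_out = dq.outputs[0]
        if dq_out.name not in TENSORS_TO_BYPASS or not dq.inputs[0].inputs:
            continue
        q = dq.inputs[0].inputs[0]  # producer of the DQ input, expected QuantizeLinear
        if q.op != "QuantizeLinear":
            continue
        original = q.inputs[0]  # the tensor that originally fed the QuantizeLinear
        # Rewire every consumer of the DQ output to read the original tensor instead.
        for consumer in list(dq_out.outputs):
            consumer.inputs = [original if t is dq_out else t for t in consumer.inputs]

    graph.cleanup().toposort()  # cleanup() drops the now-dangling Q/DQ nodes
    onnx.save(gs.export_onnx(graph), "model_qdq_pruned.onnx")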
Performance
There is an up to 10% performance regression for the TensorRT-LLM gptj_6b model when the attention plugin is disabled compared to TensorRT 10.3. Enable the attention plugin to work around the regression.
Building engines with the same network twice using the same timing cache may result in a size increase in the timing cache.
There is an up to 16% regression in context memory usage for the Stable Diffusion XL VAE network in FP8 precision on H100 GPUs compared to TensorRT 10.3.
There is an up to 15% regression in context memory usage for networks containing InstanceNorm and Activation ops compared to TensorRT 10.0.
There is an up to 10% inference performance regression for Temporal Fusion Transformers compared to TensorRT 10.3 on Hopper GPUs.
There is an up to 12% inference performance regression for DeBERTa networks compared to TensorRT 10.3 on Ampere GPUs.
Up to 45% build time regression for mamba_370m in FP16 precision and OOTB mode on NVIDIA Ada Lovelace GPUs compared to TensorRT 10.2.
Up to 15% CPU memory usage regression for mbart-cnn/mamba-370m in FP16 precision and OOTB mode on NVIDIA Ada Lovelace GPUs compared to TensorRT 10.2.
Up to 6% performance regression for BERT/Megatron networks in FP16 precision compared to TensorRT 10.2 for BS1 and Seq128 on H100 GPUs.
Up to 6% performance regression for Bidirectional LSTM in FP16 precision on H100 GPUs compared to TensorRT 10.2.
Up to 25% performance regression when running TensorRT-LLM without the attention plugin. The current recommendation is to always enable the attention plugin when using TensorRT-LLM.
There are known performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.
Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.
Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.
Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.
Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.
Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.
Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
Up to 5% performance drop for networks using sparsity in FP16 precision.
Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this; a builder-config sketch follows at the end of these notes.
In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.
The kREFIT and kREFIT_IDENTICAL builder flags have performance regressions compared with non-refit engines where convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.
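For the BERT INT8 workaround above (enable FP16, leave INT8 off), here is a minimal builder-config sketch with the TensorRT Python API; the ONNX file path and output engine path are placeholders.

    # Hedged sketch: build with FP16 enabled and INT8 left unset. "model.onnx"
    # and "model.engine" are placeholder paths.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # TensorRT 10.x networks are explicit batch
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # enable FP16 kernels
    # Do not set trt.BuilderFlag.INT8; leaving it unset keeps INT8 disabled.

    serialized_engine = builder.build_serialized_network(network, config)
    with open("model.engine", "wb") as f:
        f.write(serialized_engine)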