TensorRT 10.10.0 Release Notes
These Release Notes apply to x86 Linux and Windows users, and ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux. This release includes several fixes from the previous TensorRT releases and additional changes.
Announcements
Plugin deprecations: Deprecated PluginVersion and PluginCreatorVersion enums. These are only used in relation to IPluginV2-descendent plugins and IPluginCreator-descendent plugin creators, respectively, which have all been deprecated as of TensorRT 10.10.
Key Features and Enhancements
Large Tensor Support
Enhanced Large Tensor Handling: TensorRT 10.10 enhances support for large tensors, with most layers now capable of handling large volumes. For more information about each layer’s capabilities, refer to the TensorRT Operator documentation.
Performance Optimizations
Blackwell GPU Improvements:
Improved performance for Tanh, Slice, and Concatenate operations
Improved ConvNets performance in INT8 precision
Enabled GEMM+SwiGLU/GeGLU fusion for FP8 and FP16 data types
Improved performance for BF16 and FP16 batched GEMMs with static shapes and small M/N/K dimensions (≤64)
Multi-Head Attention (MHA) Enhancements:
Improved MHA performance and enhanced mask support on Blackwell GPUs
Improved FP8 and BF16 MHA performance on Ada GPUs for long sequence lengths
Improved MHA fusion logic to detect wider variants of MHA patterns. Refer to the Multi-Head Attention Fusion section for more information
Hopper GPU Optimizations: Improved performance for BF16 and FP16 batched GEMMs with static shapes and small M/N/K dimensions (≤64)
Memory Management
Runtime Activation Resize: Added PreviewFeature::kRUNTIME_ACTIVATION_RESIZE_10_10, which allows TensorRT to reduce memory usage by estimating activation memory requirements more accurately from the actual input shapes at runtime.
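A minimal sketch of opting in to this preview feature from Python at build time. It assumes the Python binding exposes the flag as PreviewFeature.RUNTIME_ACTIVATION_RESIZE_10_10 (the usual Python spelling of the C++ enumerator) and that config is an IBuilderConfig. The helper name and the trt_module parameter are hypothetical; they are used so the function degrades gracefully on releases that lack the flag.

```python
def enable_runtime_activation_resize(trt_module, config):
    """Turn on the runtime activation resize preview feature if available.

    trt_module: the imported `tensorrt` module, passed in so this helper
                has no hard dependency on a specific TensorRT version.
    config:     an IBuilderConfig-like object with set_preview_feature().

    Returns True when the feature was enabled, False when the installed
    TensorRT does not expose it.
    """
    # Assumed Python name for PreviewFeature::kRUNTIME_ACTIVATION_RESIZE_10_10.
    feature = getattr(
        trt_module.PreviewFeature, "RUNTIME_ACTIVATION_RESIZE_10_10", None
    )
    if feature is None:
        return False
    config.set_preview_feature(feature, True)
    return True
```

A typical call would pass the imported tensorrt module and the builder config before the engine is built; the return value can be logged so that a silent no-op on an older release is visible.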
API Enhancements
Plugin Registry API: Added IPluginRegistry::getAllCreatorsRecursive(), which enables TensorRT plugins to obtain a list of all plugin creators registered in TensorRT’s plugin registry.
Builder Improvements
Builder Resource Initialization: Improved performance for TensorRT builder resource initialization on pre-Blackwell platforms.
Bug Fixes and Performance
FP8 MHA Performance: Fixed an issue where FP8 MHA performance was lower than BF16/FP16 MHA on SM89 when the sequence length was long (for example, >100k).
FP4 Quantization: Fixed an issue where the scale factor had to be a build-time constant if QuantizeLayer was used with the output FP4 data type.
Memory Leak Fixes: Fixed memory leak issues when TensorRT builds engines on pre-Blackwell GPUs, including issues reported by the Valgrind Memcheck tool, especially for models with convolution layers.
API Enhancements
API Change Tracking: To view API changes between releases, refer to the TensorRT GitHub repository and use the compare tool.
Compatibility
TensorRT 10.10.0 has been tested with the following:
PyTorch >= 2.0 (refer to the requirements.txt file for each sample)
This TensorRT release supports NVIDIA CUDA:
This TensorRT release requires at least NVIDIA driver r450 on Linux or r452 on Windows as required by CUDA 11.0, which is the minimum CUDA version supported by this TensorRT release. For CUDA 12.x the minimum NVIDIA driver version is r535.
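The driver minimums quoted above can be captured in a small lookup, sketched here under the assumption that the driver is identified by its integer branch number (for example, 535 for r535). The table and function names are hypothetical and only restate the figures in this note.

```python
# Minimum driver branches quoted in the compatibility note above,
# keyed by (CUDA major version, operating system).
MIN_DRIVER_BRANCH = {
    (11, "linux"): 450,
    (11, "windows"): 452,
    (12, "linux"): 535,
    (12, "windows"): 535,
}


def meets_min_driver(cuda_major, os_name, driver_branch):
    """Return True if driver_branch satisfies the minimum for this CUDA/OS pair."""
    key = (cuda_major, os_name.lower())
    if key not in MIN_DRIVER_BRANCH:
        raise ValueError(f"no minimum driver listed for CUDA {cuda_major} on {os_name}")
    return driver_branch >= MIN_DRIVER_BRANCH[key]
```

For example, meets_min_driver(12, "Linux", 528) is False because CUDA 12.x needs at least r535.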
Limitations
In some rare cases, FP8 MHA on SM90 might have accuracy issues with sequence lengths < 256.
There are no optimized FP8 Convolutions for Group Convolutions and Depthwise Convolutions; therefore, INT8 is still recommended for ConvNets containing these convolution ops.
The FP8 Convolutions only support input/output channels that are multiples of 16; otherwise, TensorRT will fall back to non-FP8 convolutions.
The FP8 Convolutions do not support kernel sizes larger than 32 (for example, 7x7 convolutions); FP16 or FP32 fallback kernels will be used instead, with suboptimal performance. Therefore, for better performance, do not add FP8 Q/DQ ops before Convolutions with large kernel sizes.
There cannot be any pointwise operations between the first batched GEMM and the softmax inside FP8 MHAs (for example, having an attention mask). This will be improved in future TensorRT releases.
For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops (for example, opset 17 for LayerNormalization or opset 18 for GroupNormalization). Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.
Weight streaming mainly supports GEMM-based networks like Transformers for now. Convolution-based networks may have only a few weights that can be streamed.
When building the nonZeroPlugin sample on Windows, you might need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.
The weights used in INT4 weights-only quantization (WoQ) cannot be refitted.
The high-precision weights used in FP4 double quantization are not refittable.
Python samples do not support Python 3.13. For Python 3.13, only the Python bindings are currently supported.
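The FP8 convolution constraints listed above can be sketched as a pre-flight check to run before inserting FP8 Q/DQ ops around a convolution. This is an illustration, not TensorRT's own selection logic: the function name is hypothetical, and "kernel size" is assumed to mean kernel height times kernel width, which is the reading under which a 7x7 kernel (49) exceeds 32.

```python
def fp8_conv_fallback_reasons(in_channels, out_channels, kernel_h, kernel_w, groups=1):
    """List the documented reasons a convolution would not get an FP8 kernel.

    An empty list means none of the limitations in these notes apply.
    """
    reasons = []
    # Group and depthwise convolutions have no optimized FP8 kernels.
    if groups != 1:
        reasons.append("group or depthwise convolution")
    # Input and output channel counts must both be multiples of 16.
    if in_channels % 16 != 0 or out_channels % 16 != 0:
        reasons.append("channels not a multiple of 16")
    # Assumption: kernel size is measured as height * width.
    if kernel_h * kernel_w > 32:
        reasons.append("kernel size larger than 32")
    return reasons
```

A 3x3 convolution with 64 input and 128 output channels passes all three checks, while a 7x7 stem convolution with 3 input channels fails both the channel and the kernel-size check, so FP8 Q/DQ ops in front of it would not help.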
Deprecated API Lifetime
APIs deprecated in TensorRT 10.10 will be retained until 4/2026.
APIs deprecated in TensorRT 10.9 will be retained until 3/2026.
APIs deprecated in TensorRT 10.8 will be retained until 2/2026.
APIs deprecated in TensorRT 10.7 will be retained until 12/2025.
APIs deprecated in TensorRT 10.6 will be retained until 11/2025.
APIs deprecated in TensorRT 10.5 will be retained until 10/2025.
APIs deprecated in TensorRT 10.4 will be retained until 9/2025.
APIs deprecated in TensorRT 10.3 will be retained until 8/2025.
APIs deprecated in TensorRT 10.2 will be retained until 7/2025.
APIs deprecated in TensorRT 10.1 will be retained until 5/2025.
Deprecated and Removed Features
For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.
Deprecated PluginVersion and PluginCreatorVersion enums. These are only used in relation to IPluginV2-descendent plugins and IPluginCreator-descendent plugin creators, respectively, which have all been deprecated as of 10.10.
Fixed Issues
Fixed an issue where FP8 MHA performance was lower than BF16/FP16 MHA on SM89 when the sequence length was long (for example, >100k).
Fixed an issue where the scale factor had to be a build-time constant if QuantizeLayer was used with the output FP4 data type.
Fixed memory leak issues when TensorRT builds engines on pre-Blackwell GPUs.
Fixed an issue where the Valgrind Memcheck tool could report memory leaks when TensorRT built engines on pre-Blackwell GPUs, especially if models contained convolution layers.
Known Issues
Functional
There is a known accuracy issue on the Conv layers in the SDXL network on NVIDIA B200.
The FLUX Transformer model may produce NaN outputs when run at 2048x2048 spatial dimensions. This can be worked around by using different spatial dimensions.
Support for B100 and B200 on Windows is considered experimental. Some networks may fail to run due to missing kernels for this GPU and OS combination. We plan to improve this support in a future release, but its status will remain experimental at this time.
When running OSS demoBERT FP16 inference on H20 GPUs, different batch sizes may generate different outputs given the same input values. This can be worked around by using a fixed batch size.
There is a known accuracy issue running certain networks on NVIDIA HGX H20.
Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shapes will be rejected.
CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.
The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.
Multi-Head Attention fusion might not happen if the number of heads is small, which affects performance.
An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress these errors, add the option --keep-debuginfo=yes to the Valgrind command line and provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool:
    {
       Memory leak errors with dlopen.
       Memcheck:Leak
       match-leak-kinds: definite
       ...
       fun:*dlopen*
       ...
    }
    {
       Memory leak errors with nvrtc
       Memcheck:Leak
       match-leak-kinds: definite
       fun:malloc
       obj:*libnvrtc.so*
       ...
    }
SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.
For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.
Exclusive padding with kAVERAGE pooling is not supported.
Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.
The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows. This will be fixed in a future release.
When linking with libcudart_static.a using a RedHat gcc-toolset-11 or earlier compiler, you may encounter an issue where exception handling does not work. When an exception is thrown, the catch is ignored and an abort is raised, killing the program. This may be related to a linker bug causing the eh_frame_hdr ELF segment to be empty. You can work around this issue by using a newer linker, such as the one from gcc-toolset-13.
TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if the indices specified in the batch_indices input are inconsistent with the actual batch size used.
In the Validate against Ground Truth section of the efficientnet samples, the link to download Caffe’s ILSVRC2012 auxiliary package is unstable. Therefore, the download might fail intermittently.
The ONNX specification of the NonMaxSuppression operation requires the iou_threshold parameter to be in the range [0.0, 1.0]. However, TensorRT does not validate the value of the parameter; therefore, TensorRT will accept values outside of this range, in which case the engine will continue executing as if the value were capped at the nearer end of this range.
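Because TensorRT does not validate the NonMaxSuppression iou_threshold itself, an exporter or model-checking script may want to do so before the engine is built. The sketch below uses hypothetical helper names: one mirrors the capping behavior described above, and the other rejects out-of-range values outright.

```python
def effective_iou_threshold(iou_threshold):
    """Value the engine behaves as if it were given: capped to [0.0, 1.0]."""
    return min(max(float(iou_threshold), 0.0), 1.0)


def validate_iou_threshold(iou_threshold):
    """Raise ValueError if the threshold is outside the ONNX-specified range."""
    value = float(iou_threshold)
    if not 0.0 <= value <= 1.0:
        raise ValueError(
            f"iou_threshold {value} is outside [0.0, 1.0]; "
            f"the engine would behave as if it were {effective_iou_threshold(value)}"
        )
    return value
```

Validating up front turns a silent change in behavior into an explicit error at export time.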
Performance
There is an accuracy issue when running demo/Diffusion SDXL+ControlNet on B200 GPUs.
Up to 40% ExecutionContext memory regression compared to TensorRT 10.9 for some networks with FP16 precision in NVIDIA Ada and Hopper GPUs.
Up to 20% performance regression compared to TensorRT 10.8 for networks with concatenation nodes that have 100+ inputs. This will be fixed in TensorRT 10.11.
Up to 16% performance regression compared to TensorRT 10.9 for networks with Conv+LeakyReLU, Conv+Switch, and Conv+GeLU in TF32 and FP16 precisions on SM120 Blackwell GPUs. This will be fixed in TensorRT 10.11.
Up to 6% performance regression compared to TensorRT 10.9 for ConvNext in INT8 precision on Hopper and Ampere GPUs.
Up to 26% performance regression for a particular version of GPT-2 which has a large concatenation at the end of the network.
CPU peak memory usage regression with the roberta_base engine on Ampere GPUs compared to TensorRT 10.7.
Up to 10% performance regression for Megatron networks in FP32 precision compared to TensorRT 10.8 for BS4.
Up to 100 MB context memory size regression compared to TensorRT 8.6 on Hopper GPUs for CRNN (Convolutional Recurrent Neural Network) models. Inference performance is not affected.
Up to 9% inference performance regression for StableDiffusion v2.0/2.1 VAE network in FP16 precision on Hopper GPUs compared to TensorRT 10.6 in CUDA 11.8 environment. This issue can be fixed by upgrading CUDA to 12.6.
Up to 60% performance regression compared to TensorRT 8.6 on Ampere GPUs for group convolutions with N channels per group, where N is not a power of 2. This can be worked around by padding N to the next power of 2.
Up to 22% context memory size regression for HiFi-GAN networks in INT8 precision compared to TensorRT 10.5 on Ampere GPUs.
Up to 7% performance regression for Megatron networks in FP16 precision compared to TensorRT 10.6 for BS1 and Seq128 on H100 GPUs.
Up to 10% performance regression for BERT networks exported from TensorFlow2 in FP16 precision compared to TensorRT 10.4 for BS1 and Seq128 on A16 GPUs.
Up to 16% regression in context memory usage for StableDiffusion XL VAE network in FP8 precision on H100 GPUs compared to TensorRT 10.3 due to a necessary functional fix.
Up to 15% regression in context memory usage for networks containing InstanceNorm and Activation ops compared to TensorRT 10.0.
Up to 15% CPU memory usage regression for mbart-cnn/mamba-370m in FP16 precision and OOTB mode on NVIDIA Ada Lovelace GPUs compared to TensorRT 10.2.
Up to 6% performance regression for BERT/Megatron networks in FP16 precision compared to TensorRT 10.2 for BS1 and Seq128 on H100 GPUs.
Up to 6% performance regression for Bidirectional LSTM in FP16 precision on H100 GPUs compared to TensorRT 10.2.
Performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.
Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.
Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.
Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.
Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.
Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.
Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
Convolution on a tensor with an implicitly data-dependent shape may run slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
Up to 5% performance drop for networks using sparsity in FP16 precision.
Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this.
In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after runs with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.
Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refit engines where convolution layers are present within a branch or loop, and the precision is FP16/INT8. This issue will be addressed in future releases.
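For the group-convolution workaround listed above (padding N, the number of channels per group, to the next power of 2), the padded size can be computed as sketched below. The function name is hypothetical, and zero-padding the weights and activations to the new channel count is left to the model-export code.

```python
def padded_channels_per_group(n):
    """Smallest power of 2 that is greater than or equal to n."""
    if n < 1:
        raise ValueError("channels per group must be a positive integer")
    # (n - 1).bit_length() is the exponent of the next power of 2 at or above n.
    return 1 << (n - 1).bit_length()
```

A value that is already a power of 2 is returned unchanged, so the helper can be applied to every group convolution in a network without special-casing.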