Release Notes#
These Release Notes describe the key features, software enhancements and improvements, and known issues for the TensorRT release product package.
To review the TensorRT 10.7.0 and earlier documentation, refer to the TensorRT Archived Documentation.
To review the TensorRT 10.8.0 and later documentation, choose a version from the bottom left navigation selector toggle.
TensorRT 10.14.1#
These are the TensorRT 10.14.1 Release Notes, which apply to x86 Linux and Windows users, and Arm-based CPU cores for Server Base System Architecture (SBSA) users on Linux. This release includes several fixes from the previous TensorRT releases and additional changes.
Announcements
Support for NVIDIA GB300, NVIDIA DGX B300, and NVIDIA DGX Spark has been added in this release. This TensorRT release is expected to be functionally complete and fully performant for these GPUs.
The TensorRT package no longer includes samples and their data. Refer to the TensorRT GitHub repository to retrieve and build samples.
Key Features and Enhancements
The API Capture and Replay tools can now be used on AArch64 platforms.
Added the IAttention API, which allows users to add an attention operator that runs a fused attention kernel.
Added SerializationFlag::kINCLUDE_REFIT to ensure that the serialized engine remains refittable. When serializing a weight-stripping engine without SerializationFlag::kEXCLUDE_WEIGHTS, the resulting serialized engine is not refittable by default. A usage sketch appears after this list.
For the TopK, NMS, and NonZero operations, new APIs have been introduced to control the data type of the output indices, allowing users to specify whether the indices should be INT32 or INT64. Specifically, the new TopK::setIndicesType and TopK::getIndicesType (similarly for NMS and NonZero) APIs enable setting and retrieving the indices data type. Additionally, new versions of the AddTopK, AddNMS, and AddNonZero APIs have been introduced with an extra parameter for specifying the indices data type. A sketch also appears after this list.
Enhanced multi-head attention fusions when the head size does not meet the alignment requirements by padding it automatically.
The builder resources (libnvinfer_builder_resource.so on Linux and nvinfer_builder_resource.dll on Windows) are partitioned according to the architecture to reduce memory usage during engine build. Each partitioned builder resource contains cubins for a single architecture only. Additionally, there is a separate builder resource containing PTX code for hardware forward compatibility serialization. During engine build, only the builder resource corresponding to the architecture of the profile device is loaded.
The number of synchronous memory allocations has been reduced for performance improvements.
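The following C++ sketch shows one way the new serialization flag could be used with a weight-stripped engine. It is illustrative only: it assumes nvinfer1 objects named builder, network, config, and logger already exist, and that SerializationFlag::kINCLUDE_REFIT is set through ISerializationConfig in the same way as the existing kEXCLUDE_WEIGHTS flag.

#include "NvInfer.h"
#include <memory>

// Build a weight-stripped plan, then re-serialize it so it stays refittable.
// Error handling is omitted.
config->setFlag(nvinfer1::BuilderFlag::kSTRIP_PLAN);
config->setFlag(nvinfer1::BuilderFlag::kREFIT_IDENTICAL);
std::unique_ptr<nvinfer1::IHostMemory> strippedPlan{
    builder->buildSerializedNetwork(*network, *config)};

std::unique_ptr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(logger)};
std::unique_ptr<nvinfer1::ICudaEngine> engine{
    runtime->deserializeCudaEngine(strippedPlan->data(), strippedPlan->size())};
// (Typically the engine would be refitted with the full weights here via IRefitter.)

// Without kEXCLUDE_WEIGHTS, the re-serialized plan is not refittable by default;
// kINCLUDE_REFIT keeps it refittable.
std::unique_ptr<nvinfer1::ISerializationConfig> serCfg{engine->createSerializationConfig()};
serCfg->setFlags(1U << static_cast<uint32_t>(nvinfer1::SerializationFlag::kINCLUDE_REFIT));
std::unique_ptr<nvinfer1::IHostMemory> refittablePlan{engine->serializeWithConfig(*serCfg)};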
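Similarly, the new indices-type control might be exercised as sketched below. The setIndicesType name follows this release note and may differ slightly in the headers; the input tensor, k, and reduce-axes values are hypothetical.

// Request INT64 output indices for a TopK layer (illustrative values).
// Assumes `network` is an nvinfer1::INetworkDefinition* and `input` is an ITensor*.
nvinfer1::ITopKLayer* topk =
    network->addTopK(*input, nvinfer1::TopKOperation::kMAX, /*k=*/5, /*reduceAxes=*/1U << 2);
topk->setIndicesType(nvinfer1::DataType::kINT64);  // INT32 remains available as well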
Breaking ABI Changes
The object files, and therefore the symbols, inside the static library libonnx_proto.a have been merged into the libnvonnxparser_static.a static library. A symlink has been created for backward compatibility. Migrate your build to use the dynamic library libnvonnxparser.so. The static library libonnx_proto.a, as well as libnvonnxparser_static.a, will be removed in TensorRT 11.0.
The TensorRT Windows library files, with extension *.dll, were previously located under the lib subdirectory within the TensorRT zip package. These files are now located under the bin subdirectory within the TensorRT zip package, which is a more common packaging schema for Windows.
Compatibility
TensorRT 10.14.1 has been tested with the following:
PyTorch >= 2.0 (refer to the requirements.txt file for each sample)
This TensorRT release supports NVIDIA CUDA:
This TensorRT release requires at least NVIDIA driver r535 on Linux or r537 on Windows, as required by CUDA 12.x, which is the minimum CUDA version supported by this TensorRT release. For CUDA 13.x, the minimum NVIDIA driver version is r580.
Limitations
A shuffle op cannot always be transformed into a no-op for performance improvement. For the NCHW32 format, TensorRT takes the third-to-last dimension as the channel dimension. When a shuffle op such as [N, C, H, 1] -> [N, C, H] is added, the channel dimension changes from C to N, so this op cannot be transformed into a no-op.
When running an FP32 model in FP16 or BF16 weakly typed mode on Blackwell GPUs, if the FP32 weight values are used by FP16 kernels, TensorRT will not clip the weights to [fp16_lowest, fp16_max] or [bf16_lowest, bf16_max] to avoid overflow such as inf values. If you see inf graph outputs on Blackwell GPUs only, check whether any FP32 weights cannot be represented by either FP16 or BF16, and update the weights. A minimal check is sketched after this list.
There are no optimized FP8 Convolutions for Group Convolutions and Depthwise Convolutions. Therefore, INT8 is still recommended for ConvNets containing these convolution ops.
The FP8 Convolutions do not support kernel sizes larger than 32, such as 7x7 convolutions, and FP16 or FP32 fallback kernels will be used with suboptimal performance. Therefore, do not add FP8 Q/DQ ops before Convolutions with large kernel sizes for better performance.
On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless the user merges the transposes manually in the model definition in advance.
nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in implicit batch mode.
For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops, such as opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.
When two convolutions with INT8-QDQ and residual add share the same weight, constant weight fusion will not occur. Make a copy of the shared weight for better performance.
When building the nonZeroPlugin sample on Windows, you may need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.
The weights used in INT4 weights-only quantization (WoQ) cannot be refitted.
The high-precision weights used in FP4 double quantization are not refittable.
Python samples do not support Python 3.13. Only the 3.13 Python bindings are currently supported.
Python samples that require PyCUDA do not support CUDA 13.x.
Loops with scan outputs (ILoopOutputLayer with the LoopOutput property being either LoopOutput::kCONCATENATE or LoopOutput::kREVERSE) must have the number of iterations set, that is, they must have an ITripLimitLayer with TripLimit::kCOUNT. This requirement has always been present, but is now explicitly enforced instead of quietly having undefined behavior. A sketch of setting the trip limit appears after this list.
Fused attention is not supported for IAttention in INT8 and FP8 with custom masks when the dynamic sequence length common value is < 4096, nor for IAttention in FP16 and BF16 with custom masks, for performance reasons on SM100, SM103, and SM110. Set IAttention to decomposable.
ISelectLayer must have data inputs (thenInput and elseInput) of the same data type.
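For the FP16/BF16 overflow limitation above, a simple host-side scan of the FP32 weights can locate values that cannot be represented in the reduced-precision types. This is only a sketch: the function name and the way the weights are stored are placeholders, while 65504 and roughly 3.39e38 are the largest finite FP16 and BF16 magnitudes.

#include <cmath>
#include <cstddef>
#include <cstdio>

// Flag FP32 weight values that overflow the FP16 or BF16 range.
void checkWeightRange(const float* weights, size_t count, const char* name)
{
    constexpr float kFp16Max = 65504.0F;       // largest finite FP16 value
    constexpr float kBf16Max = 3.3895314e38F;  // largest finite BF16 value
    for (size_t i = 0; i < count; ++i)
    {
        float const magnitude = std::fabs(weights[i]);
        if (magnitude > kFp16Max)
        {
            std::printf("%s[%zu] = %g overflows FP16%s\n", name, i,
                static_cast<double>(weights[i]), magnitude > kBf16Max ? " and BF16" : "");
        }
    }
}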
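For the loop requirement above, the sketch below pairs a kCONCATENATE scan output with an explicit kCOUNT trip limit. It assumes network (INetworkDefinition*), input (ITensor*), and tripCount (a 0-D INT32 ITensor* holding the iteration count) already exist, and the loop body is reduced to a placeholder.

// A loop with a scan output must also carry a kCOUNT trip limit.
nvinfer1::ILoop* loop = network->addLoop();
loop->addTripLimit(*tripCount, nvinfer1::TripLimit::kCOUNT);  // now enforced for scan outputs

nvinfer1::IRecurrenceLayer* rec = loop->addRecurrence(*input);
nvinfer1::ITensor* bodyOutput = rec->getOutput(0);  // placeholder for the real loop body
rec->setInput(1, *bodyOutput);

// Scan output that concatenates the per-iteration results along axis 0.
nvinfer1::ILoopOutputLayer* scanOut =
    loop->addLoopOutput(*bodyOutput, nvinfer1::LoopOutput::kCONCATENATE, 0);
scanOut->setInput(1, *tripCount);  // concatenation length for the scan output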
Deprecated API Lifetime
APIs deprecated in TensorRT 10.14 will be retained until 10/2026.
APIs deprecated in TensorRT 10.13 will be retained until 8/2026.
APIs deprecated in TensorRT 10.12 will be retained until 6/2026.
APIs deprecated in TensorRT 10.11 will be retained until 5/2026.
APIs deprecated in TensorRT 10.10 will be retained until 4/2026.
APIs deprecated in TensorRT 10.9 will be retained until 3/2026.
APIs deprecated in TensorRT 10.8 will be retained until 2/2026.
APIs deprecated in TensorRT 10.7 will be retained until 12/2025.
APIs deprecated in TensorRT 10.6 will be retained until 11/2025.
Refer to the API documentation (C++, Python) for instructions on updating your code to remove the use of deprecated features.
Deprecated and Removed Features
The following features have been deprecated or removed in TensorRT 10.14.1.
The TensorRT static libraries are deprecated on Linux starting with TensorRT 10.11. If you are using the static libraries for building your application, migrate to building your application with the shared libraries. The following library files will be removed in TensorRT 11.0.
libnvinfer_static.a
libnvinfer_plugin_static.a
libnvinfer_lean_static.a
libnvinfer_dispatch_static.a
libnvinfer_vc_plugin_static.a
libnvonnxparser_static.a
libonnx_proto.a
The old AddTopK, AddNMS, and AddNonZero APIs have been deprecated. Use the new AddTopK, AddNMS, and AddNonZero APIs instead, which include an additional parameter to specify the data type of the output indices.
Fixed Issues
DynamicQuantize did not support use cases where the batch dimension exceeded INT32_MAX.
There was an up to 78% performance regression on RTX PRO 6000 Blackwell Server Edition for densenet121_opt9 explicit quantization with FP8.
There was an up to 77% performance regression compared to TensorRT 10.12 for QuartzNet networks in FP8 precision on Blackwell GPUs with compute capability 12.0.
When running PluginV3 on Blackwell GPUs, TensorRT may cast the plugin input data types between quantized and non-quantized types. For example, if a tactic claims to support INT8 IO while the IO tensors are FP32, the generated engine could be t1_f32 - Cast(i8) - plugin_tactic - Cast(f32) - t2_f32. The workaround was to disable certain tactics in the plugin. This issue has been fixed in this release.
There was an up to 40% performance regression on GB200 for ViT models in multi-head attention when using CUDA 13.0 compared to CUDA 12.9 and TensorRT 10.13.0.
There was an up to 120 MB context memory regression compared to TensorRT 10.12 for FLUX in BF16 precision on Hopper GPUs.
On TensorRT Auto, there was a mismatch between the symbols exposed by the TensorRT libraries and the TensorRT Python bindings related to the Ahead-of-Time Quickly Deployable Plugin (AOT QDP feature). This issue has been fixed.
MyelinCheckException may have been reported when Slice-Fill-Conv was used on Blackwell GPUs.
Fused multi-head attention kernels in half precision with a custom mask may have experienced up to a 55% performance regression with CUDA 13.0 compared to CUDA 12.9 on Blackwell GPUs.
Known Issues
Functional
Multiple pointwise inputs cannot be supported by the Fused Multi-Head Attention implementation.
Valgrind may show Invalid read of size 8 when calling cuMemcpyDtoHAsync_v2 and using CUDA 13.0 on edge Blackwell devices.
When running the FLUX Transformer model with 2048x2048 spatial dimensions, it may produce NaN outputs. This can be worked around by using different spatial dimensions or by switching from FP16 to BF16 precision.
Support for B200 and B300 GPUs on Windows is considered experimental. Some networks may fail to run due to missing kernels for this GPU and OS combination. We plan to improve this support in a future release; for now, its status remains experimental.
Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shape will be rejected.
The CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, the related kernels do not have functional issues at runtime.
The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.
Multi-head attention fusion might not happen, which can affect performance, if the number of heads is small.
An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.
{
   Memory leak errors with dlopen
   Memcheck:Leak
   match-leak-kinds: definite
   ...
   fun:*dlopen*
   ...
}
{
   Memory leak errors with nvrtc
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:*libnvrtc.so*
   ...
}
SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes, which quantize the failing layers.
For broadcasting elementwise layers running on DLA with GPU fallback enabled, with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.
Exclusive padding with kAVERAGE pooling is not supported.
Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.
The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows. This will be fixed in a future release.
When linking with libcudart_static.a using a RedHat gcc-toolset-11 or earlier compiler, you may encounter an issue where exception handling does not work. When a throw or exception happens, the catch is ignored, and an abort is raised, killing the program. This may be related to a linker bug causing the eh_frame_hdr ELF segment to be empty. You can work around this issue by using a newer linker, such as the one from gcc-toolset-13.
TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if there is an inconsistency between the indices specified in the batch_indices input and the actual batch size used.
In the Validate against Ground Truth section of the efficientnet samples, the link to download Caffe's ILSVRC2012 auxiliary package is unstable. Therefore, the download might fail intermittently.
The ONNX specification of the NonMaxSuppression operation requires the iou_threshold parameter to be in the range [0.0, 1.0]. However, TensorRT does not validate the value of the parameter and will accept values outside of this range, in which case the engine will continue executing as if the value were capped at either end of this range.
PluginV2 in a loop or conditional scope is not supported. Upgrade to the PluginV3 interface as a workaround (WAR). This will impact some TensorRT-LLM models with GEMM plugins in a conditional scope.
There is a known host memory leak issue when building TensorRT engines on NVIDIA Blackwell GPUs.
Performance
Up to 40% performance regression with set_input_shape from the Python bindings.
A non-zero tilingOptimizationLevel might introduce engine build failures for some networks on L4 GPUs.
PluginV3 AOT compilation fails with single_tactic mode due to a missing tactic implementation.
Up to 9% performance regression on B300 compared to B200 for FLUX with FP16 precision. This regression does not exist for FLUX in FP8 or NVFP4 precisions.
Up to 25% performance regression on GB200 for ConvNets with GlobalAveragePool operation like EfficientNet.
Up to 10% performance regression on GB200 for BERT in FP16 precision.
Up to 18% performance regression on GB200 for ConvNets using multiple auxiliary streams.
Up to 24% performance regression on GB200 for ResNext-50 FP8 models when using CUDA 13.0 compared to CUDA 12.9 and TensorRT 10.13.0.
Up to 6% performance regression compared to TensorRT 10.9 for ConvNext in INT8 precision on Hopper and Ampere GPUs.
CPU peak memory usage regression with the roberta_base engine on Ampere GPUs compared to TensorRT 10.7.
Up to 10% performance regression for Megatron networks in FP32 precision compared to TensorRT 10.8 for BS4.
Up to 100 MB context memory size regression compared to TensorRT 8.6 on Hopper GPUs for CRNN (Convolutional Recurrent Neural Network) models. Inference performance is not affected.
Up to 9% inference performance regression for the StableDiffusion v2.0/2.1 VAE network in FP16 precision on Hopper GPUs compared to TensorRT 10.6 in a CUDA 11.8 environment. This issue can be fixed by upgrading CUDA to 12.6.
Up to 60% performance regression compared to TensorRT 8.6 on Ampere GPUs for group convolutions with N channels per group, where N is not a power of 2. This can be worked around by padding N to the next power of 2.
Up to 22% context memory size regression for HiFi-GAN networks in INT8 precision compared to TensorRT 10.5 on Ampere GPUs.
Up to 7% performance regression for Megatron networks in FP16 precision compared to TensorRT 10.6 for BS1 and Seq128 on H100 GPUs.
Up to 10% performance regression for BERT networks exported from TensorFlow2 in FP16 precision compared to TensorRT 10.4 for BS1 and Seq128 on A16 GPUs.
Up to 16% regression in context memory usage for the StableDiffusion XL VAE network in FP8 precision on H100 GPUs compared to TensorRT 10.3 due to a necessary functional fix.
Up to 15% regression in context memory usage for networks containing InstanceNorm and Activation ops compared to TensorRT 10.0.
Up to 15% CPU memory usage regression for mbart-cnn/mamba-370m in FP16 precision and OOTB mode on NVIDIA Ada Lovelace GPUs compared to TensorRT 10.2.
Up to 6% performance regression for BERT/Megatron networks in FP16 precision compared to TensorRT 10.2 for BS1 and Seq128 on H100 GPUs.
Up to 6% performance regression for Bidirectional LSTM in FP16 precision on H100 GPUs compared to TensorRT 10.2.
Performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.
Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.
Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.
Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.
Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.
Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.
Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
Convolution on a tensor with an implicitly data-dependent shape may run slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
Up to 5% performance drop for networks using sparsity in FP16 precision.
Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this; a sketch of the flag change appears at the end of this list.
In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after runs with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.
The kREFIT and kREFIT_IDENTICAL builder flags have performance regressions compared with non-refit engines where convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.
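For the BERT INT8 regression on L4 GPUs noted above, the suggested workaround amounts to flipping the precision flags on the builder configuration. A minimal sketch, assuming config is an existing nvinfer1::IBuilderConfig*:

// Work around the BERT INT8 regression on L4 GPUs: enable FP16, keep INT8 disabled.
config->setFlag(nvinfer1::BuilderFlag::kFP16);
config->clearFlag(nvinfer1::BuilderFlag::kINT8);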