TensorRT 10.15.1 Release Notes#
These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.
Announcements#
TensorRT 11.0 is coming soon with powerful new capabilities designed to accelerate your AI inference workflows:
Enhanced Developer Experience: Improved ease of use and seamless integration with PyTorch and Hugging Face ecosystems
Optimized for High-Growth Workloads: Stronger performance alignment across edge, automotive, and data center deployments
Modernized API: To streamline development, TensorRT 11.0 will remove legacy APIs, including weakly typed APIs, implicit INT8 quantization, IPluginV2, and TREX.
Action Required: We recommend migrating early to strongly typed networks, explicit quantization, IPluginV3, and Nsight Deep Learning Designer.
Breaking packaging changes that may require updates to your build and deployment scripts:
Linux: trtexec and other executables are now installed to /usr/bin (previously /usr/src/tensorrt/bin/) and are added to the system PATH by default. Symlinks are provided for backward compatibility.
Windows: TensorRT library files (*.dll) are now located under the bin subdirectory (previously lib) within the TensorRT zip package.
Static libraries on Linux (libnvinfer_static.a, libnvonnxparser_static.a, etc.) are deprecated starting with TensorRT 10.11 and will be removed in TensorRT 11.0. Migrate to shared libraries.
Python Packaging Changes: Python 3.9 and older have reached end-of-life. To improve Python compatibility with upstream PyPI packages and the TensorRT Python samples, the RPM packages for RHEL/Rocky Linux 8 and RHEL/Rocky Linux 9 now depend on Python 3.12.
Platform Support: Debian 12 is supported for the Server Base System Architecture (SBSA) platform starting with the TensorRT 10.15 release.
Key Features and Enhancements#
Transformer and LLM Optimizations
KV Cache Reuse: Added the KVCacheUpdate API to efficiently reuse the KV cache and save GPU computation, significantly improving performance for transformer-based models. For more information, refer to the Working with Transformers section.
Built-in RoPE Support: TensorRT now includes built-in support for RoPE (Rotary Position Embedding) for transformers. This makes it easier to express RoPE and to convert ONNX models to TensorRT using the new RotaryEmbedding API layer. For more information, refer to the Working with Transformers section.
Dynamic Quantization Enhancements: To support Sage Attention and other models that use per-token quantization, Dynamic Quantization now supports up to 2D blocks, and Quantize and Dequantize support up to ND blocks.
DLA Enhancements
DLA-Only Mode: A new ONNX Parser flag, kREPORT_CAPABILITY_DLA, has been added to generate TensorRT engines from an ONNX model that runs solely on DLA without GPU fallback, providing better deployment flexibility for DLA-targeted workloads. For more information, refer to the Enable DLA Mode When Parsing ONNX Networks section.
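The snippet below is a minimal sketch of building a DLA-only engine from a parsed ONNX model. It assumes the new flag is exposed in the Python bindings as trt.OnnxParserFlag.REPORT_CAPABILITY_DLA (mirroring the C++ kREPORT_CAPABILITY_DLA name described above); the model path and DLA core are placeholders.

```python
# Minimal sketch: parse an ONNX model for DLA-only execution and build the engine.
# The flag name REPORT_CAPABILITY_DLA is assumed from the C++ kREPORT_CAPABILITY_DLA
# flag; verify the exact name in your installed Python bindings.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

# Report DLA capability while parsing so unsupported layers surface early (assumed flag name).
parser.set_flag(trt.OnnxParserFlag.REPORT_CAPABILITY_DLA)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA  # run the network on DLA
config.DLA_core = 0                              # target DLA core 0 (placeholder)
# GPU_FALLBACK is intentionally not set, so the network must run entirely on DLA.
engine_bytes = builder.build_serialized_network(network, config)
```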
ONNX Parser Improvements
Plugin Override Control: The behavior when a TensorRT plugin shares a name with a standard ONNX operator has been improved, and a new ONNX Parser flag, kENABLE_PLUGIN_OVERRIDE, was introduced. For more information, refer to the Using Custom Layers When Importing a Model with a Parser section.
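The sketch below shows how the flag would typically be set from Python. The binding name trt.OnnxParserFlag.ENABLE_PLUGIN_OVERRIDE is assumed from the C++ kENABLE_PLUGIN_OVERRIDE flag, and the exact precedence behavior is documented in the linked section.

```python
# Minimal sketch: opt in to the plugin-override behavior when an ONNX operator name
# collides with a registered TensorRT plugin. The flag name is assumed from the
# C++ kENABLE_PLUGIN_OVERRIDE flag; check your installed bindings.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

# With the flag set, a registered plugin sharing an operator's name can be preferred
# over the standard import path (see the linked section for the exact semantics).
parser.set_flag(trt.OnnxParserFlag.ENABLE_PLUGIN_OVERRIDE)
with open("model_with_custom_op.onnx", "rb") as f:  # placeholder model
    ok = parser.parse(f.read())
```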
Samples and Tools
Strongly Typed Networks Sample: A new Python sample, strongly_type_autocast, has been added to showcase using ModelOpt’s AutoCast tool to convert an FP32 ONNX model to mixed FP32-FP16 precision and building the engine with TensorRT’s strong typing mode. For more information, refer to the Convert ONNX FP32 Model to FP32-FP16 Mixed Precision section.
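The following is a minimal sketch of the strong-typing half of that workflow: building an engine from an ONNX model that has already been converted to mixed FP32-FP16 precision (for example, with ModelOpt’s AutoCast tool). The model filename is a placeholder; refer to the sample for the full AutoCast conversion step.

```python
# Minimal sketch: build a strongly typed engine from a mixed-precision ONNX model.
# The model is assumed to have been produced by an AutoCast-style conversion beforehand.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# STRONGLY_TYPED makes TensorRT honor the tensor dtypes recorded in the model
# instead of performing its own precision selection.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
)
parser = trt.OnnxParser(network, logger)
with open("model_autocast_fp32_fp16.onnx", "rb") as f:  # placeholder filename
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
```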
Bug Fixes and Performance
Windows GPU Support: Support for B200 and B300 GPUs on Windows is no longer considered experimental.
Memory Leak Fix: Fixed a host memory leak issue when building TensorRT engines on NVIDIA Blackwell GPUs.
Fused Multi-Head Attention (MHA): Multiple pointwise inputs are now supported by the fused MHA implementation, and a bug that previously prohibited users from having more than one IAttention in the INetwork has been fixed.
Performance Optimizations on Blackwell GPUs: Fixed multiple performance regressions, including:
An up to 9% regression on B300 compared to B200 for FLUX with FP16 precision
An up to 24% regression on GB200 for ResNext-50 FP8 models when using CUDA 13.0
An up to 25% regression on GB200 for ConvNets with a GlobalAveragePool operation, such as EfficientNet
An up to 10% regression on GB200 for BERT in FP16 precision
Python API Performance: Fixed an up to 40% performance regression with set_input_shape from the Python bindings.
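For reference, the affected call path is the repeated shape update on an execution context, sketched below with a placeholder engine file, input name, and shapes.

```python
# Minimal sketch of the call path affected by this fix: setting dynamic input shapes
# from Python before each inference. Engine file, input name, and shapes are placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

for batch in (1, 2, 4, 8):
    # Repeated shape updates are where the up to 40% Python-side overhead was observed.
    context.set_input_shape("input", (batch, 3, 224, 224))
    # ... bind device buffers and call context.execute_async_v3(stream) here ...
```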
API Enhancements
API Change Tracking: To view API changes between releases, refer to the TensorRT GitHub repository and use the compare tool.
Breaking ABI Changes#
The object files, and therefore the symbols, inside the static library libonnx_proto.a have been merged into the libnvonnxparser_static.a static library. A symlink has been created for backward compatibility. Migrate your build to use the dynamic library libnvonnxparser.so. The static library libonnx_proto.a, as well as libnvonnxparser_static.a, will be removed in TensorRT 11.0.
The TensorRT Windows library files, with extension *.dll, were previously located under the lib subdirectory within the TensorRT zip package. These files are now located under the bin subdirectory within the TensorRT zip package, which is a more common packaging schema for Windows.
Compatibility#
TensorRT 10.15.1 has been tested with the following:
PyTorch >= 2.0 (refer to the requirements.txt file for each sample)
This TensorRT release supports NVIDIA CUDA 12.x and 13.x.
This TensorRT release requires at least NVIDIA driver r535 on Linux or r537 on Windows, as required by CUDA 12.x, which is the minimum CUDA version supported by this TensorRT release. For CUDA 13.x, the minimum NVIDIA driver version is r580.
Limitations#
The high-precision weights used in FP4 double quantization are not refittable.
Python samples do not support Python 3.13. Only the 3.13 Python bindings are currently supported.
Loops with scan outputs (ILoopOutputLayer with the LoopOutput property being either LoopOutput::kCONCATENATE or LoopOutput::kREVERSE) must have the number of iterations set, that is, must have an ITripLimitLayer with TripLimit::kCOUNT. This requirement has always been present but is now explicitly enforced instead of quietly having undefined behavior. A minimal sketch of a conforming loop appears after this list.
ISelectLayer must have data inputs (thenInput and elseInput) of the same datatype.
When implementing a custom layer using the IPluginV3 plugin class where the custom layer has data-dependent shapes (DDS), the size tensors must be of INT64 type only and not INT32 type, as the latter would result in a compilation failure. Related samples have been updated accordingly.
There are no optimized FP8 Convolutions for Group Convolutions; therefore, INT8 is still recommended for ConvNets containing these convolution ops.
A Shuffle op cannot be transformed into a no-op for a performance improvement in some cases. For the NCHW32 format, TensorRT takes the third-to-last dimension as the channel dimension. When a Shuffle op such as [N, C, H, 1] -> [N, C, H] is added, the channel dimension changes from C to N, so the op cannot be transformed into a no-op.
When running an FP32 model in FP16 or BF16 WeaklyTyped mode on Blackwell GPUs, if the FP32 weight values are used by FP16 kernels, TensorRT does not clip the weights to [fp16_lowest, fp16_max] or [bf16_lowest, bf16_max] to avoid overflow, such as inf values. If you see inf graph outputs on Blackwell GPUs only, check whether any FP32 weights cannot be represented by either FP16 or BF16, and update the weights.
The FP8 Convolutions on GPUs with SM89/90/120/121 do not support kernel sizes larger than 32 (for example, 7x7 convolutions); FP16 or FP32 fallback kernels will be used with suboptimal performance. Therefore, do not add FP8 Q/DQ ops before Convolutions with large kernel sizes for better performance.
When building the nonZeroPlugin sample on Windows, you might need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.
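The following is a minimal sketch of a conforming loop for the scan-output requirement above: a CONCATENATE scan output paired with a COUNT trip limit. Tensor names, shapes, and the trip count value are placeholders.

```python
# Minimal sketch: a loop with a CONCATENATE scan output must also carry a COUNT
# trip limit; without it, the build is now rejected rather than undefined.
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()

x = network.add_input("x", trt.float32, (1, 8))
loop = network.add_loop()

# Required: a COUNT trip limit so the number of iterations is known.
trip_count = network.add_constant((), np.array([4], dtype=np.int32)).get_output(0)
loop.add_trip_limit(trip_count, trt.TripLimit.COUNT)

# Simple recurrence: the state starts at x and is updated every iteration.
rec = loop.add_recurrence(x)
updated = network.add_elementwise(
    rec.get_output(0), x, trt.ElementWiseOperation.SUM
).get_output(0)
rec.set_input(1, updated)

# Scan output concatenating the per-iteration state along axis 0; the second input
# supplies the concatenation length and matches the trip count above.
scan_out = loop.add_loop_output(updated, trt.LoopOutput.CONCATENATE, 0)
scan_out.set_input(1, trip_count)
network.mark_output(scan_out.get_output(0))
```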
Deprecated API Lifetime#
APIs deprecated in TensorRT 10.15 will be retained until 1/2027.
APIs deprecated in TensorRT 10.14 will be retained until 10/2026.
APIs deprecated in TensorRT 10.13 will be retained until 8/2026.
APIs deprecated in TensorRT 10.12 will be retained until 6/2026.
APIs deprecated in TensorRT 10.11 will be retained until 5/2026.
APIs deprecated in TensorRT 10.10 will be retained until 4/2026.
APIs deprecated in TensorRT 10.9 will be retained until 3/2026.
APIs deprecated in TensorRT 10.8 will be retained until 2/2026.
Deprecated and Removed Features#
For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.
The TensorRT static libraries are deprecated on Linux starting with TensorRT 10.11. If you are using the static libraries for building your application, migrate to building your application with the shared libraries. The following library files will be removed in TensorRT 11.0.
libnvinfer_static.a
libnvinfer_plugin_static.a
libnvinfer_lean_static.a
libnvinfer_dispatch_static.a
libnvinfer_vc_plugin_static.a
libnvonnxparser_static.a
libonnx_proto.a
INetwork->addNormalization() is deprecated; use INetwork->addNormalizationV2(), which correctly accepts scale and bias inputs of shape [numChannels].
The executable files in Debian and RPM packages are now installed to /usr/bin instead of /usr/src/tensorrt/bin/. To maintain backwards compatibility, the /usr/src/tensorrt/bin/ directory contains symlinks to the binaries in /usr/bin.
Fixed Issues#
Support for B200 and B300 GPUs on Windows is no longer considered experimental and is now fully supported.
Fixed a host memory leak issue when building TensorRT engines on NVIDIA Blackwell GPUs.
Fixed an issue where Fused Multi-Head Attention (MHA) implementation did not support multiple pointwise inputs.
Fixed a bug that previously prohibited users from having more than one IAttention in the INetwork.
Fixed an up to 9% performance regression on B300 compared to B200 for FLUX with FP16 precision. The regression did not exist for FLUX in FP8 or NVFP4 precisions.
Fixed an up to 24% performance regression on GB200 for ResNext-50 FP8 models when using CUDA 13.0 compared to CUDA 12.9 and TensorRT 10.13.0.
Fixed an up to 40% performance regression with
set_input_shapefrom Python binding.Fixed an up to 25% performance regression on GB200 for ConvNets with
GlobalAveragePooloperation like EfficientNet.Fixed an up to 10% performance regression on GB200 for BERT in FP16 precision.
Fixed an issue where PluginV3 AOT compilation would fail with single_tactic mode due to a missing tactic implementation.
Known Issues#
Functional
Running TensorRT with compute-sanitizer may report a free-before-alloc error. Typically, this does not result in actual failures, and a fix is currently under investigation.
Running the demoDiffusion sample on Horizon with the Flux.1 Dev model using FP16 precision may cause the system to reboot. The current workaround is to add the --batch-size 1 flag to the command to run successfully. This issue is under investigation.
FP16 Multi-Head Attention (MHA) layers are not fused when building engines with dynamic shapes, resulting in up to a 10% performance regression compared to TensorRT 10.13. This issue only affects models using dynamic shapes with FP16 MHA layers.
The SmallTileGEMM_TRT plugin (version 1) may not be found on QNX (Drive Orin) with CUDA 11.4, resulting in Error Code 4: API Usage Error (Cannot find plugin: SmallTileGEMM_TRT, version: 1). This prevents engines using this plugin from running on the affected platform.
The demoDiffusion inference pipeline only supports driver versions r580 and newer. The workaround to support lower driver versions is to install the CUDA/Python version that is compatible with the desired driver version.
On Windows, the nvinfer_builder_resource_sm100_10.dll library may fail to load or execute correctly, causing engine build failures for Blackwell GPUs (SM100). This issue is under investigation.
Valgrind may show Invalid read of size 8 when calling cuMemcpyDtoHAsync_v2 and using CUDA 13.0 on edge Blackwell devices.
Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shape will be rejected.
On CUDA versions prior to 13.2, the compute-sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored.
For broadcasting elementwise layers running on DLA with GPU fallback enabled, with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.
The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows.
When linking with libcudart_static.a using a RedHat gcc-toolset-11 or earlier compiler, you could encounter an issue where exception handling does not work: when a throw or exception happens, the catch is ignored, and an abort is raised, killing the program. This can be related to a linker bug causing the eh_frame_hdr ELF segment to be empty. You can work around this issue by using a newer linker, such as the one from gcc-toolset-13.
TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if there is an inconsistency between the indices specified in the batch_indices input and the actual batch size used.
Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.
When running the FLUX Transformer model with 2048x2048 spatial dimensions, it may produce NaN outputs. This can be worked around by using different spatial dimensions or by switching from FP16 to BF16 precision.
The ONNX specification of the NonMaxSuppression operation requires the iou_threshold parameter to be in the range [0.0, 1.0]. However, TensorRT does not validate the value of this parameter; therefore, TensorRT will accept values outside this range, in which case the engine will continue executing as if the value were capped at either end of the range.
PluginV2 in a loop or conditional scope is not supported. Upgrade to the PluginV3 interface as a workaround. This will impact some TensorRT-LLM models with GEMM plugins in a conditional scope.
Performance
SegResNet-style models have an approximately 10-20% performance regression. This will be fixed in the next TensorRT release.
A non-zero tilingOptimizationLevel might introduce engine build failures for some networks on L4 GPUs.
Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refit engines where convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.