TensorRT 10.15.1 Release Notes#

These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.

Announcements#

TensorRT 11.0 is coming soon with powerful new capabilities designed to accelerate your AI inference workflows:

  • Enhanced Developer Experience: Improved ease of use and seamless integration with PyTorch and Hugging Face ecosystems

  • Optimized for High-Growth Workloads: Stronger performance alignment across edge, automotive, and data center deployments

  • Modernized API: To streamline development, TensorRT 11.0 will remove legacy APIs, including the weakly typed APIs, implicit INT8 quantization, IPluginV2, and TREX.

    Action Required: We recommend migrating early to strongly typed networks, explicit quantization, IPluginV3, and Nsight Deep Learning Designer.

Breaking packaging changes that may require updates to your build and deployment scripts:

  • Linux: trtexec and other executables are now installed to /usr/bin (previously /usr/src/tensorrt/bin/) and are added to the system PATH by default. Symlinks are provided for backward compatibility.

  • Windows: TensorRT library files (*.dll) are now located under the bin subdirectory (previously lib) within the TensorRT zip package.

  • Static libraries on Linux (libnvinfer_static.a, libnvonnxparser_static.a, etc.) are deprecated starting with TensorRT 10.11 and will be removed in TensorRT 11.0. Migrate to shared libraries.

Python Packaging Changes: Python 3.9 and older have reached end-of-life. To improve Python compatibility with upstream PyPI packages and the TensorRT Python samples, the RPM packages for RHEL/Rocky Linux 8 and RHEL/Rocky Linux 9 now depend on Python 3.12.

Platform Support: Debian 12 is supported for the Server Base System Architecture (SBSA) platform starting with the TensorRT 10.15 release.

Key Features and Enhancements#

Transformer and LLM Optimizations

  • KV Cache Reuse: Added KVCacheUpdate API to efficiently reuse KV cache and save GPU computation, significantly improving performance for transformer-based models. For more information, refer to the Working with Transformers section.

  • Built-in RoPE Support: TensorRT now includes built-in support for RoPE (Rotary Position Embedding) for transformers. This makes it easier to express RoPE and convert ONNX models with the new RotaryEmbedding API layer to TensorRT. For more information, refer to the Working with Transformers section.

  • Dynamic Quantization Enhancements: To support Sage Attention and other models that use per-token quantization, Dynamic Quantization now supports up to 2D blocks, and Quantize and Dequantize support up to ND blocks.

DLA Enhancements

  • DLA-Only Mode: A new ONNX Parser flag, kREPORT_CAPABILITY_DLA, has been added for generating TensorRT engines from an ONNX model that run solely on DLA without GPU fallback, providing better deployment flexibility for DLA-targeted workloads. For more information, refer to the Enable DLA Mode When Parsing ONNX Networks section.
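
    The snippet below is a minimal C++ sketch of this workflow. The parser flag name is taken from this release note and its exact semantics are assumed; the builder-side calls (setDefaultDeviceType and setDLACore) are the standard DLA setup, and error handling is omitted for brevity.

      #include <memory>
      #include "NvInfer.h"
      #include "NvOnnxParser.h"

      // Hedged sketch: parse an ONNX model and build an engine for DLA-only execution (no GPU fallback).
      bool buildDlaOnlyEngine(nvinfer1::ILogger& logger, char const* onnxPath)
      {
          auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
          auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
          auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));

          // Reject layers that cannot run on DLA instead of silently falling back to the GPU.
          parser->setFlag(nvonnxparser::OnnxParserFlag::kREPORT_CAPABILITY_DLA);
          if (!parser->parseFromFile(onnxPath, static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
          {
              return false;
          }

          auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
          config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA); // run all layers on DLA
          config->setDLACore(0);                                    // target DLA core 0
          // BuilderFlag::kGPU_FALLBACK is intentionally not set.

          auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
              builder->buildSerializedNetwork(*network, *config));
          return serialized != nullptr;
      }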

ONNX Parser Improvements

  • Plugin Override Control: The behavior of a TensorRT plugin that shares a name with a standard ONNX operator has been improved, and a new ONNX Parser flag, kENABLE_PLUGIN_OVERRIDE, has been introduced. For more information, refer to the Using Custom Layers When Importing a Model with a Parser section.
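
    A hedged C++ sketch of opting in follows. It assumes kENABLE_PLUGIN_OVERRIDE is exposed on nvonnxparser::OnnxParserFlag and that setting it lets a registered plugin that shares a name with a standard ONNX operator take precedence over the built-in importer for that operator.

      #include <memory>
      #include "NvInfer.h"
      #include "NvOnnxParser.h"

      // Hedged sketch; the flag name comes from this release note and its exact semantics are assumed.
      bool parseWithPluginOverride(nvinfer1::INetworkDefinition& network, nvinfer1::ILogger& logger,
          char const* onnxPath)
      {
          auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(network, logger));
          // Allow a plugin registered under the same name as a standard ONNX operator to be used
          // for that operator instead of the built-in import.
          parser->setFlag(nvonnxparser::OnnxParserFlag::kENABLE_PLUGIN_OVERRIDE);
          return parser->parseFromFile(onnxPath, static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));
      }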

Samples and Tools

  • Strongly Typed Networks Sample: A new Python sample, strongly_type_autocast, has been added to showcase using ModelOpt’s AutoCast tool to convert an FP32 ONNX model to mixed FP32-FP16 precision and building the engine with TensorRT’s strong typing mode. For more information, refer to the Convert ONNX FP32 Model to FP32-FP16 Mixed Precision section.
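
    The AutoCast conversion itself happens in ModelOpt before TensorRT is involved; the sample is Python, but the C++ sketch below shows the equivalent TensorRT side, building an engine from the converted ONNX model with strong typing enabled via NetworkDefinitionCreationFlag::kSTRONGLY_TYPED so the precisions recorded in the model are honored as-is.

      #include <memory>
      #include "NvInfer.h"
      #include "NvOnnxParser.h"

      // Build an engine from an ONNX model (for example, one converted to mixed FP32-FP16
      // precision by ModelOpt's AutoCast) with TensorRT's strong typing mode.
      bool buildStronglyTypedEngine(nvinfer1::ILogger& logger, char const* onnxPath)
      {
          auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
          auto const flags
              = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED);
          auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(flags));
          auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
          if (!parser->parseFromFile(onnxPath, static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
          {
              return false;
          }
          auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
          auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
              builder->buildSerializedNetwork(*network, *config));
          return serialized != nullptr;
      }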

Bug Fixes and Performance

  • Windows GPU Support: Support for B200 and B300 GPUs on Windows is no longer considered experimental.

  • Memory Leak Fix: Fixed a host memory leak issue when building TensorRT engines on NVIDIA Blackwell GPUs.

  • Fused Multi-Head Attention (MHA): The fused MHA implementation now supports multiple pointwise inputs, and a bug that previously prevented more than one IAttention layer in a network has been fixed.

  • Performance Optimizations on Blackwell GPUs: Fixed multiple performance regressions including:

    • An up to 9% regression on B300 compared to B200 for FLUX with FP16 precision

    • An up to 24% regression on GB200 for ResNext-50 FP8 models when using CUDA 13.0

    • An up to 25% regression on GB200 for ConvNets with GlobalAveragePool operation like EfficientNet

    • An up to 10% regression on GB200 for BERT in FP16 precision

  • Python API Performance: Fixed an up to 40% performance regression when calling set_input_shape through the Python bindings.

Breaking ABI Changes#

  • The object files, and therefore the symbols, inside the static library libonnx_proto.a have been merged into the libnvonnxparser_static.a static library. A symlink has been created for backward compatibility. Migrate your build to use the dynamic library libnvonnxparser.so instead; both libonnx_proto.a and libnvonnxparser_static.a will be removed in TensorRT 11.0.

  • The TensorRT Windows library files (*.dll) were previously located under the lib subdirectory within the TensorRT zip package. These files are now located under the bin subdirectory, which is a more common packaging scheme for Windows.

Compatibility#

Limitations#

  • The high-precision weights used in FP4 double quantization are not refittable.

  • The Python samples do not support Python 3.13; only the Python 3.13 bindings themselves are currently supported.

  • Loops with scan outputs (ILoopOutputLayer with the LoopOutput property being either LoopOutput::kCONCATENATE or LoopOutput::kREVERSE) must have the number of iterations set, that is, they must have an ITripLimitLayer with TripLimit::kCOUNT. This requirement has always existed, but it is now explicitly enforced instead of quietly resulting in undefined behavior. See the sketch after this list.

  • ISelectLayer must have data inputs (thenInput and elseInput) of the same datatype.

  • When implementing a custom layer using IPluginV3 plugin class where the custom layer has data-dependent shape (DDS), the size tensors must be of only INT64 type and not INT32 type, as the latter would result in a compilation failure. Related samples have been updated accordingly.

  • There are no optimized FP8 Convolutions for Group Convolutions; therefore, INT8 is still recommended for ConvNets containing these convolution ops.

  • In some cases, a Shuffle op cannot be converted into a no-op for a performance improvement. For the NCHW32 format, TensorRT treats the third-to-last dimension as the channel dimension. When a Shuffle op such as [N, C, H, 1] -> [N, C, H] is added, the channel dimension changes from C to N, so the op cannot be converted into a no-op.

  • When running an FP32 model in FP16 or BF16 weakly typed mode on Blackwell GPUs, if the FP32 weight values are used by FP16 kernels, TensorRT does not clip the weights to [fp16_lowest, fp16_max] or [bf16_lowest, bf16_max], so overflow to inf values can occur. If you see inf graph outputs only on Blackwell GPUs, check whether any FP32 weights cannot be represented in either FP16 or BF16, and update the weights.

  • The FP8 Convolutions on GPUs with SM89/90/120/121 do not support kernel sizes larger than 32 (for example, 7x7 convolutions); FP16 or FP32 fallback kernels will be used with suboptimal performance. Therefore, do not add FP8 Q/DQ ops before Convolutions with large kernel sizes for better performance.

  • When building the nonZeroPlugin sample on Windows, you might need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.
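
    For the scan-output requirement above, the following C++ sketch shows the intended wiring with the loop API. The tensor contents and shapes are placeholders; the point is the kCOUNT trip limit paired with the kCONCATENATE output.

      #include <cstdint>
      #include "NvInfer.h"

      // A loop with a scan output (kCONCATENATE or kREVERSE) must also carry a kCOUNT trip limit.
      void addScanLoop(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& initialState)
      {
          nvinfer1::ILoop* loop = network.addLoop();

          // Required: a counted trip limit so the number of iterations is known.
          static int32_t const kIterations = 8; // constant data must outlive the engine build
          nvinfer1::ITensor* count = network.addConstant(nvinfer1::Dims{0, {}},
              nvinfer1::Weights{nvinfer1::DataType::kINT32, &kIterations, 1})->getOutput(0);
          loop->addTripLimit(*count, nvinfer1::TripLimit::kCOUNT);

          // Trivial loop body: a recurrence that feeds its own output back as the next value.
          nvinfer1::IRecurrenceLayer* recurrence = loop->addRecurrence(initialState);
          recurrence->setInput(1, *recurrence->getOutput(0));

          // Scan output concatenated along axis 0. Without the kCOUNT trip limit above, this
          // configuration is now rejected instead of silently producing undefined behavior.
          nvinfer1::ILoopOutputLayer* scan
              = loop->addLoopOutput(*recurrence->getOutput(0), nvinfer1::LoopOutput::kCONCATENATE, 0);
          scan->setInput(1, *count); // length of the concatenation dimension
          network.markOutput(*scan->getOutput(0));
      }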

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 10.15 will be retained until 1/2027.

  • APIs deprecated in TensorRT 10.14 will be retained until 10/2026.

  • APIs deprecated in TensorRT 10.13 will be retained until 8/2026.

  • APIs deprecated in TensorRT 10.12 will be retained until 6/2026.

  • APIs deprecated in TensorRT 10.11 will be retained until 5/2026.

  • APIs deprecated in TensorRT 10.10 will be retained until 4/2026.

  • APIs deprecated in TensorRT 10.9 will be retained until 3/2026.

  • APIs deprecated in TensorRT 10.8 will be retained until 2/2026.

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • The TensorRT static libraries are deprecated on Linux starting with TensorRT 10.11. If you are using the static libraries for building your application, migrate to building your application with the shared libraries. The following library files will be removed in TensorRT 11.0.

    • libnvinfer_static.a

    • libnvinfer_plugin_static.a

    • libnvinfer_lean_static.a

    • libnvinfer_dispatch_static.a

    • libnvinfer_vc_plugin_static.a

    • libnvonnxparser_static.a

    • libonnx_proto.a

  • INetworkDefinition::addNormalization() is deprecated; use INetworkDefinition::addNormalizationV2(), which correctly accepts scale and bias inputs of shape [numChannels]. See the hedged sketch after this list.

  • The executable files in Debian and RPM packages are now installed to /usr/bin instead of /usr/src/tensorrt/bin/. To maintain backwards compatibility, the /usr/src/tensorrt/bin/ directory contains symlinks to the binaries in /usr/bin.
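
    These notes name addNormalizationV2 but not its signature. The sketch below assumes it mirrors the deprecated addNormalization (input, scale, bias, axesMask) while accepting scale and bias of shape [numChannels]; treat the call shape as an assumption rather than the documented API.

      #include <cstdint>
      #include "NvInfer.h"

      // Hedged migration sketch: addNormalizationV2 is assumed to take the same arguments as the
      // deprecated addNormalization; per the note above, scale and bias may have shape [numChannels].
      void addChannelwiseNormalization(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input,
          nvinfer1::ITensor& scale, // shape [numChannels]
          nvinfer1::ITensor& bias)  // shape [numChannels]
      {
          // Normalize over the spatial axes of an NCHW tensor (axes 2 and 3).
          uint32_t const axesMask = (1U << 2) | (1U << 3);
          auto* norm = network.addNormalizationV2(input, scale, bias, axesMask);
          norm->setEpsilon(1e-5F);
      }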

Fixed Issues#

  • Support for B200 and B300 GPUs on Windows is no longer considered experimental and is now fully supported.

  • Fixed a host memory leak issue when building TensorRT engines on NVIDIA Blackwell GPUs.

  • Fixed an issue where Fused Multi-Head Attention (MHA) implementation did not support multiple pointwise inputs.

  • Fixed a bug that previously prohibited users from having more than one IAttention in the INetwork.

  • Fixed an up to 9% performance regression on B300 compared to B200 for FLUX with FP16 precision. The regression did not exist for FLUX in FP8 or NVFP4 precisions.

  • Fixed an up to 24% performance regression on GB200 for ResNext-50 FP8 models when using CUDA 13.0 compared to CUDA 12.9 and TensorRT 10.13.0.

  • Fixed an up to 40% performance regression with set_input_shape from Python binding.

  • Fixed an up to 25% performance regression on GB200 for ConvNets with GlobalAveragePool operation like EfficientNet.

  • Fixed an up to 10% performance regression on GB200 for BERT in FP16 precision.

  • Fixed an issue where PluginV3 AOT compilation would fail in single_tactic mode due to a missing tactic implementation.

Known Issues#

Functional

  • Running TensorRT with compute-sanitizer may report a free-before-alloc error. Typically, this does not result in actual failures, and a fix is currently under investigation.

  • Running the demoDiffusion sample on Horizon with the Flux.1 Dev model using the FP16 precision may cause the system to reboot. The current workaround is to add the --batch-size 1 flag to the command to run successfully. This issue is under investigation.

  • FP16 Multi-Head Attention (MHA) layers are not fused when building engines with dynamic shapes, resulting in up to 10% performance regression compared to TensorRT 10.13. This issue only affects models using dynamic shapes with FP16 MHA layers.

  • The SmallTileGEMM_TRT plugin (version 1) may not be found on QNX (Drive Orin) with CUDA 11.4, resulting in Error Code 4: API Usage Error (Cannot find plugin: SmallTileGEMM_TRT, version: 1). This prevents engines using this plugin from running on the affected platform.

  • The demoDiffusion inference pipeline only supports driver versions r580 and newer. The workaround to support lower driver versions is to install the CUDA/Python version that is compatible with the desired driver version.

  • On Windows, the nvinfer_builder_resource_sm100_10.dll library may fail to load or execute correctly, causing engine build failures for Blackwell GPUs (SM100). This issue is under investigation.

  • Valgrind may show Invalid read of size 8 when calling cuMemcpyDtoHAsync_v2 and using CUDA 13.0 on edge Blackwell devices.

  • Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shapes will be rejected.

  • On CUDA versions prior to 13.2, the compute-sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled, with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. As a workaround, explicitly set the input formats of such elementwise layers to different tensor formats.

  • The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows.

  • When linking with libcudart_static.a using a RedHat gcc-toolset-11 or earlier compiler, exception handling may not work: when an exception is thrown, the catch handler is ignored and abort is raised, terminating the program. This can be related to a linker bug that leaves the eh_frame_hdr ELF segment empty. You can work around this issue by using a newer linker, such as the one from gcc-toolset-13.

  • TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if there is inconsistency in the indices specified in the batch_indices input and the actual batch size used.

  • Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.

  • The FLUX Transformer model may produce NaN outputs when run with 2048x2048 spatial dimensions. This can be worked around by using different spatial dimensions or by switching from FP16 to BF16 precision.

  • The ONNX specification of the NonMaxSuppression operation requires the iou_threshold parameter to be in the range [0.0, 1.0]. However, TensorRT does not validate this parameter and will accept values outside of this range; in that case, the engine continues executing as if the value were clamped to the nearest end of the range.

  • PluginV2 in a loop or conditional scope is not supported. Upgrade to the PluginV3 interface as a workaround. This will impact some TensorRT-LLM models with GEMM plugins in a conditional scope.

Performance

  • SegResNet-style models show an approximately 10-20% performance regression. This will be fixed in the next TensorRT release.

  • A non-zero tilingOptimizationLevel might introduce engine build failures for some networks on L4 GPUs.

  • Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refittable engines when convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.