TensorRT 10.16.0 Release Notes#

These Release Notes apply to users of x86 Linux and Windows, and to users of ARM-based Server Base System Architecture (SBSA) CPU cores on Linux. This release includes several fixes from previous TensorRT releases, as well as additional changes.

Announcements#

TensorRT 11.0 is coming soon with powerful new capabilities designed to accelerate your AI inference workflows:

  • Enhanced Developer Experience: Improved ease of use and seamless integration with PyTorch and Hugging Face ecosystems

  • Optimized for High-Growth Workloads: Stronger performance alignment across edge, automotive, and data center deployments

  • Modernized API: To streamline development, TensorRT 11.0 will remove legacy APIs, including the weakly typed APIs, implicit INT8 quantization, IPluginV2, and TREX

    Action Required: We recommend migrating early to strongly typed networks, explicit quantization, IPluginV3, and Nsight Deep Learning Designer.

Key Features and Enhancements#

MoE (Mixture of Experts)

Built-in support for MoE (Mixture of Experts) layers in transformer models using the new IMoELayer API. Currently supported on SM110 with NVFP4 double-quantized weights and FP8 static quantization for activations. For more information, refer to the MoE (Mixture of Experts) section.

Multi-Device Inference (Preview Feature)

Scale inference across multiple GPUs for models that exceed single-device memory or benefit from parallel execution. Enable with PreviewFeature::kMULTIDEVICE_RUNTIME_10_16 in the builder config. For more information, refer to the Multi-Device Inference section.

  • DistCollective: New IDistCollectiveLayer for graph-level distributed collectives via NCCL. Requires Ampere (SM 80) or later.

  • Multi-Device Attention: Context-parallel attention that splits KV sequences across GPUs via setNbRanks on IAttention. BF16 and FP16 only. Requires Blackwell (SM 100) or later. For more information, refer to the Working with Transformers section.

  • Communicator Integration: Attach an existing NCCL communicator via IExecutionContext::setCommunicator() to use multi-device inference with existing distributed workflows (MPI, SLURM).
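The items above can be combined in a short build-and-attach flow. The sketch below is illustrative, not a definitive implementation: `PreviewFeature::kMULTIDEVICE_RUNTIME_10_16` and `setCommunicator` are the names given in these notes, the rest follows the standard TensorRT C++ build API, and communicator creation (for example, via MPI rank discovery and `ncclCommInitRank`) is assumed to happen in the surrounding launcher.

```cpp
#include <memory>

#include "NvInfer.h"
#include <nccl.h>

// Sketch: opt in to the multi-device preview feature at build time and
// attach an existing NCCL communicator at runtime. Error handling is
// elided; `builder`, `network`, `runtime`, and `comm` are assumed to be
// set up by the surrounding (e.g., MPI- or SLURM-launched) application.
void enableMultiDevice(nvinfer1::IBuilder& builder,
                       nvinfer1::INetworkDefinition& network,
                       nvinfer1::IRuntime& runtime,
                       ncclComm_t comm)
{
    std::unique_ptr<nvinfer1::IBuilderConfig> config{builder.createBuilderConfig()};
    // Opt in to the preview feature named in these notes.
    config->setPreviewFeature(nvinfer1::PreviewFeature::kMULTIDEVICE_RUNTIME_10_16, true);

    std::unique_ptr<nvinfer1::IHostMemory> plan{
        builder.buildSerializedNetwork(network, *config)};
    std::unique_ptr<nvinfer1::ICudaEngine> engine{
        runtime.deserializeCudaEngine(plan->data(), plan->size())};
    std::unique_ptr<nvinfer1::IExecutionContext> context{
        engine->createExecutionContext()};

    // Reuse the communicator created by the existing distributed workflow.
    context->setCommunicator(comm);
}
```

Attaching the communicator on the execution context, rather than at build time, lets the same engine plan be distributed to each rank while each process binds its own NCCL handle.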

Interactive Sample Explorer

A new interactive Sample Explorer provides filtered, searchable access to all TensorRT samples. Browse samples by difficulty, language, or use case to find the right starting point for your project.

Developer Tools

  • API Capture and Replay Multi-Network Support: TensorRT API Capture and Replay now supports capturing multiple networks within a single process. This enables recording and replaying applications that create and build multiple TensorRT networks, such as ensemble models or multi-stage inference pipelines. For more information, refer to the Multi-Network Support section in the TensorRT API Capture and Replay documentation.

API Enhancements

Breaking ABI Changes#

  • The object files (and therefore the symbols) inside the static library libonnx_proto.a have been merged into the libnvonnxparser_static.a static library. A symlink has been created for backward compatibility. Migrate your build to the dynamic library libnvonnxparser.so; both libonnx_proto.a and libnvonnxparser_static.a will be removed in TensorRT 11.0.

  • The TensorRT Windows library files (*.dll) were previously located under the lib subdirectory of the TensorRT zip package. They are now located under the bin subdirectory, which is the more common packaging scheme for Windows.

Compatibility#

For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT 10.16.0.

Limitations#

  • The high-precision weights used in FP4 double quantization are not refittable.

  • Python samples do not support Python 3.13; only the Python bindings currently support Python 3.13.

  • Loops with scan outputs (ILoopOutputLayer with the LoopOutput property set to either LoopOutput::kCONCATENATE or LoopOutput::kREVERSE) must have the number of iterations set, that is, they must have an ITripLimitLayer with TripLimit::kCOUNT. This requirement has always existed, but it is now explicitly enforced instead of silently resulting in undefined behavior.

  • ISelectLayer must have data inputs (thenInput and elseInput) of the same data type.

  • When implementing a custom layer using the IPluginV3 plugin class where the custom layer has data-dependent shapes (DDS), the size tensors must be of type INT64, not INT32; INT32 size tensors result in a build failure. Related samples have been updated accordingly.
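The relevant corner for DDS plugins is the output-type declaration. The fragment below is a hedged sketch of an IPluginV3OneBuild method, assuming the standard interface signature; output indices and surrounding plugin code are illustrative.

```cpp
// Inside an IPluginV3OneBuild implementation with one data output
// (index 0) and one size tensor (index 1) driving a data-dependent
// dimension of the data output.
int32_t getOutputDataTypes(nvinfer1::DataType* outputTypes, int32_t nbOutputs,
                           nvinfer1::DataType const* inputTypes,
                           int32_t nbInputs) const noexcept override
{
    // The data output inherits the input type.
    outputTypes[0] = inputTypes[0];
    // The size tensor must be INT64; declaring kINT32 here fails at
    // build time, per the limitation above.
    outputTypes[1] = nvinfer1::DataType::kINT64;
    return 0;
}
```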

  • There are no optimized FP8 Convolutions for Group Convolutions; therefore, INT8 is still recommended for ConvNets containing these convolution ops.

  • A Shuffle op cannot be transformed into a no-op for performance in some cases. For the NCHW32 format, TensorRT takes the third-to-last dimension as the channel dimension. When a Shuffle op such as [N, C, H, 1] -> [N, C, H] is added, the channel dimension changes from C to N, so the op cannot be transformed into a no-op.

  • When running an FP32 model in FP16 or BF16 weakly typed mode on Blackwell GPUs, TensorRT does not clip FP32 weight values to [fp16_lowest, fp16_max] or [bf16_lowest, bf16_max] before they are consumed by FP16 kernels, so overflow to inf can occur. If you see inf graph outputs only on Blackwell GPUs, check whether any FP32 weights cannot be represented by either FP16 or BF16, and update the weights.

  • The FP8 convolutions on GPUs with SM89/90/120/121 do not support kernel sizes larger than 32 elements (for example, 7x7 convolutions); FP16 or FP32 fallback kernels will be used with suboptimal performance. Therefore, for better performance, do not add FP8 Q/DQ ops before convolutions with large kernel sizes.

  • When building the nonZeroPlugin sample on Windows, you might need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 10.16 will be retained until 3/2027.

  • APIs deprecated in TensorRT 10.15 will be retained until 1/2027.

  • APIs deprecated in TensorRT 10.14 will be retained until 10/2026.

  • APIs deprecated in TensorRT 10.13 will be retained until 8/2026.

  • APIs deprecated in TensorRT 10.12 will be retained until 6/2026.

  • APIs deprecated in TensorRT 10.11 will be retained until 5/2026.

  • APIs deprecated in TensorRT 10.10 will be retained until 4/2026.

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • The TensorRT static libraries are deprecated on Linux starting with TensorRT 10.11. If you are using the static libraries to build your application, migrate to the shared libraries. The following library files will be removed in TensorRT 11.0.

    • libnvinfer_static.a

    • libnvinfer_plugin_static.a

    • libnvinfer_lean_static.a

    • libnvinfer_dispatch_static.a

    • libnvinfer_vc_plugin_static.a

    • libnvonnxparser_static.a

    • libonnx_proto.a

  • INetworkDefinition::addNormalization() is deprecated; use INetworkDefinition::addNormalizationV2(), which correctly accepts scale and bias inputs of shape [numChannels].
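A migration sketch for the item above, assuming addNormalizationV2 keeps the original call shape with scale and bias now supplied as 1-D tensors of shape [numChannels]; the tensor names and axesMask value are illustrative.

```cpp
// Before (deprecated):
// auto* norm = network->addNormalization(*input, *scale, *bias, axesMask);

// After: `scale` and `bias` are 1-D tensors of shape [numChannels].
auto* norm = network->addNormalizationV2(*input, *scale, *bias, axesMask);
norm->setEpsilon(1e-5f);
```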

  • The executable files in Debian and RPM packages are now installed to /usr/bin instead of /usr/src/tensorrt/bin/. To maintain backwards compatibility, the /usr/src/tensorrt/bin/ directory contains symlinks to the binaries in /usr/bin.

  • The sampleCharRNN sample is deprecated with no planned alternative.

Fixed Issues#

  • Fixed an issue where EinsumLayer::setEquation always returned false. Einsum layer equation validation now returns the correct result.

  • Fixed an issue where FP16 Multi-Head Attention (MHA) layers were not fused when building engines with dynamic shapes, which caused up to a 10% performance regression compared to TensorRT 10.13.

  • Fixed a performance issue where SegResNet-style models had a 10-20% performance regression.

  • Fixed a free-before-alloc issue when running TensorRT with compute-sanitizer.

  • Fixed an issue where running the demoDiffusion sample on Horizon with the Flux.1 Dev model using FP16 precision could cause the system to reboot.

  • Updated the demoDiffusion sample README to document driver version compatibility requirements for the inference pipeline.

Known Issues#

Functional

  • The ECCommunicatorAPITests.SetCommunicatorFailsWithoutSupportedLayer and ECCommunicatorAPITests.SetCommunicatorSucceedsWithDistCollective runtime tests report errors under valgrind (host) and compute-sanitizer memcheck (GPU). Host-side memory leaks are caused by NCCL internal allocations during ncclCommInitRank and ncclCommSplit and reproduce on H100 and B100 platforms. GPU-side errors are NCCL kernel probing failures that occur on H100 (SM 90) when the sanitizer intercepts CUDA API errors before NCCL can handle them internally; B100 is not affected on the GPU side.

  • The SmallTileGEMM_TRT plugin (version 1) may not be found on QNX (Drive Orin) with CUDA 11.4, resulting in Error Code 4: API Usage Error (Cannot find plugin: SmallTileGEMM_TRT, version: 1). This prevents engines using this plugin from running on the affected platform.

  • Valgrind may show Invalid read of size 8 when calling cuMemcpyDtoHAsync_v2 and using CUDA 13.0 on edge Blackwell devices.

  • Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shapes will be rejected.

  • On CUDA versions prior to 13.2, the compute-sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled, with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.

  • The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows.

  • TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if the indices specified in the batch_indices input are inconsistent with the actual batch size used.

  • The ONNX specification of the NonMaxSuppression operation requires the iou_threshold parameter to be in the range [0.0, 1.0]. However, TensorRT does not validate this parameter and will accept values outside of the range, in which case the engine will continue executing as if the value were capped at the nearest end of the range.

  • PluginV2 in a loop or conditional scope is not supported. Upgrade to the PluginV3 interface as a workaround. This impacts some TensorRT-LLM models with GEMM plugins in a conditional scope.

Performance

  • EfficientNet/RegNet has a ~5-10% performance regression on the RTX PRO 6000 Blackwell platform.

  • A non-zero tilingOptimizationLevel might introduce engine build failures for some networks on L4 GPUs.

  • Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refittable engines when convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.