TensorRT 11.1.0 Release Notes#
These Release Notes apply to x86 Linux and Windows users, and ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux.
Announcements#
CUDA 13.3 dependency upgrade: Updates the CUDA Toolkit baseline to CUDA 13.3 across Linux x86-64, Windows x64, and SBSA platforms. Refer to the TensorRT Support Matrix for the per-platform CUDA version pinning and to Prerequisites for installer prerequisites.
Ubuntu 26.04 support: Adds Ubuntu 26.04 LTS to the supported Linux x86-64 and SBSA platform lists alongside the existing Ubuntu 22.04/24.04 packages. Refer to Method 2: Debian Package Installation and Method 4: Tar File Installation for the full Linux x86-64 and SBSA distribution lists and to the TensorRT Support Matrix for the per-platform compiler, glibc, and Python tuples.
Python 3.14 bindings: Extends the Python wheel matrix to Python 3.14 on supported platforms. Refer to Method 1: Python Package Index (pip) for installing the Python 3.14 wheel and to the TensorRT Support Matrix for the per-platform Python version table.
Key Features and Enhancements#
MoE (Mixture of Experts)
NVFP4 dual-GEMM fusion (gate + up projection) for SM121: Fuses the gate and up projection GEMMs in NVFP4 MoE/MLP blocks on NVIDIA DGX Spark (compute capability 12.1). For more information, refer to the MoE (Mixture of Experts) section.
Performance
Global Performance Tuner: Adds automated end-to-end performance tuning through trtexec. The tuner searches build routes to explore internal builder knobs, benchmarks candidate engines, and optionally validates accuracy before selecting the fastest valid route. Refer to Global Performance Tuning.
Breaking ABI Changes#
TensorRT 11.0 removed weak typing APIs along with several other deprecated APIs. Refer to the NVIDIA TensorRT Migration Guide for more details.
Breaking Behavioral Changes#
trtexec
The --useCudaGraph, --noDataTransfers, --useSpinWait, and --separateProfileRun flags were enabled by default and deprecated in TensorRT 11.0. Each flag is still accepted but has no effect. For replacement flags and details, refer to the trtexec migration page.
The --stronglyTyped flag has no effect in TensorRT 11.0.0+, since strongly typed networks are now the default. The flag is still accepted for backward compatibility.
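Because these flags are now no-ops, existing launch scripts can drop them. The following is a minimal sketch, not an official tool, of filtering them out of a command line before invoking trtexec. The helper name strip_noop_flags and the example command are assumptions made for illustration; only the flag names come from these notes. The sketch does not add any replacement flags, which are documented on the trtexec migration page.

```python
import shlex

# Flags these release notes describe as accepted but ignored since TensorRT 11.0.
NOOP_FLAGS = {
    "--useCudaGraph",
    "--noDataTransfers",
    "--useSpinWait",
    "--separateProfileRun",
    "--stronglyTyped",
}


def strip_noop_flags(args):
    """Split args into (kept, dropped), where dropped holds the no-op flags.

    Hypothetical helper: it matches on the flag name only, so a
    "--flag=value" spelling (if a script uses one) is dropped as well.
    """
    kept, dropped = [], []
    for arg in args:
        name = arg.split("=", 1)[0]
        (dropped if name in NOOP_FLAGS else kept).append(arg)
    return kept, dropped


# Example command line (illustrative only).
command = (
    "trtexec --onnx=model.onnx --useCudaGraph "
    "--stronglyTyped --stronglyTyped --useSpinWait"
)
kept, dropped = strip_noop_flags(shlex.split(command))
print(shlex.join(kept))  # trtexec --onnx=model.onnx
```

Removing every occurrence of --stronglyTyped also sidesteps the parser error listed under Known Issues, which is triggered when that flag appears more than once.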
Compatibility#
For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT 11.1.0.
Limitations#
DLA is not supported in TensorRT 11.0 or 11.1; DLA support will be reintroduced in a later minor version update. Refer to Working with DLA for build and runtime guidance when DLA becomes available again.
The high-precision weights used in FP4 double quantization are not refittable.
Version Compatible engine builds from ONNX models with explicit quantization/dequantization (Q/DQ) nodes may fail if the Q/DQ nodes apply per-channel scaling to a convolution filter.
Python samples do not support Python 3.13 or 3.14. TensorRT Python bindings for Python 3.13 and 3.14 are available on supported platforms listed in the TensorRT Support Matrix, but Windows Python 3.14 support is preliminary. On Windows, Python 3.14 applications that use NumPy or other native extensions must install compatible package builds; incompatible or experimental builds can crash during import before TensorRT is loaded. Refer to Method 1: Python Package Index (pip) for installing the Python wheel.
ISelectLayer must have data inputs (thenInput and elseInput) of the same datatype.
When implementing a custom layer with the IPluginV3 plugin class where the custom layer has a data-dependent shape (DDS), the size tensors must be of type INT64, not INT32; INT32 size tensors result in a compilation failure. Related samples have been updated accordingly.
Deprecated API Lifetime#
APIs deprecated in TensorRT 11.1 will be retained until 6/2027.
APIs deprecated in TensorRT 11.0.0 will be retained until 3/2027.
APIs deprecated in TensorRT 10.16 will be retained until 3/2027.
APIs deprecated in TensorRT 10.15 will be retained until 1/2027.
APIs deprecated in TensorRT 10.14 will be retained until 10/2026.
APIs deprecated in TensorRT 10.13 will be retained until 8/2026.
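The retention schedule above can be turned into a simple lookup when auditing a codebase against a planned upgrade date. This is a minimal sketch: the function name is_retained is hypothetical, and it assumes that "retained until M/YYYY" means the API is still present through the end of that month, which these notes do not state explicitly.

```python
from datetime import date

# (month, year) retention deadlines copied from the list above.
RETAINED_UNTIL = {
    "11.1": (6, 2027),
    "11.0": (3, 2027),
    "10.16": (3, 2027),
    "10.15": (1, 2027),
    "10.14": (10, 2026),
    "10.13": (8, 2026),
}


def is_retained(deprecated_in, today):
    """Return True if `today` falls in or before the retention month.

    Assumption: the deadline month itself is treated as still covered.
    """
    month, year = RETAINED_UNTIL[deprecated_in]
    return (today.year, today.month) <= (year, month)


print(is_retained("10.13", date(2026, 8, 31)))  # True
print(is_retained("10.13", date(2026, 9, 1)))   # False
print(is_retained("11.1", date(2027, 1, 15)))   # True
```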
See also
- Migrating from TensorRT 10.x to 11.x
Step-by-step migration guide for upgrading from TensorRT 10.x.
- Upgrading TensorRT
Package-level upgrade instructions (pip, Debian, RPM, tar, zip).
Deprecated and Removed Features#
For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.
Windows 10 support is considered deprecated and Windows 10 will no longer be supported by TensorRT after October 2026. Windows 10 has been End-of-Life since October 2025. GeForce driver support for Windows 10 will also be reduced at that time. For more information, refer to GeForce Support Plan for Windows 10.
The TensorRT Python bindings for Python versions 3.8 and 3.9 are deprecated. These Python versions are considered End-of-Life by Python upstream and have not been supported by TensorRT’s samples for multiple releases. These bindings will be removed in a future release.
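The Python version notes in this release (deprecated bindings for 3.8 and 3.9, and no sample support on 3.13 or 3.14) can be checked at startup. The sketch below is illustrative only: python_status and its return strings are hypothetical names, and a result of "no-note" means only that these release notes say nothing specific about that version, not that the version is supported on a given platform. The TensorRT Support Matrix remains the authoritative source.

```python
import sys

DEPRECATED_BINDINGS = {(3, 8), (3, 9)}    # bindings deprecated, removal planned
SAMPLES_UNSUPPORTED = {(3, 13), (3, 14)}  # bindings available, samples not supported


def python_status(version=None):
    """Classify a (major, minor, ...) version tuple against the notes above.

    Defaults to the running interpreter when no version is given.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor in DEPRECATED_BINDINGS:
        return "bindings-deprecated"
    if major_minor in SAMPLES_UNSUPPORTED:
        return "samples-unsupported"
    return "no-note"


print(python_status((3, 9)))   # bindings-deprecated
print(python_status((3, 14)))  # samples-unsupported
print(python_status((3, 12)))  # no-note
```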
Fixed Issues#
Fixed an accuracy issue for broadcasting elementwise layers running on DLA with GPU fallback enabled when one input has shape NxCxHxW, the other has shape Nx1x1x1, and at least one input uses kDLA_LINEAR format.
Fixed ECCommunicatorAPITests.SetCommunicatorWorksWithAllAllocationStrategies runtime test timeouts on B300 platforms when running against NCCL 2.29.4. The timeouts were caused by a long (~21–22 second) cold-initialization latency in the first ncclCommInitRank call during NCCL’s “kernels init” phase; the behavior reproduced with an NCCL-only reproducer and was not caused by TensorRT runtime logic. NCCL 2.30.x reduces first-init latency to under one second, resolving the timeout. This addresses the B300 NCCL cold-init latency described in the TensorRT 11.0.0 release notes Known Issues. For NCCL minimum versions and installation, refer to Prerequisites and the Support Matrix.
Fixed ONNX Runtime TensorRT Execution Provider compilation against TensorRT 11.x. ONNX Runtime releases 1.24.4 and earlier could not compile the provider because they still referenced APIs removed in TensorRT 11.0, namely IBuilder::platformHasFastFp16(), IBuilder::platformHasFastInt8(), and IBuilderConfig::setInt8Calibrator() (removed as part of the weak-typing and implicit-quantization cleanup; refer to the NVIDIA TensorRT Migration Guide and the C++ API migration reference). ONNX Runtime 1.27 and later adopts the TensorRT 11.x API and supports TensorRT 11.0. This addresses the provider compilation failure described in the TensorRT 11.0.0 release notes Known Issues. For the direct ONNX-TRT build path (without ONNX Runtime), refer to Example Deployment Using ONNX.
Fixed a ~5% to 10% EfficientNet/RegNet GPU inference performance regression on the RTX PRO 6000 Blackwell platform. This addresses the performance regression described in the TensorRT 11.0.0 release notes Known Issues.
Fixed a crash in the IPluginV3 creator dispatch path during build or deserialization when a plugin advertised a PluginField with a nullptr data pointer and length == 0. Empty PluginField entries with null data are now handled safely, so plugin authors no longer need to populate unused fields with a non-null sentinel buffer. For background on related V3 migration pitfalls, refer to Known Migration Issues.
Fixed broken in-document hyperlinks in the README.md of the OSS samples/python/quickly_deployable_plugins Python sample, including references to the circular padding plugin section and other anchor targets. Cross-references in the README now resolve to the intended sections.
Known Issues#
Functional
The ECCommunicatorAPITests.SetCommunicatorFailsWithoutSupportedLayer and ECCommunicatorAPITests.SetCommunicatorSucceedsWithDistCollective runtime tests report errors under valgrind (host) and compute-sanitizer memcheck (GPU). Host-side memory leaks are caused by NCCL internal allocations during ncclCommInitRank and ncclCommSplit and reproduce on H100 and B100 platforms. GPU-side errors are NCCL kernel probing failures that occur on H100 (SM 90) when the sanitizer intercepts CUDA API errors before NCCL can handle them internally; B100 is not affected on the GPU side.
On CUDA versions prior to 13.2, the compute-sanitizer initcheck tool may flag Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors are false positives and can be safely ignored. To suppress them, upgrade to CUDA 13.2 or later.
Running TensorRT applications under Valgrind memcheck on CUDA 13.3 may report host-side memory leaks attributed to NVRTC components used during engine build and runtime compilation. These reports are not known to affect normal inference behavior. This will be addressed in a future CUDA or TensorRT release.
Running TensorRT inference under compute-sanitizer memcheck may report device memory errors during execution or engine teardown. These reports are not known to affect normal inference behavior when sanitizers are not enabled. This will be addressed in a future release.
The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows.
TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if there is inconsistency in the indices specified in the batch_indices input and the actual batch size used.
trtexec exits with [E] Unknown option: --stronglyTyped when the deprecated --stronglyTyped flag is specified more than once on the command line. The first occurrence is accepted as a deprecated no-op, but the argument parser rejects subsequent occurrences instead of also treating them as no-ops. As a workaround, ensure --stronglyTyped appears at most once, or omit it entirely since strongly typed networks are now the default in TensorRT 11.0.0. This will be fixed in a future release.
On Windows, the TensorRT 11.0.0 runtime can fail to deserialize a Version Compatible engine that was built with TensorRT 10.1.0. Deserialization fails inside the runtime dispatch VC plugin loader when loading nvinfer_vc_plugin.dll (Windows ERROR_MOD_NOT_FOUND / error 126), surfaces as TensorRT error code 6 (API Usage Error), and leaves the returned ICudaEngine pointer null. Later TensorRT 10.x builders are not affected; only engines produced by TensorRT 10.1.0 have been observed to trigger this failure. As a workaround, rebuild the affected VC engines with a newer TensorRT 10.x release (10.2 or later) before deserializing them with the TensorRT 11.0.0 runtime on Windows. This will be addressed in a future release.
On NVIDIA Blackwell B100, some networks that use attention patterns can have invalid memory access or fail during IExecutionContext initialization with a CUDA driver error reporting CUDA_ERROR_INVALID_HANDLE (“Cannot pass CUkernel handle to this API”). The failure has been observed on B100 with Ubuntu 24.04 and CUDA 13.2.1. This will be addressed in a future release.
On Windows, Version Compatible engines built with TensorRT 10.1 through 10.4 that use the RoiAlign plugin (ROIAlign_TRT) cannot be deserialized with the TensorRT 11.x runtime due to a bug in how those engines were produced. The bug was fixed in TensorRT 10.5. Engines built with TensorRT 10.5 or later are expected to deserialize correctly. As a workaround, rebuild affected engines with TensorRT 10.5 or later before loading them with TensorRT 11.x.
Building some explicitly quantized FP8 networks with convolution layers that use large spatial strides (such as vision-transformer patch-embedding layers) may fail during engine build with no available tactics. This has been observed on NVIDIA Blackwell platforms. As a workaround, keep affected convolution layers in a higher-precision format rather than FP8 until fallback support is available. This will be addressed in a future release.
Inspecting very large engines with the Engine Inspector may fail on systems with limited host memory. The inspection process can exceed available memory and be terminated by the operating system without a TensorRT error message. This has been observed with very large diffusion-model engines. As a workaround, run engine inspection on a system with sufficient host memory. This will be addressed in a future release.
Some explicitly quantized NVFP4 diffusion transformer models may produce all-NaN outputs during inference on NVIDIA Blackwell platforms, resulting in incorrect results compared with the reference output. This will be addressed in a future release.
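The two Windows Version Compatible deserialization issues in this list both come down to which TensorRT 10.x release built the engine. The following sketch, written under stated assumptions, encodes only those two rules and adds a guard for the null-engine outcome. The helper names are hypothetical, the builder version is assumed to be tracked by the application (these notes do not describe a way to query it), and load_engine assumes a runtime object whose deserialize_cuda_engine method returns None on failure.

```python
def parse_version(text):
    """Turn "10.4.0" into (10, 4, 0); missing parts default to 0."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))[:3]


def min_rebuild_version(builder_version, uses_roialign):
    """Return the minimum TensorRT 10.x release to rebuild with, or None.

    Covers only the two Windows Version Compatible cases in this list:
    ROIAlign_TRT engines built with 10.1 through 10.4, and any engine
    built with 10.1.0.
    """
    version = parse_version(builder_version)
    if uses_roialign and (10, 1, 0) <= version < (10, 5, 0):
        return "10.5"
    if version == (10, 1, 0):
        return "10.2"
    return None


def load_engine(runtime, blob):
    """Deserialize an engine and fail loudly instead of returning None."""
    engine = runtime.deserialize_cuda_engine(blob)
    if engine is None:
        raise RuntimeError(
            "Engine deserialization returned no engine; if this is a Version "
            "Compatible engine on Windows, check the builder version."
        )
    return engine


print(min_rebuild_version("10.1.0", uses_roialign=False))  # 10.2
print(min_rebuild_version("10.3.0", uses_roialign=True))   # 10.5
print(min_rebuild_version("10.6.0", uses_roialign=True))   # None
```

Raising on a null engine keeps the failure at the load site rather than letting it surface later as an attribute error on None.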
Performance
A non-zero tilingOptimizationLevel might introduce engine build failures for some networks on L4 GPUs.
Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refit engines when convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in a future release.
On RTX PRO 6000 Blackwell Max-Q (sm120, aarch64), some strongly typed FP16 networks exhibit GPU time regressions compared with TensorRT 10.16.1. The slowdown is concentrated in fused-graph kernels, in particular fused Conv/BiasAdd patterns and mean-reduction kernels, and has been observed at up to ~45% on individual networks. The regression is stable across reruns. This issue will be addressed in a future release.
Some networks may use more execution-context GPU memory in TensorRT 11.1 than in TensorRT 11.0 without a corresponding inference speedup. This will be addressed in a future release.
During engine build, some ONNX networks may consume substantially more host (CPU) memory in TensorRT 11.1 than in TensorRT 11.0. An increase of approximately 25–30% has been observed on NVIDIA H100 when building certain speech-synthesis models with TF32 enabled. This will be addressed in a future release.
Some transformer models that use multi-head attention, including vision-transformer architectures, may run slower on GPU in TensorRT 11.x than in TensorRT 10.16 due to additional layout-related overhead during inference. This will be addressed in a future release.