TensorRT 11.1.0 Release Notes#

These Release Notes apply to x86 Linux and Windows users, and to Linux users on ARM-based CPU cores for Server Base System Architecture (SBSA).

Announcements#

Key Features and Enhancements#

MoE (Mixture of Experts)

  • NVFP4 dual-GEMM fusion (gate + up projection) for SM121: Fuses the gate and up projection GEMMs in NVFP4 MoE/MLP blocks on NVIDIA DGX Spark (compute capability 12.1). For more information, refer to the MoE (Mixture of Experts) section.

Performance

  • Global Performance Tuner: Adds automated end-to-end performance tuning via build-route searching through trtexec to explore internal builder knobs, benchmark candidate engines, and optionally validate accuracy before selecting the fastest valid route. Refer to Global Performance Tuning.

Breaking ABI Changes#

Breaking Behavioral Changes#

trtexec

  • The --useCudaGraph, --noDataTransfers, --useSpinWait, and --separateProfileRun flags were enabled by default and deprecated in TensorRT 11.0. Each flag is still accepted but has no effect. For replacement flags and details, refer to the trtexec migration page.

  • The --stronglyTyped flag has no effect in TensorRT 11.0.0+, since strongly typed networks are now the default. The flag is still accepted for backward compatibility.
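Because these flags are accepted but ignored, existing scripts can drop them. The following is a minimal sketch, not part of TensorRT, of a hypothetical helper `strip_no_op_flags` that removes the five flags named above from a trtexec argument list. It does not add any replacement flags; those are described on the trtexec migration page. The file name `model.onnx` is a placeholder.

```python
# Flags that these release notes describe as accepted but having no effect.
NO_OP_FLAGS = frozenset({
    "--useCudaGraph",
    "--noDataTransfers",
    "--useSpinWait",
    "--separateProfileRun",
    "--stronglyTyped",
})


def strip_no_op_flags(args):
    """Return (kept_args, removed_args) for a trtexec argument list.

    Hypothetical helper for illustration only; it is not a TensorRT API.
    """
    kept, removed = [], []
    for arg in args:
        # Compare only the flag name so a "--flag=value" spelling also matches.
        name = arg.split("=", 1)[0]
        if name in NO_OP_FLAGS:
            removed.append(arg)
        else:
            kept.append(arg)
    return kept, removed


if __name__ == "__main__":
    cmd = ["trtexec", "--onnx=model.onnx", "--useCudaGraph",
           "--stronglyTyped", "--stronglyTyped"]
    kept, removed = strip_no_op_flags(cmd)
    print("run:    ", " ".join(kept))
    print("dropped:", " ".join(removed))
```

Removing every occurrence of --stronglyTyped also avoids the repeated-flag parser error listed under Known Issues.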

Compatibility#

For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT 11.1.0.

Limitations#

  • DLA is not supported in TensorRT 11.0 or 11.1; DLA support will be reintroduced in a later minor version update. Refer to Working with DLA for build and runtime guidance when DLA becomes available again.

  • The high-precision weights used in FP4 double quantization are not refittable.

  • Building a Version Compatible engine from an ONNX model with explicit quantization/dequantization (Q/DQ) nodes may fail if the Q/DQ nodes apply per-channel scaling to a convolution filter.

  • Python samples do not support Python 3.13 or 3.14. TensorRT Python bindings for Python 3.13 and 3.14 are available on supported platforms listed in the TensorRT Support Matrix, but Windows Python 3.14 support is preliminary. On Windows, Python 3.14 applications that use NumPy or other native extensions must install compatible package builds; incompatible or experimental builds can crash during import before TensorRT is loaded. Refer to Method 1: Python Package Index (pip) for installing the Python wheel.

  • ISelectLayer must have data inputs (thenInput and elseInput) of the same datatype.

  • When implementing a custom layer with the IPluginV3 plugin class, if the layer has a data-dependent shape (DDS), its size tensors must be of type INT64. INT32 size tensors result in a compilation failure. Related samples have been updated accordingly.
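The ISelectLayer limitation above can be handled while the network is being constructed. The following is a minimal sketch, assuming `network` is a TensorRT `INetworkDefinition` and the inputs are `ITensor` objects, and assuming the `add_cast`, `add_select`, and `get_output` methods behave as in the TensorRT 10.x Python API (not verified here against 11.1). The helper name is hypothetical, and casting the else branch to the then branch's type is only one possible choice; the right target type depends on the model.

```python
def add_select_with_matching_dtypes(network, condition, then_input, else_input):
    """Add a select layer whose two data inputs share one datatype.

    Hypothetical helper: if the datatypes differ, else_input is cast to
    then_input's datatype before the select layer is added. Whether that
    cast direction is acceptable is an assumption the caller must check.
    """
    if else_input.dtype != then_input.dtype:
        # Insert an explicit cast layer and use its output as the else branch.
        cast_layer = network.add_cast(else_input, then_input.dtype)
        else_input = cast_layer.get_output(0)
    return network.add_select(condition, then_input, else_input)
```

The helper only calls methods on the objects it receives, so it can be exercised with stand-in objects before being used on a real network.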

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 11.1 will be retained until 6/2027.

  • APIs deprecated in TensorRT 11.0.0 will be retained until 3/2027.

  • APIs deprecated in TensorRT 10.16 will be retained until 3/2027.

  • APIs deprecated in TensorRT 10.15 will be retained until 1/2027.

  • APIs deprecated in TensorRT 10.14 will be retained until 10/2026.

  • APIs deprecated in TensorRT 10.13 will be retained until 8/2026.
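The retention schedule above can be encoded as a small lookup for use in upgrade planning. This is a minimal sketch: the table copies the dates listed above, the function name `is_retained` is hypothetical, and treating the stated month as inclusive is an assumption, since the list does not say which day of the month applies.

```python
from datetime import date

# (year, month) until which APIs deprecated in each release are retained,
# copied from the list above.
RETAINED_UNTIL = {
    "11.1": (2027, 6),
    "11.0.0": (2027, 3),
    "10.16": (2027, 3),
    "10.15": (2027, 1),
    "10.14": (2026, 10),
    "10.13": (2026, 8),
}


def is_retained(deprecated_in, on):
    """Return True if APIs deprecated in `deprecated_in` are retained on `on`.

    Assumption: the stated month is inclusive, so an API retained
    "until 8/2026" still counts as retained on any day in August 2026.
    """
    year, month = RETAINED_UNTIL[deprecated_in]
    return (on.year, on.month) <= (year, month)


if __name__ == "__main__":
    for release in RETAINED_UNTIL:
        print(release, is_retained(release, date(2026, 12, 1)))
```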

See also

Migrating from TensorRT 10.x to 11.x

Step-by-step migration guide for upgrading from TensorRT 10.x.

Upgrading TensorRT

Package-level upgrade instructions (pip, Debian, RPM, tar, zip).

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • Windows 10 support is deprecated, and TensorRT will no longer support Windows 10 after October 2026. Windows 10 has been End-of-Life since October 2025. GeForce driver support for Windows 10 will also be reduced at that time. For more information, refer to GeForce Support Plan for Windows 10.

  • The TensorRT Python bindings for Python versions 3.8 and 3.9 are deprecated. These Python versions are considered End-of-Life by Python upstream and have not been supported by TensorRT’s samples for multiple releases. These bindings will be removed in a future release.

Fixed Issues#

  • Fixed an accuracy issue for broadcasting elementwise layers running on DLA with GPU fallback enabled when one input has shape NxCxHxW, the other has shape Nx1x1x1, and at least one input uses kDLA_LINEAR format.

  • Fixed ECCommunicatorAPITests.SetCommunicatorWorksWithAllAllocationStrategies runtime test timeouts on B300 platforms when running against NCCL 2.29.4. The timeouts were caused by a long (~21–22 second) cold-initialization latency in the first ncclCommInitRank call during NCCL’s “kernels init” phase; the behavior reproduced with an NCCL-only reproducer and was not caused by TensorRT runtime logic. NCCL 2.30.x reduces first-init latency to under one second, resolving the timeout. This addresses the B300 NCCL cold-init latency described in the TensorRT 11.0.0 release notes Known Issues. For NCCL minimum versions and installation, refer to Prerequisites and the Support Matrix.

  • Fixed ONNX Runtime TensorRT Execution Provider compilation against TensorRT 11.x. ONNX Runtime releases 1.24.4 and earlier could not compile the provider because they still referenced APIs removed in TensorRT 11.0, namely IBuilder::platformHasFastFp16(), IBuilder::platformHasFastInt8(), and IBuilderConfig::setInt8Calibrator(). These APIs were removed as part of the weak-typing and implicit-quantization cleanup; refer to the NVIDIA TensorRT Migration Guide and the C++ API migration reference. ONNX Runtime 1.27 and later adopt the TensorRT 11.x API and support TensorRT 11.0. This addresses the provider compilation failure described in the TensorRT 11.0.0 release notes Known Issues. For the direct ONNX-TRT build path (without ONNX Runtime), refer to Example Deployment Using ONNX.

  • Fixed a ~5% to 10% EfficientNet/RegNet GPU inference performance regression on the RTX PRO 6000 Blackwell platform. This addresses the performance regression described in the TensorRT 11.0.0 release notes Known Issues.

  • Fixed a crash in the IPluginV3 creator dispatch path during build or deserialization when a plugin advertised a PluginField with a nullptr data pointer and length == 0. Empty PluginField entries with null data are now handled safely, so plugin authors no longer need to populate unused fields with a non-null sentinel buffer. For background on related V3 migration pitfalls, refer to Known Migration Issues.

  • Fixed broken in-document hyperlinks in the README.md of the OSS samples/python/quickly_deployable_plugins Python sample, including references to the circular padding plugin section and other anchor targets. Cross-references in the README now resolve to the intended sections.

Known Issues#

Functional

  • The ECCommunicatorAPITests.SetCommunicatorFailsWithoutSupportedLayer and ECCommunicatorAPITests.SetCommunicatorSucceedsWithDistCollective runtime tests report errors under valgrind (host) and compute-sanitizer memcheck (GPU). Host-side memory leaks are caused by NCCL internal allocations during ncclCommInitRank and ncclCommSplit and reproduce on H100 and B100 platforms. GPU-side errors are NCCL kernel probing failures that occur on H100 (SM 90) when the sanitizer intercepts CUDA API errors before NCCL can handle them internally; B100 is not affected on the GPU side.

  • On CUDA versions prior to 13.2, the compute-sanitizer initcheck tool may flag Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors are false positives and can be safely ignored. To suppress them, upgrade to CUDA 13.2 or later.

  • Running TensorRT applications under Valgrind memcheck on CUDA 13.3 may report host-side memory leaks attributed to NVRTC components used during engine build and runtime compilation. These reports are not known to affect normal inference behavior. This will be addressed in a future CUDA or TensorRT release.

  • Running TensorRT inference under compute-sanitizer memcheck may report device memory errors during execution or engine teardown. These reports are not known to affect normal inference behavior when sanitizers are not enabled. This will be addressed in a future release.

  • The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows.

  • TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if there is inconsistency in the indices specified in the batch_indices input and the actual batch size used.

  • trtexec exits with [E] Unknown option: --stronglyTyped when the deprecated --stronglyTyped flag is specified more than once on the command line. The first occurrence is accepted as a deprecated no-op, but the argument parser rejects subsequent occurrences instead of also treating them as no-ops. As a workaround, ensure --stronglyTyped appears at most once, or omit it entirely since strongly typed networks are now the default in TensorRT 11.0.0. This will be fixed in a future release.

  • On Windows, the TensorRT 11.0.0 runtime can fail to deserialize a Version Compatible engine that was built with TensorRT 10.1.0. Deserialization fails inside the runtime dispatch VC plugin loader when loading nvinfer_vc_plugin.dll (Windows ERROR_MOD_NOT_FOUND / error 126), surfaces as TensorRT error code 6 (API Usage Error), and leaves the returned ICudaEngine pointer null. Later TensorRT 10.x builders are not affected; only engines produced by TensorRT 10.1.0 have been observed to trigger this failure. As a workaround, rebuild the affected VC engines with a newer TensorRT 10.x release (10.2 or later) before deserializing them with the TensorRT 11.0.0 runtime on Windows. This will be addressed in a future release.

  • On NVIDIA Blackwell B100, some networks that use attention patterns can trigger an invalid memory access or fail during IExecutionContext initialization with a CUDA driver error reporting CUDA_ERROR_INVALID_HANDLE (“Cannot pass CUkernel handle to this API”). The failure has been observed on B100 with Ubuntu 24.04 and CUDA 13.2.1. This will be addressed in a future release.

  • On Windows, Version Compatible engines built with TensorRT 10.1 through 10.4 that use the RoiAlign plugin (ROIAlign_TRT) cannot be deserialized with the TensorRT 11.x runtime due to a bug in how those engines were produced. The bug was fixed in TensorRT 10.5. Engines built with TensorRT 10.5 or later are expected to deserialize correctly. As a workaround, rebuild affected engines with TensorRT 10.5 or later before loading them with TensorRT 11.x.

  • Building some explicitly quantized FP8 networks with convolution layers that use large spatial strides (such as vision-transformer patch-embedding layers) may fail during engine build with no available tactics. This has been observed on NVIDIA Blackwell platforms. As a workaround, keep affected convolution layers in a higher-precision format rather than FP8 until fallback support is available. This will be addressed in a future release.

  • Inspecting very large engines with the Engine Inspector may fail on systems with limited host memory. The inspection process can exceed available memory and be terminated by the operating system without a TensorRT error message. This has been observed with very large diffusion-model engines. As a workaround, run engine inspection on a system with sufficient host memory. This will be addressed in a future release.

  • Some explicitly quantized NVFP4 diffusion transformer models may produce all-NaN outputs during inference on NVIDIA Blackwell platforms, resulting in incorrect results compared with the reference output. This will be addressed in a future release.
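The two Windows Version Compatible deserialization issues in this list both have a rebuild workaround, which can be checked before an engine is loaded. The following is a minimal sketch of such a preflight check. The function names are hypothetical, the caller is assumed to already know which TensorRT version built the engine and whether it uses ROIAlign_TRT, and a two-part version such as "10.1" is assumed to mean "10.1.0".

```python
def parse_version(text):
    """Parse "10.1" or "10.1.0" into a three-part tuple, padding with zeros."""
    parts = tuple(int(p) for p in text.split("."))
    return (parts + (0, 0, 0))[:3]


def min_rebuild_version(builder_version, on_windows, uses_roi_align):
    """Return the oldest TensorRT 10.x release to rebuild with, or None.

    Encodes only the two Windows Version Compatible issues listed above:
      * engines using ROIAlign_TRT built with 10.1 through 10.4
        (11.x runtime): rebuild with 10.5 or later
      * engines built with 10.1.0 (11.0.0 runtime): rebuild with 10.2 or later
    Hypothetical helper; it does not inspect the engine file itself.
    """
    if not on_windows:
        return None
    version = parse_version(builder_version)
    if uses_roi_align and (10, 1) <= version[:2] <= (10, 4):
        return "10.5"
    if version == (10, 1, 0):
        return "10.2"
    return None


if __name__ == "__main__":
    print(min_rebuild_version("10.1.0", on_windows=True, uses_roi_align=False))
    print(min_rebuild_version("10.3.0", on_windows=True, uses_roi_align=True))
    print(min_rebuild_version("10.6.0", on_windows=True, uses_roi_align=True))
```

The RoiAlign check runs first because its workaround (10.5 or later) also satisfies the 10.1.0 workaround (10.2 or later).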

Performance

  • A non-zero tilingOptimizationLevel might introduce engine build failures for some networks on L4 GPUs.

  • Engines built with the kREFIT or kREFIT_IDENTICAL flag have performance regressions compared with non-refit engines when convolution layers are present within a branch or loop and the precision is FP16 or INT8. This issue will be addressed in a future release.

  • On RTX PRO 6000 Blackwell Max-Q (sm120, aarch64), some strongly-typed FP16 networks exhibit GPU time regressions compared with TensorRT 10.16.1. The slowdown is concentrated in fused-graph kernels, in particular, fused Conv/BiasAdd patterns and mean-reduction kernels, and has been observed at up to ~45% on individual networks. The regression is stable across reruns. This issue will be addressed in a future release.

  • Some networks may use more execution-context GPU memory in TensorRT 11.1 than in TensorRT 11.0 without a corresponding inference speedup. This will be addressed in a future release.

  • During engine build, some ONNX networks may consume substantially more host (CPU) memory in TensorRT 11.1 than in TensorRT 11.0. An increase of approximately 25–30% has been observed on NVIDIA H100 when building certain speech-synthesis models with TF32 enabled. This will be addressed in a future release.

  • Some transformer models that use multi-head attention, including vision-transformer architectures, may run slower on GPU in TensorRT 11.x than in TensorRT 10.16 due to additional layout-related overhead during inference. This will be addressed in a future release.