TensorRT 11.0.0 Release Notes

These Release Notes apply to x86 Linux and Windows users, and to Linux users on ARM-based CPU cores that follow the Server Base System Architecture (SBSA). This release includes several fixes from the previous TensorRT releases and additional changes.

Announcements

  • RHEL 10 / Rocky Linux 10 support: RPM and tar packages are now available for Red Hat Enterprise Linux 10 and Rocky Linux 10.

  • New Migration Guide content: The Migration Guide now includes a complete TensorRT 10.x to 11.x migration path with chapters covering the C++ API, Python API, trtexec, Safety Runtime, IEngineInspector JSON output changes, and platform-specific guidance for NVIDIA DriveOS and Jetson/JetPack.

  • Strongly typed networks are now the default: createNetworkV2() produces a strongly typed network by default in TensorRT 11.0.0. Weak typing is no longer supported. The optimizer infers intermediate tensor types from the network input types and operator specifications and adheres to them strictly. See Strongly Typed Networks and the NVIDIA TensorRT Migration Guide for the upgrade path from weak typing.

  • Package naming change: Tar and zip archive filenames have been restructured in TensorRT 11.x. Update any download scripts or CI pipelines that reference the old naming convention.

    Format       Filename pattern
    ----------   ----------------
    Tar (10.x)   TensorRT-<version>.<os>.<arch>-gnu.cuda-<cuda_version>.tar.gz
    Tar (11.x)   TensorRT-<product>-<product_version>-<os>-<arch>-cuda-<cuda_version>-Release-external.tar.zst
    Zip (10.x)   TensorRT-<version>.<os>.<arch>.cuda-<cuda_version>.zip
    Zip (11.x)   TensorRT-<product>-<product_version>-<os>-<arch>-cuda-<cuda_version>-Release-external.zip

  • The TensorRT static libraries have been removed. If you build your application against the static libraries, migrate to the shared libraries. The following library files have been removed in TensorRT 11.0.0:

    • libnvinfer_static.a

    • libnvinfer_plugin_static.a

    • libnvinfer_lean_static.a

    • libnvinfer_dispatch_static.a

    • libnvinfer_vc_plugin_static.a

    • libnvonnxparser_static.a

    • libonnx_proto.a
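Both the archive renaming and the static-library removal above tend to surface in download scripts and link lines. As a rough aid, the following sketch scans a piece of text for 10.x-style archive names and for the removed static library files. It is not an NVIDIA tool: the regular expression is an assumption derived from the filename patterns listed above, and the function name and the sample filename in the usage note are hypothetical.

```python
import re

# Static libraries removed in TensorRT 11.0.0 (taken from the list above).
REMOVED_STATIC_LIBS = (
    "libnvinfer_static.a",
    "libnvinfer_plugin_static.a",
    "libnvinfer_lean_static.a",
    "libnvinfer_dispatch_static.a",
    "libnvinfer_vc_plugin_static.a",
    "libnvonnxparser_static.a",
    "libonnx_proto.a",
)

# Assumption: 10.x archives put ".cuda-<cuda_version>" directly before the
# extension, whereas 11.x archives use "-cuda-<cuda_version>-Release-external".
OLD_ARCHIVE_RE = re.compile(r"TensorRT-[\w.\-]+?\.cuda-[\d.]+\.(?:tar\.gz|zip)")


def audit_script(text):
    """Return (old_archive_names, removed_libs) referenced in text."""
    old_archives = OLD_ARCHIVE_RE.findall(text)
    removed_libs = [lib for lib in REMOVED_STATIC_LIBS if lib in text]
    return old_archives, removed_libs
```

For example, calling audit_script on a line such as "tar xf TensorRT-10.9.9.9.Linux.x86_64-gnu.cuda-12.9.tar.gz" (a made-up filename) reports that archive name, while an 11.x-style name is not flagged.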

Key Features and Enhancements

Transformer Inference

  • Ragged batching for IAttention and IKVCacheUpdateLayer: IAttention now supports packed (ragged) query and key/value tensors via setQueryForm and setKeyValueForm with the kPACKED_NHD layout, allowing variable-length sequences to be concatenated end-to-end without padding to the longest sequence in the batch. Per-sequence lengths are supplied via setQueryLengths and setKeyValueLengths. IKVCacheUpdateLayer similarly supports packed updates via setUpdateForm and setUpdateLengths. For more information, refer to the Fused Attention section.
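To make the packed (ragged) layout concrete, here is a framework-free sketch that does not call TensorRT. It concatenates variable-length sequences end-to-end, derives each sequence's start offset from the per-sequence lengths, and compares the stored token count with padding every sequence to the longest one. The function names are hypothetical, and the sketch says nothing about how TensorRT lays out the head and feature dimensions internally.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate sequences end-to-end; return (packed, lengths, offsets)."""
    lengths = [len(seq) for seq in sequences]
    # offsets[i] is the index in `packed` where sequence i starts.
    offsets = list(accumulate(lengths, initial=0))[:-1]
    packed = [token for seq in sequences for token in seq]
    return packed, lengths, offsets


def padded_token_count(lengths):
    """Tokens stored when every sequence is padded to the longest one."""
    return len(lengths) * max(lengths) if lengths else 0
```

With sequence lengths 3, 1, and 2, the packed form stores 6 tokens, whereas padding to the longest sequence would store 9.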

MoE (Mixture of Experts)

  • Backend performance improvements for MoE inference: Builds on the MoE inference capability introduced in TensorRT 10.16 with significant backend optimizations. These optimizations close the performance gap between TensorRT’s out-of-the-box MoE inference and specialized external MoE kernels, particularly on Blackwell (SM10x and SM110) hardware. The previous guidance to keep token counts low (seqLen ≤ 16) no longer applies; MoE inference now performs well across a much broader range of token counts. For more information, refer to the MoE (Mixture of Experts) section.

Multi-Device Inference

  • Multi-Device Inference is now fully supported: In TensorRT 10.16, Multi-Device Inference was a preview feature that required manually enabling the PreviewFeature::kMULTIDEVICE_RUNTIME_10_16 flag in the builder config. Starting in TensorRT 11.0.0, this feature is fully supported and the flag is no longer needed. For more details, refer to the Multi-Device Inference section.

  • Expanded Distributed Collective Operations: IDistCollectiveLayer now supports three additional collective operations: AllToAll, Gather, and Scatter. These ops require NCCL ≥ 2.7.0, which is why TensorRT 11.0.0 raises the minimum supported NCCL version (refer to the TensorRT Support Matrix for the full NCCL range).

  • Improved NCCL Library Discovery: The runtime now looks for libnccl.so.2 first and falls back to libnccl.so if it is not found. This prevents initialization failures that were caused by requiring a single specific filename, and makes deployment more flexible across environments.

  • New Context-Parallel Attention Python Sample: Added a new Python sample (attention-mdtrt) that demonstrates context-parallel attention, which splits KV sequences across GPUs. For more information, refer to the Working with Transformers section.
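The fallback order described for NCCL library discovery can be illustrated with a short sketch. This is not TensorRT's implementation; it only mirrors the documented order (libnccl.so.2 first, then libnccl.so) using Python's ctypes. The function name and the loader parameter are hypothetical and exist only so the logic can be exercised without NCCL installed.

```python
import ctypes

# Documented lookup order: versioned name first, then the unversioned name.
NCCL_CANDIDATES = ("libnccl.so.2", "libnccl.so")


def load_first_available(candidates=NCCL_CANDIDATES, loader=ctypes.CDLL):
    """Try each library name in order; return (name, handle) for the first that loads."""
    errors = []
    for name in candidates:
        try:
            return name, loader(name)
        except OSError as exc:
            # ctypes.CDLL raises OSError when the library cannot be loaded.
            errors.append(f"{name}: {exc}")
    raise OSError("no NCCL library found; tried " + "; ".join(errors))
```

Raising only after every candidate has failed keeps the error message complete, which helps when diagnosing which filenames were actually tried.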

API Enhancements

  • Internal Library Path API: Added nvinfer1::setInternalLibraryPath C++ API to set the path for internal builder resource libraries (libnvinfer_builder_resource_*.so) when they are not in the system path. For more information, refer to the Set Internal Library Path API section.

  • IAttention causal mask orientation control: Introduced the CausalMaskKind enum and IAttention::setCausalKind / IAttention::getCausalKind APIs to let users specify causal mask alignment (for example, kUPPER_LEFT or kNONE) without providing an explicit mask tensor. This enables clearer and more flexible configuration of causal masking behavior in addAttention.
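As an illustration of what a causal mask alignment means, the following framework-free sketch builds boolean masks for the two values named above. It assumes that kUPPER_LEFT follows the common convention in which the causal diagonal starts at the top-left corner of the query-by-key matrix, so query i may attend to key j only when j <= i, and that kNONE applies no masking. The function names are hypothetical and no TensorRT API is called.

```python
def upper_left_causal_mask(num_queries, num_keys):
    """mask[i][j] is True when query i may attend to key j (assumed kUPPER_LEFT)."""
    return [[j <= i for j in range(num_keys)] for i in range(num_queries)]


def no_mask(num_queries, num_keys):
    """Every query may attend to every key (assumed kNONE)."""
    return [[True] * num_keys for _ in range(num_queries)]
```

For two queries and three keys, the upper-left mask lets the first query see only the first key and the second query see the first two keys.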

Open-Source Components

  • New TopK V3 plugin for large K values: A new IPluginV3 TopK plugin has been added to the TensorRT OSS components. The plugin ports the AIR TopK kernel from TensorRT-LLM and supports significantly larger K values than the native TensorRT kernel for the ONNX TopK operator. In 11.0.0, the plugin is OSS-only. The core TensorRT builder does not automatically fall back to it for large K values, so you must add the plugin to your network explicitly. Automatic fallback in the core TensorRT path may be added in a subsequent release.

  • For changes to TensorRT open-source components, including samples, plugins, and parsers, refer to the TensorRT GitHub Releases page.

Documentation

  • Rewritten Best Practices and Benchmarking guide: The Best Practices landing page now frames performance work as a measure-then-optimize feedback loop, and the Performance Benchmarking chapter has been rewritten to cover both ONNX-TRT (trtexec) and Torch-TRT workflows side by side in synchronized tabs. New or expanded coverage includes benchmarking basics, ModelOpt quantization, dynamic shapes, CUDA graphs, real input values, layer information and per-layer runtime, serialized engines and timing caches, built-in TensorRT profiling, and reading the Nsight Systems Timeline View.

  • New V2 → V3 plugin migration guide: A dedicated Migrating V2 Plugins to IPluginV3 chapter in the Inference Library covers the lifecycle-phase API mapping, the known V3 pitfalls (including the PluginField null-data crash and the weakly-typed-network fusion crash), and a V2 → V3 performance-regression checklist.

Breaking ABI Changes

  • Plugins that require cuDNN are no longer supported starting with TensorRT 11.0.0. cuDNN is no longer a dependency of TensorRT, not even an optional one, but some applications may still use cuDNN directly, independently of TensorRT. Check your application’s requirements.

  • TensorRT 11.0 removed weak typing APIs along with several other deprecated APIs. See the NVIDIA TensorRT Migration Guide for more details.

Breaking Behavioral Changes

trtexec

  • The behaviors controlled by the --useCudaGraph, --noDataTransfers, --useSpinWait, and --separateProfileRun flags are now enabled by default, and the flags themselves are deprecated. Each flag is still accepted but has no effect. For replacement flags and details, refer to the trtexec migration page.

  • The --stronglyTyped flag has no effect in TensorRT 11.0.0+, since strongly typed networks are now the default. The flag is still accepted for backward compatibility.
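Because the flags listed above are accepted but have no effect in TensorRT 11.0.0, wrapper scripts can drop them without changing behavior. The following is a minimal sketch of such a cleanup step; the function name is hypothetical, and the example command line in the usage note is illustrative only.

```python
# Flags that TensorRT 11.0.0 still accepts but treats as no-ops (from the notes above).
NOOP_FLAGS = frozenset({
    "--useCudaGraph",
    "--noDataTransfers",
    "--useSpinWait",
    "--separateProfileRun",
    "--stronglyTyped",
})


def strip_noop_flags(argv):
    """Return argv without the no-op flags, preserving the order of everything else."""
    return [arg for arg in argv if arg not in NOOP_FLAGS]
```

For instance, ["trtexec", "--onnx=model.onnx", "--useCudaGraph", "--stronglyTyped"] becomes ["trtexec", "--onnx=model.onnx"]. Removing --stronglyTyped entirely also sidesteps the duplicate-flag parser error described under Known Issues.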

Engine Inspector

  • The Bindings field in the IEngineInspector JSON output has been removed and replaced with the I/O Tensors field, which provides structured per-tensor metadata (name, I/O mode, data type, dimensions, location, and shape inference status).

  • The combined Format/DataType field in each layer’s information has been split into separate Format and DataType fields. For before/after examples, refer to the IEngineInspector migration page.
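Tools that parse the inspector output may need to handle both layouts during a transition. The sketch below works on a dictionary already parsed from the JSON (for example with json.loads) and uses only the field names mentioned above: Bindings, I/O Tensors, Format/DataType, Format, and DataType. Where these fields sit in the document and what their values look like are assumptions; the function names are hypothetical, and the IEngineInspector migration page remains the reference for the actual schema.

```python
def io_tensors(inspector_doc):
    """Return the I/O description from the 11.x field, else the 10.x field."""
    # Assumption: both fields appear at the top level of the parsed document.
    if "I/O Tensors" in inspector_doc:
        return inspector_doc["I/O Tensors"]
    return inspector_doc.get("Bindings", [])


def format_and_dtype(tensor_info):
    """Return (format, dtype); dtype is None when only the combined 10.x field exists."""
    if "Format" in tensor_info and "DataType" in tensor_info:
        return tensor_info["Format"], tensor_info["DataType"]
    # 10.x layout: a single combined string that this sketch does not try to split.
    return tensor_info.get("Format/DataType"), None
```

Returning the combined 10.x string unsplit is a deliberate choice here: guessing where the format ends and the data type begins would be fragile.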

Compatibility

For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT 11.0.0.

Limitations

  • The high-precision weights used in FP4 double quantization are not refittable.

  • Python samples do not support Python 3.13. For Python 3.13, only the Python bindings are currently supported.

  • Loops with scan outputs (ILoopOutputLayer with LoopOutput property being either LoopOutput::kCONCATENATE or LoopOutput::kREVERSE) must have the number of iterations set, that is, must have an ITripLimitLayer with TripLimit::kCOUNT. This requirement has always been present, but is now explicitly enforced instead of quietly having undefined behavior.

  • ISelectLayer must have data inputs (thenInput and elseInput) of the same datatype.

  • When implementing a custom layer with the IPluginV3 plugin class where the custom layer has a data-dependent shape (DDS), the size tensors must be of INT64 type only; using INT32 results in a compilation failure. Related samples have been updated accordingly.

  • In some cases, a Shuffle op cannot be transformed into a no-op to improve performance. For the NCHW32 format, TensorRT takes the third-to-last dimension as the channel dimension. When a Shuffle op such as [N, ‘C’, H, 1] -> [‘N’, C, H] is added (quotes mark the channel dimension), the channel dimension changes from C to N, so the op cannot be transformed into a no-op.

  • FP8 convolutions on GPUs with SM89/90/120/121 do not support kernel sizes larger than 32 (for example, 7x7 convolutions); in those cases, FP16 or FP32 fallback kernels are used, with suboptimal performance. For better performance, do not add FP8 Q/DQ ops before convolutions with large kernel sizes.

  • When building the nonZeroPlugin sample on Windows, you might need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.

Deprecated API Lifetime

  • APIs deprecated in TensorRT 11.0.0 will be retained until 3/2027.

  • APIs deprecated in TensorRT 10.16 will be retained until 3/2027.

  • APIs deprecated in TensorRT 10.15 will be retained until 1/2027.

  • APIs deprecated in TensorRT 10.14 will be retained until 10/2026.

  • APIs deprecated in TensorRT 10.13 will be retained until 8/2026.

  • APIs deprecated in TensorRT 10.12 will be retained until 6/2026.

See also

  • Migrating from TensorRT 10.x to 11.x: Step-by-step migration guide for upgrading from TensorRT 10.x.

  • Upgrading TensorRT: Package-level upgrade instructions (pip, Debian, RPM, tar, zip).

Deprecated and Removed Features

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • Implicit quantization has been removed. The IInt8Calibrator class and all subclasses (IInt8EntropyCalibrator, IInt8EntropyCalibrator2, IInt8MinMaxCalibrator, IInt8LegacyCalibrator), setInt8Calibrator(), setDynamicRange(), resetDynamicRange(), and all related calibration and dynamic-range APIs have been removed. Use explicit quantization with Q/DQ nodes instead. Models can be quantized using the TensorRT Model Optimizer or by manually inserting Q/DQ nodes. For migration details, refer to the C++ API and Python API migration sections.

  • Built-in plugins deprecated before TensorRT 10 have been removed. The following plugin classes are no longer shipped: BatchedNMS_TRT, BatchedNMSDynamic_TRT, BatchTilePlugin_TRT, Clip_TRT, CoordConvAC, CropAndResize, CustomGeluPluginDynamic, EfficientNMS_ONNX_TRT, LReLU_TRT, NMS_TRT, NMSDynamic_TRT, Normalize_TRT, Proposal, SingleStepLSTMPlugin, SpecialSlice_TRT, and Split. For per-plugin replacements (most map to standard INetworkDefinition APIs such as addNMS(), addActivation(), addNormalizationV2(), addSlice(), or addLoop()), refer to the Removed C++ Plugins and Replacements table.

  • The entire IPluginV2 family has been removed: IPluginV2, IPluginV2Ext, IPluginV2IOExt, IPluginV2DynamicExt, IPluginCreator, IPluginV2Layer, and INetworkDefinition::addPluginV2(). Migrate custom plugins to IPluginV3 + IPluginCreatorV3One + addPluginV3(). For the V2 ↔ V3 API mapping, migration walkthrough, and known pitfalls, refer to Migrating V2 Plugins to IPluginV3 and the C++ API / Python API migration sections.

  • All weak-typing APIs have been removed. ITensor::setType, ITensor::setDynamicRange, ILayer::setPrecision, ILayer::setOutputType, INormalizationLayer::setComputePrecision, and the per-precision BuilderFlag values (kFP16, kBF16, kFP8, kINT8, kINT4, kFP4, kOBEY_PRECISION_CONSTRAINTS, kPREFER_PRECISION_CONSTRAINTS) are no longer available. The IRefitter calibration surface is also removed. Build with strongly typed networks instead — see the NVIDIA TensorRT Migration Guide for the conversion path.

  • The OSS BERT plugin family is deprecated in this release and will be removed in a future release. The deprecated classes are bertQKVToContextPlugin (also registered as CustomQKVToContextPluginDynamic, versions 1 through 4) and CustomEmbLayerNormPluginDynamic. The companion CustomFCPluginDynamic (fcPlugin) was already deprecated in TensorRT 10.6 and is part of the same family. Migrate to the native attention path using INetworkDefinition::addAttentionV2() for QKV-to-context, and to standard IGatherLayer followed by INetworkDefinition::addNormalizationV2() for embedding + layer normalization. For per-plugin replacement guidance, refer to the Deprecated BERT Plugins section in the C++ API migration guide.

  • The boolean causal parameter in addAttention and the legacy IAttention::setCausal(bool) / IAttention::getCausal() methods are deprecated in this release. Use IAttention::setCausalKind(CausalMaskKind) / IAttention::getCausalKind() instead for configuring causal mask orientation. These legacy boolean APIs will be removed in a future release.

  • Windows 10 support is considered deprecated and Windows 10 will no longer be supported by TensorRT after October 2026. Windows 10 has been End-of-Life since October 2025. GeForce driver support for Windows 10 will also be reduced at that time. For more information, refer to GeForce Support Plan for Windows 10.

  • The TensorRT Python bindings for Python versions 3.8 and 3.9 are considered deprecated starting with this release. These Python versions are considered End-of-Life by Python upstream and have not been supported by TensorRT’s samples for multiple releases. These bindings will be removed in a future release.

Known Issues

Functional

  • The ECCommunicatorAPITests.SetCommunicatorFailsWithoutSupportedLayer and ECCommunicatorAPITests.SetCommunicatorSucceedsWithDistCollective runtime tests report errors under valgrind (host) and compute-sanitizer memcheck (GPU). Host-side memory leaks are caused by NCCL internal allocations during ncclCommInitRank and ncclCommSplit and reproduce on H100 and B100 platforms. GPU-side errors are NCCL kernel probing failures that occur on H100 (SM 90) when the sanitizer intercepts CUDA API errors before NCCL can handle them internally; B100 is not affected on the GPU side.

  • Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shapes will be rejected.

  • On CUDA versions prior to 13.2, the compute-sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.

  • The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows.

  • TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if there is inconsistency in the indices specified in the batch_indices input and the actual batch size used.

  • IPluginV3 plugins that advertise a PluginField with a nullptr data pointer and length == 0 can crash inside the V3 creator dispatch path during build or deserialization. As a workaround, populate every PluginField entry with a non-null sentinel buffer, even when the value is unused at runtime. For details and a code snippet, refer to Known Migration Issues. Tracked by NVBug 5607435.

  • IPluginV3 plugins used in weakly-typed networks may trigger fusion-path crashes that did not exist for IPluginV2DynamicExt. The supported configuration in TensorRT 11.0.0 is to build with strongly-typed networks (which is the default in 11.0.0, since the precision-enabling BuilderFlag values have been removed). Refer to Known Migration Issues.

  • trtexec exits with [E] Unknown option: --stronglyTyped when the deprecated --stronglyTyped flag is specified more than once on the command line. The first occurrence is accepted as a deprecated no-op, but the argument parser rejects subsequent occurrences instead of also treating them as no-ops. As a workaround, ensure --stronglyTyped appears at most once, or omit it entirely since strongly typed networks are now the default in TensorRT 11.0.0. This will be fixed in a future release.

  • Several hyperlinks in the README.md of the OSS samples/python/quickly_deployable_plugins Python sample are broken, including references to the circular padding plugin section and other in-document anchors. Users following the README may be unable to navigate to the intended sections. The README content itself is still correct; only the cross-references are affected. This will be fixed in a future release.

  • The ECCommunicatorAPITests.SetCommunicatorWorksWithAllAllocationStrategies runtime test can time out on B300 platforms when running against NCCL 2.29.4. The timeout is caused by a long (~21–22 second) cold-initialization latency in the first ncclCommInitRank call, which NCCL attributes to its “kernels init” phase; the behavior reproduces with an NCCL-only reproducer and is not caused by TensorRT runtime logic. Upgrading to NCCL 2.30.x reduces the first-init cost to under one second. TensorRT requires only a minimum NCCL 2.x and supports newer minor versions, so applications affected by this latency on B300 can update NCCL independently. This will be addressed in a future release.

  • ONNX Runtime release 1.24.4 and earlier cannot compile their TensorRT Execution Provider against TensorRT 11.0.0 because the provider still references APIs that were removed in TensorRT 11.0 — namely IBuilder::platformHasFastFp16(), IBuilder::platformHasFastInt8(), and IBuilderConfig::setInt8Calibrator() (removed as part of the weak-typing and implicit-quantization cleanup; refer to the NVIDIA TensorRT Migration Guide). Building ONNX Runtime ≤ 1.24.4 against TensorRT 11.0.0 produces C2039 'member not found' compilation errors on these symbols. Until an ONNX Runtime release adopting the TensorRT 11.x API is available, applications integrating ONNX Runtime with TensorRT must either remain on TensorRT 10.x or apply a local patch that removes the references and substitutes the strongly-typed / explicit-quantization equivalents.

  • On Windows, the TensorRT 11.0.0 runtime can fail to deserialize a Version Compatible engine that was built with TensorRT 10.1.0. Deserialization fails inside the runtime dispatch VC plugin loader when loading nvinfer_vc_plugin.dll (Windows ERROR_MOD_NOT_FOUND / error 126), surfaces as TensorRT error code 6 (API Usage Error), and leaves the returned ICudaEngine pointer null. Later TensorRT 10.x builders are not affected; only engines produced by TensorRT 10.1.0 have been observed to trigger this failure. As a workaround, rebuild the affected VC engines with a newer TensorRT 10.x release (10.2 or later) before deserializing them with the TensorRT 11.0.0 runtime on Windows. This will be addressed in a future release.

  • On NVIDIA Blackwell B100, some networks that use attention patterns can have invalid memory access or fail during IExecutionContext initialization with a CUDA driver error reporting CUDA_ERROR_INVALID_HANDLE (“Cannot pass CUkernel handle to this API”). The failure has been observed on B100 with Ubuntu 24.04 and CUDA 13.2.1. This will be addressed in a future release.

Performance

  • EfficientNet and RegNet have a ~5-10% performance regression on the RTX PRO 6000 Blackwell platform.

  • A non-zero tilingOptimizationLevel might introduce engine build failures for some networks on L4 GPUs.

  • Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refit engines when convolution layers are present within a branch or loop and the precision is FP16 or INT8. This issue will be addressed in a future release.

  • On RTX PRO 6000 Blackwell Max-Q (sm120, aarch64), some strongly-typed FP16 networks exhibit GPU time regressions compared with TensorRT 10.16.1. The slowdown is concentrated in fused-graph kernels — in particular, fused Conv/BiasAdd patterns and mean-reduction kernels — and has been observed at up to ~45% on individual networks. The regression is stable across reruns. This issue will be addressed in a future release.