TensorRT 11.0.0 Release Notes
These Release Notes apply to x86 Linux and Windows users, and ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux. This release includes several fixes from the previous TensorRT releases and additional changes.
Announcements
RHEL 10 / Rocky Linux 10 support: RPM and tar packages are now available for Red Hat Enterprise Linux 10 and Rocky Linux 10.
New Migration Guide content: The Migration Guide now includes a complete TensorRT 10.x to 11.x migration path with chapters covering the C++ API, Python API, trtexec, Safety Runtime, IEngineInspector JSON output changes, and platform-specific guidance for NVIDIA DriveOS and Jetson/JetPack.
Strongly typed networks are now the default: createNetworkV2() produces a strongly typed network by default in TensorRT 11.0.0. Weak typing is no longer supported. The optimizer infers intermediate tensor types from the network input types and operator specifications and adheres to them strictly. See Strongly Typed Networks and the NVIDIA TensorRT Migration Guide for the upgrade path from weak typing.
Package naming change: Tar and zip archive filenames have been restructured in TensorRT 11.x. Update any download scripts or CI pipelines that reference the old naming convention.
Format | Filename pattern
Tar (10.x) | TensorRT-<version>.<os>.<arch>-gnu.cuda-<cuda_version>.tar.gz
Tar (11.x) | TensorRT-<product>-<product_version>-<os>-<arch>-cuda-<cuda_version>-Release-external.tar.zst
Zip (10.x) | TensorRT-<version>.<os>.<arch>.cuda-<cuda_version>.zip
Zip (11.x) | TensorRT-<product>-<product_version>-<os>-<arch>-cuda-<cuda_version>-Release-external.zip
The TensorRT static libraries have been removed. If you build your application against the static libraries, migrate to the shared libraries. The following library files have been removed in TensorRT 11.0.0:
libnvinfer_static.a
libnvinfer_plugin_static.a
libnvinfer_lean_static.a
libnvinfer_dispatch_static.a
libnvinfer_vc_plugin_static.a
libnvonnxparser_static.a
libonnx_proto.a
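The package naming change above can be checked mechanically in download scripts. The following is a minimal sketch, not an official tool: the regular expressions are transcribed from the filename patterns in the table, the character classes chosen for each placeholder are assumptions, and the 11.x filename used in the usage note contains made-up placeholder values.

```python
import re

# Patterns transcribed from the filename table in the announcements.
# The character classes standing in for each <placeholder> are assumptions,
# not an official grammar for TensorRT archive names.
ARCHIVE_PATTERNS = {
    "tar (10.x)": re.compile(r"^TensorRT-[\d.]+\.\w+\.\w+-gnu\.cuda-[\d.]+\.tar\.gz$"),
    "zip (10.x)": re.compile(r"^TensorRT-[\d.]+\.\w+\.\w+\.cuda-[\d.]+\.zip$"),
    "tar (11.x)": re.compile(r"^TensorRT-.+-cuda-[\d.]+-Release-external\.tar\.zst$"),
    "zip (11.x)": re.compile(r"^TensorRT-.+-cuda-[\d.]+-Release-external\.zip$"),
}


def classify_archive(filename):
    """Return the naming scheme a filename follows, or None if it matches none."""
    for label, pattern in ARCHIVE_PATTERNS.items():
        if pattern.match(filename):
            return label
    return None


def uses_old_naming(filename):
    """True when a script still references a 10.x-style archive name."""
    label = classify_archive(filename)
    return label is not None and label.endswith("(10.x)")
```

For example, `uses_old_naming("TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-12.4.tar.gz")` returns `True`, while a placeholder 11.x-style name such as `TensorRT-PRODUCT-11.0.0.1-Linux-x86_64-cuda-13.2-Release-external.tar.zst` is classified as `tar (11.x)`.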
Key Features and Enhancements
Transformer Inference
Ragged batching for IAttention and IKVCacheUpdateLayer: IAttention now supports packed (ragged) query and key/value tensors via setQueryForm and setKeyValueForm with the kPACKED_NHD layout, allowing variable-length sequences to be concatenated end-to-end without padding to the longest sequence in the batch. Per-sequence lengths are supplied via setQueryLengths and setKeyValueLengths. IKVCacheUpdateLayer similarly supports packed updates via setUpdateForm and setUpdateLengths. For more information, refer to the Fused Attention section.
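To make the packed (ragged) idea concrete, here is a purely conceptual sketch in plain Python. It does not call any TensorRT API; the helper names are hypothetical, and each integer stands in for one token's data. It only illustrates why concatenating sequences and passing per-sequence lengths avoids the padding overhead.

```python
def pack_sequences(sequences):
    """Concatenate variable-length sequences end-to-end.

    Returns the packed token list and the per-sequence lengths that a
    consumer needs in order to find the sequence boundaries again.
    """
    packed = [token for seq in sequences for token in seq]
    lengths = [len(seq) for seq in sequences]
    return packed, lengths


def pad_sequences(sequences, pad=0):
    """Pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad] * (longest - len(seq)) for seq in sequences]


def unpack_sequences(packed, lengths):
    """Recover the original sequences from the packed list and lengths."""
    sequences, start = [], 0
    for length in lengths:
        sequences.append(packed[start:start + length])
        start += length
    return sequences


batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]
packed, lengths = pack_sequences(batch)
padded = pad_sequences(batch)
# Packed storage holds 10 tokens; padded storage holds 3 * 5 = 15.
print(len(packed), sum(len(row) for row in padded))  # prints "10 15"
```

The lengths list plays the role that the per-sequence length inputs play in the description above: without it, the boundaries between concatenated sequences cannot be recovered.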
MoE (Mixture of Experts)
Backend performance improvements for MoE inference: Builds on the MoE inference capability introduced in TensorRT 10.16 with significant backend optimizations. These optimizations close the performance gap between TensorRT’s out-of-the-box MoE inference and specialized external MoE kernels, particularly on Blackwell (SM10x and SM110) hardware. The previous guidance to keep token counts low (seqLen ≤ 16) no longer applies; MoE inference now performs well across a much broader range of token counts. For more information, refer to the MoE (Mixture of Experts) section.
Multi-Device Inference
Multi-Device Inference is now fully supported: In TensorRT 10.16, Multi-Device Inference was a preview feature that required manually enabling the PreviewFeature::kMULTIDEVICE_RUNTIME_10_16 flag in the builder config. Starting in TensorRT 11.0.0, this feature is fully supported and the flag is no longer needed. For more details, refer to the Multi-Device Inference section.
Expanded Distributed Collective Operations: IDistCollectiveLayer gains three new collective operations for distributed workloads: AllToAll, Gather, and Scatter. These ops require NCCL ≥ 2.7.0, which is why TensorRT 11.0.0 raises the minimum supported NCCL version (refer to the TensorRT Support Matrix for the full NCCL range).
Improved NCCL Library Discovery: The NCCL library is now loaded with an automatic fallback to increase environment compatibility and deployment flexibility. The runtime checks for libnccl.so.2 first and falls back to libnccl.so, which prevents initialization failures caused by requiring a single specific filename.
New Context-Parallel Attention Python Sample: Added a new Python sample (attention-mdtrt) that demonstrates context-parallel attention, which splits KV sequences across GPUs. For more information, refer to the Working with Transformers section.
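The NCCL discovery order described above (libnccl.so.2 first, then libnccl.so) follows a common try-in-order pattern. The sketch below illustrates that pattern with Python's ctypes; it is an assumption about the general approach, not TensorRT's actual loader, and the function name and `loader` parameter are hypothetical.

```python
import ctypes

# Candidate names in the order given in the release note.
NCCL_CANDIDATES = ("libnccl.so.2", "libnccl.so")


def load_first_available(candidates=NCCL_CANDIDATES, loader=ctypes.CDLL):
    """Try each library name in order and return (name, handle) for the first hit.

    `loader` is injectable so the fallback logic can be exercised without
    NCCL installed; by default it is ctypes.CDLL, which raises OSError when
    the shared library cannot be found.
    """
    failures = []
    for name in candidates:
        try:
            return name, loader(name)
        except OSError as exc:
            failures.append(f"{name}: {exc}")
    raise OSError("no candidate library could be loaded: " + "; ".join(failures))
```

Calling `load_first_available()` on a machine where only the unversioned libnccl.so exists would return that name instead of failing on the missing versioned file.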
API Enhancements
Internal Library Path API: Added the nvinfer1::setInternalLibraryPath C++ API to set the path for internal builder resource libraries (libnvinfer_builder_resource_*.so) when they are not in the system path. For more information, refer to the Set Internal Library Path API section.
IAttention causal mask orientation control: Introduced the CausalMaskKind enum and the IAttention::setCausalKind / IAttention::getCausalKind APIs to let users specify causal mask alignment (for example, kUPPER_LEFT or kNONE) without providing an explicit mask tensor. This enables clearer and more flexible configuration of causal masking behavior in addAttention.
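As a conceptual aid for the causal mask orientation item, the sketch below builds an upper-left-aligned causal mask in plain Python. It assumes that kUPPER_LEFT follows the common convention in which the mask diagonal is anchored at the top-left corner of the query-by-key score matrix, so query i may attend to keys 0 through i. The release notes do not define the enum's semantics, so treat this as an assumption; the function name is hypothetical and no TensorRT API is called.

```python
def causal_mask_upper_left(num_queries, num_keys):
    """Boolean mask where entry [i][j] is True if query i may attend to key j.

    Assumed convention: the diagonal starts at the top-left corner, so the
    allowed region for query i is keys 0..i, regardless of num_keys.
    """
    return [[j <= i for j in range(num_keys)] for i in range(num_queries)]


# Render a 3-query, 5-key mask: "x" = attend, "." = masked out.
for row in causal_mask_upper_left(3, 5):
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# x....
# xx...
# xxx..
```

When the query and key counts differ, the choice of anchor corner changes which keys are visible, which is why an explicit orientation setting is more expressive than a single boolean.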
Open-Source Components
New TopK V3 plugin for large K values: A new IPluginV3 TopK plugin has been added to the TensorRT OSS components. The plugin ports the AIR TopK kernel from TensorRT-LLM and supports significantly larger K values than the native TensorRT kernel for the ONNX TopK operator. In 11.0.0, the plugin is OSS-only. The core TensorRT builder does not automatically fall back to it for large K values, so you must add the plugin to your network explicitly. Automatic fallback in the core TensorRT path may be added in a subsequent release.
For changes to TensorRT open-source components, including samples, plugins, and parsers, refer to the TensorRT GitHub Releases page.
Documentation
Rewritten Best Practices and Benchmarking guide: The Best Practices landing page now frames performance work as a measure-then-optimize feedback loop, and the Performance Benchmarking chapter has been rewritten to cover both ONNX-TRT (trtexec) and Torch-TRT workflows side by side in synchronized tabs. New or expanded coverage includes benchmarking basics, ModelOpt quantization, dynamic shapes, CUDA graphs, real input values, layer information and per-layer runtime, serialized engines and timing caches, built-in TensorRT profiling, and reading the Nsight Systems Timeline View.
New V2 → V3 plugin migration guide: A dedicated Migrating V2 Plugins to IPluginV3 chapter in the Inference Library covers the lifecycle-phase API mapping, the known V3 pitfalls (including the PluginField null-data crash and the weakly-typed-network fusion crash), and a V2 → V3 performance-regression checklist.
Breaking ABI Changes
Plugins that required cuDNN are no longer supported starting with TensorRT 11.0.0. cuDNN is no longer a TensorRT dependency, not even an optional one, but some applications may still use cuDNN directly, independently of TensorRT. Check your application’s requirements.
TensorRT 11.0 removed weak typing APIs along with several other deprecated APIs. See the NVIDIA TensorRT Migration Guide for more details.
Breaking Behavioral Changes
trtexec
The --useCudaGraph, --noDataTransfers, --useSpinWait, and --separateProfileRun flags are now enabled by default and deprecated. Each flag is still accepted but has no effect. For replacement flags and details, refer to the trtexec migration page.
The --stronglyTyped flag has no effect in TensorRT 11.0.0+, since strongly typed networks are now the default. The flag is still accepted for backward compatibility.
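Because the trtexec flags listed above are now no-ops, existing command lines can be tidied automatically. The following is a minimal sketch, assuming the command is available as a single shell-style string; the helper name is hypothetical, and only the flag names come from these notes. Stripping --stronglyTyped entirely also sidesteps the duplicate-flag parser error described under Known Issues.

```python
import shlex

# Flags that these release notes describe as accepted but having no effect.
NOOP_FLAGS = {
    "--useCudaGraph",
    "--noDataTransfers",
    "--useSpinWait",
    "--separateProfileRun",
    "--stronglyTyped",
}


def clean_trtexec_command(command):
    """Drop no-op flags from a trtexec command line.

    Returns (cleaned_command, removed_flags). Arguments are compared
    exactly, so flags written with a value (for example --flag=1) are
    left untouched.
    """
    kept, removed = [], []
    for arg in shlex.split(command):
        if arg in NOOP_FLAGS:
            removed.append(arg)
        else:
            kept.append(arg)
    return shlex.join(kept), removed
```

For example, `clean_trtexec_command("trtexec --onnx=model.onnx --useCudaGraph --stronglyTyped --stronglyTyped")` returns `("trtexec --onnx=model.onnx", ["--useCudaGraph", "--stronglyTyped", "--stronglyTyped"])`.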
Engine Inspector
The Bindings field in the IEngineInspector JSON output has been removed and replaced with the I/O Tensors field, which provides structured per-tensor metadata (name, I/O mode, data type, dimensions, location, and shape inference status).
The combined Format/DataType field in each layer’s information has been split into separate Format and DataType fields. For before/after examples, refer to the IEngineInspector migration page.
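Tools that parse inspector output may need to handle both the 10.x and 11.x field names during a transition. The sketch below shows one way to do that on already-parsed JSON (Python dicts). Only the field names Bindings, I/O Tensors, Format/DataType, Format, and DataType come from these notes; the helper names, the surrounding dictionary layout, and the sample values are hypothetical, so check the IEngineInspector migration page for the real structure.

```python
def io_tensors(engine_info):
    """Return the I/O tensor entries, preferring the 11.x field name.

    Falls back to the 10.x "Bindings" field, then to an empty list.
    """
    if "I/O Tensors" in engine_info:
        return engine_info["I/O Tensors"]
    return engine_info.get("Bindings", [])


def format_label(entry):
    """Return one display string for an entry's format and data type.

    11.x output carries separate "Format" and "DataType" fields; 10.x
    output carried a single combined "Format/DataType" field.
    """
    if "Format" in entry and "DataType" in entry:
        return f'{entry["Format"]} {entry["DataType"]}'
    return entry.get("Format/DataType", "unknown")


# Hypothetical sample entries, for illustration only.
old_entry = {"Name": "input", "Format/DataType": "Row major linear FP32"}
new_entry = {"Name": "input", "Format": "Row major linear", "DataType": "FP32"}
print(format_label(old_entry) == format_label(new_entry))  # prints "True"
```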
Compatibility
For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT 11.0.0.
Limitations
The high-precision weights used in FP4 double quantization are not refittable.
Python samples do not support Python 3.13. For Python 3.13, only the Python bindings are currently supported.
Loops with scan outputs (ILoopOutputLayer with the LoopOutput property set to either LoopOutput::kCONCATENATE or LoopOutput::kREVERSE) must have the number of iterations set; that is, they must have an ITripLimitLayer with TripLimit::kCOUNT. This requirement has always been present, but it is now explicitly enforced instead of silently resulting in undefined behavior.
ISelectLayer must have data inputs (thenInput and elseInput) of the same data type.
When implementing a custom layer using the IPluginV3 plugin class where the custom layer has a data-dependent shape (DDS), the size tensors must be of type INT64, not INT32; INT32 size tensors result in a compilation failure. Related samples have been updated accordingly.
In some cases, a Shuffle op cannot be transformed into a no-op for performance. For the NCHW32 format, TensorRT takes the third-to-last dimension as the channel dimension. When a Shuffle op such as [N, ‘C’, H, 1] -> [‘N’, C, H] is added, the channel dimension changes to N, so the op cannot be transformed into a no-op.
FP8 convolutions on GPUs with SM89/90/120/121 do not support kernel sizes larger than 32 (for example, 7x7 convolutions); FP16 or FP32 fallback kernels are used instead, with suboptimal performance. For better performance, do not add FP8 Q/DQ ops before convolutions with large kernel sizes.
When building the nonZeroPlugin sample on Windows, you might need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.
Deprecated API Lifetime
APIs deprecated in TensorRT 11.0.0 will be retained until 3/2027.
APIs deprecated in TensorRT 10.16 will be retained until 3/2027.
APIs deprecated in TensorRT 10.15 will be retained until 1/2027.
APIs deprecated in TensorRT 10.14 will be retained until 10/2026.
APIs deprecated in TensorRT 10.13 will be retained until 8/2026.
APIs deprecated in TensorRT 10.12 will be retained until 6/2026.
See also
- Migrating from TensorRT 10.x to 11.x
Step-by-step migration guide for upgrading from TensorRT 10.x.
- Upgrading TensorRT
Package-level upgrade instructions (pip, Debian, RPM, tar, zip).
Deprecated and Removed Features
For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.
Implicit quantization has been removed. The IInt8Calibrator class and all subclasses (IInt8EntropyCalibrator, IInt8EntropyCalibrator2, IInt8MinMaxCalibrator, IInt8LegacyCalibrator), setInt8Calibrator(), setDynamicRange(), resetDynamicRange(), and all related calibration and dynamic-range APIs have been removed. Use explicit quantization with Q/DQ nodes instead. Models can be quantized using the TensorRT Model Optimizer or by manually inserting Q/DQ nodes. For migration details, refer to the C++ API and Python API migration sections.
Built-in plugins deprecated before TensorRT 10 have been removed. The following plugin classes are no longer shipped: BatchedNMS_TRT, BatchedNMSDynamic_TRT, BatchTilePlugin_TRT, Clip_TRT, CoordConvAC, CropAndResize, CustomGeluPluginDynamic, EfficientNMS_ONNX_TRT, LReLU_TRT, NMS_TRT, NMSDynamic_TRT, Normalize_TRT, Proposal, SingleStepLSTMPlugin, SpecialSlice_TRT, and Split. For per-plugin replacements (most map to standard INetworkDefinition APIs such as addNMS(), addActivation(), addNormalizationV2(), addSlice(), or addLoop()), refer to the Removed C++ Plugins and Replacements table.
The entire IPluginV2 family has been removed: IPluginV2, IPluginV2Ext, IPluginV2IOExt, IPluginV2DynamicExt, IPluginCreator, IPluginV2Layer, and INetworkDefinition::addPluginV2(). Migrate custom plugins to IPluginV3 + IPluginCreatorV3One + addPluginV3(). For the V2 ↔ V3 API mapping, migration walkthrough, and known pitfalls, refer to Migrating V2 Plugins to IPluginV3 and the C++ API / Python API migration sections.
All weak-typing APIs have been removed. ITensor::setType, ITensor::setDynamicRange, ILayer::setPrecision, ILayer::setOutputType, INormalizationLayer::setComputePrecision, and the per-precision BuilderFlag values (kFP16, kBF16, kFP8, kINT8, kINT4, kFP4, kOBEY_PRECISION_CONSTRAINTS, kPREFER_PRECISION_CONSTRAINTS) are no longer available. The IRefitter calibration surface is also removed. Build with strongly typed networks instead; see the NVIDIA TensorRT Migration Guide for the conversion path.
The OSS BERT plugin family is deprecated in this release and will be removed in a future release. The deprecated classes are bertQKVToContextPlugin (also registered as CustomQKVToContextPluginDynamic, versions 1 through 4) and CustomEmbLayerNormPluginDynamic. The companion CustomFCPluginDynamic (fcPlugin) was already deprecated in TensorRT 10.6 and is part of the same family. Migrate to the native attention path using INetworkDefinition::addAttentionV2() for QKV-to-context, and to the standard IGatherLayer followed by INetworkDefinition::addNormalizationV2() for embedding + layer normalization. For per-plugin replacement guidance, refer to the Deprecated BERT Plugins section in the C++ API migration guide.
The boolean causal parameter in addAttention and the legacy IAttention::setCausal(bool) / IAttention::getCausal() methods are deprecated in this release. Use IAttention::setCausalKind(CausalMaskKind) / IAttention::getCausalKind() instead for configuring causal mask orientation. These legacy boolean APIs will be removed in a future release.
Windows 10 support is considered deprecated and Windows 10 will no longer be supported by TensorRT after October 2026. Windows 10 has been End-of-Life since October 2025. GeForce driver support for Windows 10 will also be reduced at that time. For more information, refer to GeForce Support Plan for Windows 10.
The TensorRT Python bindings for Python versions 3.8 and 3.9 are considered deprecated starting with this release. These Python versions are considered End-of-Life by Python upstream and have not been supported by TensorRT’s samples for multiple releases. These bindings will be removed in a future release.
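Before upgrading, it can help to find where a codebase still uses the removed APIs listed in this section. The following is a rough sketch of such a scan, assuming plain-text source files; the identifier lists are copied from the items above (a deliberately partial selection), while the function name and grouping labels are hypothetical. Matching is by whole word, so it can still report false positives for unrelated methods that share a name.

```python
import re

# Identifiers taken from the removal notes above. This is a partial list:
# very generic names such as setType are omitted to limit false positives.
REMOVED_APIS = {
    "implicit quantization": [
        "IInt8Calibrator", "IInt8EntropyCalibrator", "IInt8EntropyCalibrator2",
        "IInt8MinMaxCalibrator", "IInt8LegacyCalibrator", "setInt8Calibrator",
        "setDynamicRange", "resetDynamicRange",
    ],
    "IPluginV2 family": [
        "IPluginV2", "IPluginV2Ext", "IPluginV2IOExt", "IPluginV2DynamicExt",
        "IPluginCreator", "IPluginV2Layer", "addPluginV2",
    ],
    "weak typing": ["setPrecision", "setOutputType", "setComputePrecision"],
}


def find_removed_apis(source_text):
    """Return a list of (line_number, identifier, group) for each match.

    Word boundaries keep IPluginCreator from matching IPluginCreatorV3One.
    """
    hits = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for group, names in REMOVED_APIS.items():
            for name in names:
                if re.search(rf"\b{re.escape(name)}\b", line):
                    hits.append((lineno, name, group))
    return hits
```

Running the scan over each source file and grouping the hits by category gives a quick estimate of which migration chapters (quantization, plugins, or strong typing) apply to a project.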
Known Issues
Functional
The ECCommunicatorAPITests.SetCommunicatorFailsWithoutSupportedLayer and ECCommunicatorAPITests.SetCommunicatorSucceedsWithDistCollective runtime tests report errors under valgrind (host) and compute-sanitizer memcheck (GPU). Host-side memory leaks are caused by NCCL internal allocations during ncclCommInitRank and ncclCommSplit and reproduce on H100 and B100 platforms. GPU-side errors are NCCL kernel probing failures that occur on H100 (SM 90) when the sanitizer intercepts CUDA API errors before NCCL can handle them internally; B100 is not affected on the GPU side.
Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shapes will be rejected.
On CUDA versions prior to 13.2, the compute-sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored.
For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.
The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows.
TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if there is inconsistency between the indices specified in the batch_indices input and the actual batch size used.
IPluginV3 plugins that advertise a PluginField with a nullptr data pointer and length == 0 can crash inside the V3 creator dispatch path during build or deserialization. As a workaround, populate every PluginField entry with a non-null sentinel buffer, even when the value is unused at runtime. For details and a code snippet, refer to Known Migration Issues. Tracked by NVBug 5607435.
IPluginV3 plugins used in weakly-typed networks may trigger fusion-path crashes that did not exist for IPluginV2DynamicExt. The supported configuration in TensorRT 11.0.0 is to build with strongly-typed networks (which is the default in 11.0.0, since the precision-enabling BuilderFlag values have been removed). Refer to Known Migration Issues.
trtexec exits with [E] Unknown option: --stronglyTyped when the deprecated --stronglyTyped flag is specified more than once on the command line. The first occurrence is accepted as a deprecated no-op, but the argument parser rejects subsequent occurrences instead of also treating them as no-ops. As a workaround, ensure --stronglyTyped appears at most once, or omit it entirely since strongly typed networks are now the default in TensorRT 11.0.0. This will be fixed in a future release.
Several hyperlinks in the README.md of the OSS samples/python/quickly_deployable_plugins Python sample are broken, including references to the circular padding plugin section and other in-document anchors. Users following the README may be unable to navigate to the intended sections. The README content itself is still correct; only the cross-references are affected. This will be fixed in a future release.
The ECCommunicatorAPITests.SetCommunicatorWorksWithAllAllocationStrategies runtime test can time out on B300 platforms when running against NCCL 2.29.4. The timeout is caused by a long (~21–22 second) cold-initialization latency in the first ncclCommInitRank call, which NCCL attributes to its “kernels init” phase; the behavior reproduces with an NCCL-only reproducer and is not caused by TensorRT runtime logic. Upgrading to NCCL 2.30.x reduces the first-init cost to under one second. TensorRT requires only a minimum NCCL 2.x and supports newer minor versions, so applications affected by this latency on B300 can update NCCL independently. This will be addressed in a future release.
ONNX Runtime release 1.24.4 and earlier cannot compile their TensorRT Execution Provider against TensorRT 11.0.0 because the provider still references APIs that were removed in TensorRT 11.0, namely IBuilder::platformHasFastFp16(), IBuilder::platformHasFastInt8(), and IBuilderConfig::setInt8Calibrator() (removed as part of the weak-typing and implicit-quantization cleanup; refer to the NVIDIA TensorRT Migration Guide). Building ONNX Runtime ≤ 1.24.4 against TensorRT 11.0.0 produces C2039 'member not found' compilation errors on these symbols. Until an ONNX Runtime release adopting the TensorRT 11.x API is available, applications integrating ONNX Runtime with TensorRT must either remain on TensorRT 10.x or apply a local patch that removes the references and substitutes the strongly-typed / explicit-quantization equivalents.
On Windows, the TensorRT 11.0.0 runtime can fail to deserialize a Version Compatible engine that was built with TensorRT 10.1.0. Deserialization fails inside the runtime dispatch VC plugin loader when loading nvinfer_vc_plugin.dll (Windows ERROR_MOD_NOT_FOUND / error 126), surfaces as TensorRT error code 6 (API Usage Error), and leaves the returned ICudaEngine pointer null. Later TensorRT 10.x builders are not affected; only engines produced by TensorRT 10.1.0 have been observed to trigger this failure. As a workaround, rebuild the affected VC engines with a newer TensorRT 10.x release (10.2 or later) before deserializing them with the TensorRT 11.0.0 runtime on Windows. This will be addressed in a future release.
On NVIDIA Blackwell B100, some networks that use attention patterns can have invalid memory access or fail during IExecutionContext initialization with a CUDA driver error reporting CUDA_ERROR_INVALID_HANDLE (“Cannot pass CUkernel handle to this API”). The failure has been observed on B100 with Ubuntu 24.04 and CUDA 13.2.1. This will be addressed in a future release.
Performance
EfficientNet/RegNet has a ~5 - 10% performance regression on RTX PRO 6000 Blackwell platform.
A non-zero tilingOptimizationLevel might introduce engine build failures for some networks on L4 GPUs.
Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refit engines when convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.
On RTX PRO 6000 Blackwell Max-Q (sm120, aarch64), some strongly-typed FP16 networks exhibit GPU time regressions compared with TensorRT 10.16.1. The slowdown is concentrated in fused-graph kernels (in particular, fused Conv/BiasAdd patterns and mean-reduction kernels) and has been observed at up to ~45% on individual networks. The regression is stable across reruns. This issue will be addressed in a future release.