TensorRT 10.1.0 Release Notes#
These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.
Announcements#
ONNX GraphSurgeon Packaging Change: ONNX GraphSurgeon is no longer included inside the TensorRT package. Remove any Debian or RPM packages installed by a previous release and install ONNX GraphSurgeon using pip instead. For more information, refer to onnx-graphsurgeon.
NVIDIA Volta Deprecation: NVIDIA Volta support (GPUs with compute capability 7.0) is deprecated starting with TensorRT 10.0 and may be removed after September 2024. Plan migration to supported GPU architectures.
Plugin Deprecations: Version 1 of the ROIAlign plugin (ROIAlign_TRT) has been deprecated in favor of version 2, which implements the IPluginV3 interface. Additionally, the TensorRT standard plugin shared library now only exports initLibNvInferPlugins, and symbols in the nvinfer1::plugin namespace are no longer exported.
API Deprecations: Several APIs have been deprecated, including weight streaming V1 APIs (superseded by V2), ONNX parser’s supportsModel method (superseded by V2 and new subgraph methods), and INT8 implicit quantization and calibrator APIs (use explicit quantization instead). For details, refer to the Deprecated and Removed Features section.
Key Features and Enhancements#
Weight Streaming Enhancements
Advanced Weight Streaming APIs: Added new APIs for improved weight streaming control, including setWeightStreamingBudgetV2, getWeightStreamingBudgetV2, getWeightStreamingAutomaticBudget, and getWeightStreamingScratchMemorySize. Weight streaming now supports CUDA graphs and multiple execution contexts running in parallel, significantly improving flexibility for large model deployment.
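For illustration, here is a minimal Python sketch of the runtime flow; it assumes an engine built with weight streaming enabled (trt.BuilderFlag.WEIGHT_STREAMING on a strongly typed network) and that the Python bindings expose the V2 APIs above as the snake_case engine properties used below.

```python
# Hedged sketch: set a weight streaming budget on a deserialized engine.
# Assumes "model.plan" was built with trt.BuilderFlag.WEIGHT_STREAMING and that
# the snake_case property names mirror the C++ V2 APIs listed above.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("model.plan", "rb") as f:  # placeholder engine file
    engine = runtime.deserialize_cuda_engine(f.read())

# Query an automatic budget, or set an explicit byte budget of your own.
budget = engine.weight_streaming_automatic_budget   # assumed property name
engine.weight_streaming_budget_v2 = budget           # assumed property name
print("budget:", engine.weight_streaming_budget_v2,
      "scratch:", engine.weight_streaming_scratch_memory_size)  # assumed name

# Contexts created afterwards stream weights under the configured budget;
# this release also allows CUDA graph capture and multiple parallel contexts.
context = engine.create_execution_context()
```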
Memory Management
Enhanced Device Memory APIs: Added new APIs for more precise device memory management: ICudaEngine::getDeviceMemorySizeV2 and IExecutionContext::setDeviceMemoryV2. The V1 APIs are deprecated in favor of these improved versions.
Plugin and ONNX Parser Enhancements
Shape Inputs for Custom Ops: When using the TensorRT ONNX parser, shape inputs can now be passed to custom ops supported by IPluginV3-based plugins. The indices of the inputs to be interpreted as shape inputs must be indicated by a node attribute named tensorrt_plugin_shape_input_indices containing a list of integers, as in the sketch below.
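For illustration, a minimal ONNX GraphSurgeon sketch that marks one input of a hypothetical custom op (MyCustomOp, a placeholder name) as a shape input using the attribute described above:

```python
# Hypothetical example: tag input index 1 of a custom op as a shape input so the
# TensorRT ONNX parser forwards it to an IPluginV3-based plugin.
import numpy as np
import onnx
import onnx_graphsurgeon as gs

data = gs.Variable("data", dtype=np.float32, shape=(1, 3, 32, 32))
target_shape = gs.Variable("target_shape", dtype=np.int64, shape=(2,))
out = gs.Variable("output", dtype=np.float32)

node = gs.Node(
    op="MyCustomOp",  # placeholder op handled by an IPluginV3 plugin
    inputs=[data, target_shape],
    outputs=[out],
    # Input 1 (target_shape) is interpreted as a shape input by the parser.
    attrs={"tensorrt_plugin_shape_input_indices": [1]},
)

graph = gs.Graph(nodes=[node], inputs=[data, target_shape], outputs=[out], opset=17)
onnx.save(gs.export_onnx(graph), "custom_op_shape_input.onnx")
```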
Plugin Field Data Access: TensorRT Python bindings now natively support accessing the data attribute of a PluginField of PluginFieldType.INT64 and PluginFieldType.UNKNOWN as NumPy arrays. The NumPy functions tobytes() and frombuffer() can be used during storage and retrieval to embed an arbitrary NumPy array in a PluginFieldType.UNKNOWN field, as sketched below.
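A small sketch of that round trip; the field name, dtype, and shape are placeholders, and the retrieval half is what a plugin creator would typically do with the received field:

```python
# Embed an arbitrary NumPy array in a PluginFieldType.UNKNOWN field with
# tobytes(), then recover it with frombuffer(). Names/shapes are placeholders.
import numpy as np
import tensorrt as trt

payload = np.arange(12, dtype=np.float16).reshape(3, 4)

# Storage side: pass the serialized bytes as the field data.
field = trt.PluginField(
    "custom_payload",  # placeholder field name
    np.frombuffer(payload.tobytes(), dtype=np.uint8),
    trt.PluginFieldType.UNKNOWN,
)

# Retrieval side (e.g., inside a plugin creator): the data attribute is exposed
# as a NumPy array; reinterpret it with the original dtype and shape.
raw = np.asarray(field.data, dtype=np.uint8)
restored = np.frombuffer(raw.tobytes(), dtype=np.float16).reshape(3, 4)
assert np.array_equal(restored, payload)
```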
IPluginV3 Auto-Tuning Improvements: During IPluginV3 auto-tuning, it is now guaranteed that configurePlugin() is called with the current input/output format combination being timed before getValidTactics() is called. This enables advertising a different set of tactics for each input/output format combination.
Exception Handling: Internal exceptions are now contained and won't leak through the parser API boundaries, improving robustness and the debugging experience.
Operator Support
New ONNX Operator Support: Added native TensorRT layer support for the ONNX operator IsNaN, and added TensorRT plugin support for the ONNX operator DeformConv, expanding model compatibility.
Samples and Tools
Python Sample Addition: Added a new Python sample, non_zero_plugin, which is a Python version of the C++ sample sampleNonZeroPlugin and demonstrates plugin implementation in Python.
Bug Fixes and Performance
L4T Cross-Compilation: Fixed an issue where the sampleNonZeroPlugin sample failed to build when cross-compiling for L4T.
CUDA Compatibility: Fixed sampleNonZeroPlugin to guarantee CUDA minor version compatibility across different CUDA Toolkit releases and driver versions.
Plugin Registry: Fixed IPluginRegistry::deregisterLibrary() to work correctly with plugin shared libraries that define the getPluginCreators() entry point.
Multi-Head Attention (MHA) Performance: Fixed unoptimized fused Multi-Head Attention (MHA) performance on A30 GPUs.
ONNX Runtime Integration: Fixed an issue preventing the use of the prebuilt parser when building the TensorRT backend of ONNX Runtime.
Weight Stripping: Fixed an issue in Polygraphy's engine_from_network API where enabling both refittable and strip_plan did not properly strip engine weights.
Large Tensor Support: Fixed TensorRT to support attention operations for tensors larger than the int32_t maximum, eliminating the need for plugin workarounds.
API Documentation: Corrected API documentation that incorrectly stated casting (Cast) to INT8 is possible; use a QuantizeLinear node instead.
Refit Performance: Improved refit performance on Multi-Head Attention (MHA) and if/while loops with explicit quantization by eliminating slow memcpyDeviceToHost operations for Q/DQ scales.
Stable Diffusion Performance: Fixed an up to 9% performance regression for Stable Diffusion VAE networks on A16 and A40 compared to TensorRT 9.2.
H100 Stability: Fixed hanging issues on H100 with the r550 CUDA driver when CUDA graphs were used, and resolved GPU hang issues with high persistent cache usage.
Instance Normalization: Fixed performance issues when running instance normalization layers on Arm Server Base System Architecture (SBSA).
L4T Platform Fixes: Fixed Compute Sanitizer hang issues on L4T and hardware forward compatibility issues for ViT, Swin-Transformers, and BERT networks in FP16 mode on L4T Concord.
LSTM Build Stability: Fixed an issue where LSTM networks could fail to build with timing cache enabled due to pre-existing cache entries.
API Enhancements
API Change Tracking: To view API changes between releases, refer to the TensorRT GitHub repository and use the compare tool.
Compatibility#
TensorRT 10.1.0 has been tested with the following:
PyTorch >= 2.0 (refer to the requirements.txt file for each sample)
This TensorRT release supports NVIDIA CUDA:
This TensorRT release requires at least NVIDIA driver r450 on Linux or r452 on Windows as required by CUDA 11.0, the minimum CUDA version supported by this TensorRT release.
Limitations#
On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.
The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless the user merges the transposes manually in the model definition in advance.
nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in implicit batch mode.
For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops, such as opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.
The kREFIT and kREFIT_IDENTICAL builder flags have performance regressions compared with non-refit engines where convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.
Weight streaming mainly supports GEMM-based networks such as Transformers for now. Convolution-based networks may have only a few weights that can be streamed.
Deprecated API Lifetime#
APIs deprecated in TensorRT 10.1 will be retained until 5/2025.
APIs deprecated in TensorRT 10.0 will be retained until 3/2025.
Deprecated and Removed Features#
For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.
Deprecated NVIDIA Volta support (GPUs with compute capability 7.0) starting with TensorRT 10.0. Volta support may be removed after 9/2024.
Version 1 of the ROIAlign plugin (ROIAlign_TRT), which implemented IPluginV2DynamicExt, is deprecated. It is superseded by version 2, which implements IPluginV3.
The TensorRT standard plugin shared library (libnvinfer_plugin.so/nvinfer_plugin.dll) only exports initLibNvInferPlugins. Symbols in the nvinfer1::plugin namespace are no longer exported.
Deprecated IParser::supportsModel and replaced this method with IParser::supportsModelV2, IParser::getNbSubgraphs, IParser::isSubgraphSupported, and IParser::getSubgraphNodes.
Deprecated some weight streaming APIs, including setWeightStreamingBudget, getWeightStreamingBudget, and getMinimumWeightStreamingBudget. They are replaced by new versions of the weight streaming APIs.
Deprecated INT8 implicit quantization and calibrator APIs, including dynamicRangeIsSet, CalibrationAlgoType, IInt8Calibrator, IInt8EntropyCalibrator, IInt8EntropyCalibrator2, IInt8MinMaxCalibrator, setInt8Calibrator, getInt8Calibrator, setCalibrationProfile, getCalibrationProfile, setDynamicRange, getDynamicRangeMin, getDynamicRangeMax, and getTensorsWithDynamicRange. These APIs may not give optimal performance and accuracy; use INT8 explicit quantization instead (see the sketch following this list).
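For reference, a minimal sketch of explicit quantization via Q/DQ layers in the network definition API (the tensor shapes and scale value are placeholders; in an ONNX workflow, the equivalent is inserting QuantizeLinear/DequantizeLinear nodes):

```python
# Hedged sketch: explicit INT8 quantization with Q/DQ layers instead of
# implicit quantization and calibrators. Shapes and scales are placeholders.
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

inp = network.add_input("input", trt.float32, (1, 64))
scale = network.add_constant(
    (1,), trt.Weights(np.array([0.05], dtype=np.float32))
).get_output(0)

# Quantize to INT8 and dequantize back; TensorRT fuses the Q/DQ pair with the
# surrounding layers and runs the enclosed region in INT8 where profitable.
q = network.add_quantize(inp, scale)
dq = network.add_dequantize(q.get_output(0), scale)
act = network.add_activation(dq.get_output(0), trt.ActivationType.RELU)
network.mark_output(act.get_output(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # allow INT8 kernels for Q/DQ regions
serialized_engine = builder.build_serialized_network(network, config)
```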
Fixed Issues#
The sampleNonZeroPlugin sample failed to build when cross-compiling for L4T. The workaround was to continue building the other samples by modifying samples/Makefile and removing the line containing sampleNonZeroPlugin. This issue has been fixed.
The sampleNonZeroPlugin sample did not guarantee CUDA minor version compatibility. That is, if built against a newer CUDA Toolkit release, it may not function properly on older drivers, even within the same major CUDA release family. This issue has been fixed.
IPluginRegistry::deregisterLibrary() did not work with plugin shared libraries that defined the getPluginCreators() entry point. IPluginRegistry::loadLibrary() was not impacted. The workaround was to deregister the plugins contained in such a library by manually querying the library for getPluginCreators() and invoking IPluginRegistry::deregisterCreator() for each creator retrieved. This issue has been fixed.
On A30, some fused MHA (multi-head attention) performance was not yet optimized. This issue has been fixed.
When building the TensorRT backend of ONNX Runtime, the prebuilt parser could not be used. This issue has been fixed.
When using the Polygraphy engine_from_network API, enabling both refittable and strip_plan in create_config resulted in the final engine weights not being stripped. The workaround was to include only strip_plan in create_config. This issue has been fixed.
TensorRT did not support attention operations for tensors larger than the int32_t maximum. Plugins could be used to work around this issue. The issue has been fixed.
The API docs incorrectly stated that Cast to the INT8 format is possible, but this path is not supported. Use a QuantizeLinear node instead. This issue has been fixed in the API docs.
When using refit on multi-head attention or if/while loops with explicit quantization, the refit process could be slow due to the implementation's memcpyDeviceToHost for the Q/DQ scales. This issue has been fixed.
There was an up to 9% performance regression for StableDiffusion VAE networks on A16 and A40 compared to TensorRT 9.2. The workaround was to disable the kNATIVE_INSTANCENORM flag in the ONNX parser or add the --pluginInstanceNorm flag to trtexec. This issue has been fixed.
There was a small chance that TensorRT would hang when running on H100 with the r550 CUDA driver when CUDA graphs were used. The workaround was to use the r535 CUDA driver instead or to avoid using CUDA graphs. This issue has been fixed.
There was a known issue on H100 that may have led to a GPU hang when running TensorRT with high persistentCache usage. The workaround was to limit the usage to 40% of L2 cache size. This issue has been fixed.
There was a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA). This issue has been fixed.
There were some issues when running TensorRT-LLM with TensorRT 10.0 with StronglyTyped mode enabled. The workaround was to disable StronglyTyped mode.
Running sync/race checks with a newer Compute Sanitizer on L4T could hit a hang issue. The workaround was to try an older version of Compute Sanitizer. This issue has been fixed.
Hardware forward compatibility (HFC) was broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. The workaround was to only use FP32 mode on L4T Concord or turn off HFC. This issue has been fixed.
LSTM networks could fail to build with the timing cache enabled. This was observed on one GPU platform and only when building with a cache that had pre-existing entries. The error signature would contain:
Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:operator():1502] Internal bug. Please report with reproduction steps.
The workaround was to disable the timing cache or start a fresh one. This issue has been fixed.
Known Issues#
Functional
The tensorrt Python metapackage does not pin the version for the Python module dependency tensorrt-cu12. For example, using pip install tensorrt==10.0.1 will install tensorrt-cu12==10.1.0 rather than tensorrt-cu12==10.0.1 as expected. The workaround is to instead specify the package name including the CUDA version, such as pip install tensorrt-cu12==10.0.1. This issue will be fixed in TensorRT 10.2.
The Python sample yolo_v3_onnx does not support Python 3.12. Support will be added in 10.2.
If TensorRT 8.6 or 9.x was installed using the Python Package Index (PyPI), you cannot upgrade TensorRT to 10.x using PyPI. You must first uninstall TensorRT using pip uninstall tensorrt tensorrt-libs tensorrt-bindings, then reinstall TensorRT using pip install tensorrt. This will remove the previous TensorRT version and install the latest TensorRT 10.x. This step is required because the suffix -cuXX was added to the Python package names, which prevents the upgrade from working properly.
GPU memory allocated during autotuning might not be freed correctly if an allocation failed due to inadequate resources, causing build-time memory usage to be larger than that of inference time.
CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.
The Compute Sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.
Multi-Head Attention fusion might not happen, which can affect performance, if the number of heads is small.
If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.
An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.
There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.
{
   Memory leak errors with dlopen.
   Memcheck:Leak
   match-leak-kinds: definite
   ...
   fun:*dlopen*
   ...
}
{
   Memory leak errors with nvrtc
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   obj:*libnvrtc.so*
   ...
}
SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.
Installing the cuda-compat-11-4 package may interfere with CUDA enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.
For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.
For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.
Exclusive padding with kAVERAGE pooling is not supported.
The Valgrind tool found a memory leak on L4T with CUDA 12.4 due to a known driver issue. This is expected to be fixed in CUDA 12.6.
Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.
For the samples python/efficientdet and python/tensorflow_object_detection_api, the ONNX version needs to be manually downgraded to 1.14 in their respective requirements.txt files for the samples to function correctly.
There is a known accuracy issue when binding an INT4 tensor as a network output. To work around this, add an IDequantizeLayer before the output.
There is a known accuracy issue when the network contains two consecutive GEMV operations (that is, MatrixMultiply with gemmM or gemmN == 1). To work around this issue, try padding the MatrixMultiply input to have dimensions greater than 1.
Engine building with weight streaming enabled will fail when the model size is larger than the free device memory size. This issue will be fixed in the next version.
Performance
There is an up to 10% performance regression for ConvNext on NVIDIA Orin compared to TensorRT 9.3.
There are known performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.
There is an up to 4x performance regression for networks containing GridSample ops compared to TensorRT 9.2.
There are up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.
Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.
There are performance gaps for StableDiffusion networks between Windows and Linux platforms.
Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.
Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.
Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.
Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.
Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.
Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.
For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.
There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.
Up to 5% performance drop for networks using sparsity in FP16 precision.
Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.
Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this, as in the sketch at the end of this section.
In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.
Weight streaming performance may decrease when you create execution contexts with multiple optimization profiles using external device memory and call setDeviceMemory/setDeviceMemoryV2 before setOptimizationProfileAsync. This issue will be fixed in the next version.
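As referenced in the BERT INT8 item above, a minimal builder-config sketch of that workaround (network creation and parsing omitted); it only illustrates the flag settings, not a full build script:

```python
# Hedged sketch: enable FP16 and leave INT8 disabled in the builder config to
# work around the BERT INT8 regression on L4 GPUs noted above.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

config.set_flag(trt.BuilderFlag.FP16)    # enable FP16 kernels
config.clear_flag(trt.BuilderFlag.INT8)  # ensure INT8 stays disabled

# serialized_engine = builder.build_serialized_network(network, config)
```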