TensorRT 10.1.0 Release Notes#

These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.

Announcements#

ONNX GraphSurgeon Packaging Change: ONNX GraphSurgeon is no longer included inside the TensorRT package. Remove any previous Debian or RPM packages that may have been installed by a previous release and instead install ONNX GraphSurgeon using pip. For more information, refer to onnx-graphsurgeon.

NVIDIA Volta Deprecation: NVIDIA Volta support (GPUs with compute capability 7.0) is deprecated starting with TensorRT 10.0 and may be removed after September 2024. Plan migration to supported GPU architectures.

Plugin Deprecations: Version 1 of the ROIAlign plugin (ROIAlign_TRT) has been deprecated in favor of version 2, which implements the IPluginV3 interface. Additionally, the TensorRT standard plugin shared library now only exports initLibNvInferPlugins, and symbols in the nvinfer1::plugin namespace are no longer exported.

API Deprecations: Several APIs have been deprecated, including weight streaming V1 APIs (superseded by V2), ONNX parser’s supportsModel method (superseded by V2 and new subgraph methods), and INT8 implicit quantization and calibrator APIs (use explicit quantization instead). For details, refer to the Deprecated and Removed Features section.

Key Features and Enhancements#

Weight Streaming Enhancements

  • Advanced Weight Streaming APIs: Added new APIs for improved weight streaming control including setWeightStreamingBudgetV2, getWeightStreamingBudgetV2, getWeightStreamingAutomaticBudget, and getWeightStreamingScratchMemorySize. Weight streaming now supports CUDA graphs and multiple execution contexts running in parallel, significantly improving flexibility for large model deployment.
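
A minimal sketch of exercising the new budget APIs from Python follows. The attribute names weight_streaming_budget_v2 and get_weight_streaming_automatic_budget are assumed Python counterparts of the C++ APIs named above, and plan.engine stands in for an engine serialized with the weight streaming builder flag enabled.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)

    # "plan.engine" is a placeholder for a plan built with weight streaming enabled.
    with open("plan.engine", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    # Assumed Python counterparts of getWeightStreamingAutomaticBudget and
    # setWeightStreamingBudgetV2: let TensorRT pick a budget, then apply it.
    budget = engine.get_weight_streaming_automatic_budget()
    engine.weight_streaming_budget_v2 = budget

    # Multiple execution contexts sharing the streamed weights can now run in
    # parallel, and CUDA graphs can be captured against them.
    contexts = [engine.create_execution_context() for _ in range(2)]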

Memory Management

  • Enhanced Device Memory APIs: Added new APIs for more precise device memory management: ICudaEngine::getDeviceMemorySizeV2 and IExecutionContext::setDeviceMemorySizeV2. The V1 APIs are deprecated in favor of these improved versions.

Plugin and ONNX Parser Enhancements

  • Shape Inputs for Custom Ops: When using the TensorRT ONNX parser, shape inputs can now be passed to custom ops supported by IPluginV3-based plugins. The indices of the inputs to be interpreted as shape inputs must be indicated by a node attribute named tensorrt_plugin_shape_input_indices containing a list of integers (see the first sketch following this list).

  • Plugin Field Data Access: TensorRT Python bindings now natively support accessing the data attribute of a PluginField of type PluginFieldType.INT64 or PluginFieldType.UNKNOWN as a NumPy array. The NumPy functions tobytes() and frombuffer() can be used during storage and retrieval to embed an arbitrary NumPy array in a PluginFieldType.UNKNOWN field (see the second sketch following this list).

  • IPluginV3 Auto-Tuning Improvements: During IPluginV3 auto-tuning, it is now guaranteed that configurePlugin() is called with the current input/output format combination being timed before getValidTactics() is called. This enables advertising a different set of tactics for each input/output format combination.

  • Exception Handling: Internal exceptions are now contained and won’t leak through the parser API boundaries, improving robustness and debugging experience.
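
Below is a minimal sketch, for the Shape Inputs item above, of marking shape inputs with ONNX GraphSurgeon. The op name MyShapeAwareOp, the file names, and the choice of input index 1 are hypothetical; only the tensorrt_plugin_shape_input_indices attribute name comes from this release.

    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("model.onnx"))  # hypothetical model path
    for node in graph.nodes:
        if node.op == "MyShapeAwareOp":  # hypothetical custom op backed by an IPluginV3 plugin
            # Tell the TensorRT ONNX parser to treat the node's second input
            # (index 1) as a shape input.
            node.attrs["tensorrt_plugin_shape_input_indices"] = [1]
    onnx.save(gs.export_onnx(graph), "model_marked.onnx")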
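
And a minimal sketch, for the Plugin Field Data Access item above, of round-tripping an arbitrary NumPy array through a PluginFieldType.UNKNOWN field with tobytes()/frombuffer(); the field names and array contents are illustrative only.

    import numpy as np
    import tensorrt as trt

    arr = np.arange(6, dtype=np.float16).reshape(2, 3)  # arbitrary array to embed

    # Storage: pack the raw bytes into an UNKNOWN-typed field via tobytes().
    blob = trt.PluginField("custom_blob",
                           np.frombuffer(arr.tobytes(), dtype=np.uint8),
                           trt.PluginFieldType.UNKNOWN)
    count = trt.PluginField("count", np.array([arr.size], dtype=np.int64),
                            trt.PluginFieldType.INT64)

    # Retrieval (for example, inside a plugin creator): the data attribute is a
    # NumPy array, so frombuffer() recovers the original contents.
    restored = np.frombuffer(blob.data, dtype=np.float16).reshape(2, 3)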

Operator Support

  • New ONNX Operator Support: Added native TensorRT layer support for the ONNX operator IsNaN, and added TensorRT plugin support for the ONNX operator DeformConv, expanding model compatibility.

Samples and Tools

  • Python Sample Addition: Added a new Python sample non_zero_plugin, which is a Python version of the C++ sample sampleNonZeroPlugin, demonstrating plugin implementation in Python.

Bug Fixes and Performance

  • L4T Cross-Compilation: Fixed an issue where the sampleNonZeroPlugin sample failed to build when cross-compiling for L4T.

  • CUDA Compatibility: Fixed sampleNonZeroPlugin to guarantee CUDA minor version compatibility across different CUDA Toolkit releases and driver versions.

  • Plugin Registry: Fixed IPluginRegistry::deregisterLibrary() to work correctly with plugin shared libraries that define the getPluginCreators() entry point.

  • Multi-Head Attention (MHA) Performance: Fixed unoptimized fused Multi-Head Attention (MHA) performance on A30 GPUs.

  • ONNX Runtime Integration: Fixed an issue preventing the use of prebuilt parser when building the TensorRT backend of ONNX Runtime.

  • Weight Stripping: Fixed an issue in Polygraphy’s engine_from_network API where enabling both refittable and strip_plan did not properly strip engine weights.

  • Large Tensor Support: Fixed TensorRT to support attention operations for tensors larger than int32_t maximum, eliminating the need for plugin workarounds.

  • API Documentation: Corrected API documentation that incorrectly stated casting to the INT8 format is possible; a QuantizeLinear node must be used instead.

  • Refit Performance: Improved refit performance on Multi-Head Attention (MHA) and if/while loops with explicit quantization by eliminating slow memcpyDeviceToHost operations for Q/DQ scales.

  • Stable Diffusion Performance: Fixed an up to 9% performance regression for Stable Diffusion VAE networks on A16 and A40 compared to TensorRT 9.2.

  • H100 Stability: Fixed hanging issues on H100 with the r550 CUDA driver when CUDA graphs were used, and resolved GPU hang issues with high persistent cache usage.

  • Instance Normalization: Fixed performance issues when running instance normalization layers on Arm Server Base System Architecture (SBSA).

  • L4T Platform Fixes: Fixed Compute Sanitizer hang issues on L4T and hardware forward compatibility issues for ViT, Swin-Transformers, and BERT networks in FP16 mode on L4T Concord.

  • LSTM Build Stability: Fixed an issue where LSTM networks could fail to build with timing cache enabled due to pre-existing cache entries.

API Enhancements

Compatibility#

Limitations#

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.

  • The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless the user merges the transposes manually in the model definition in advance.

  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.

  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops, such as opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.

  • The kREFIT and kREFIT_IDENTICAL builder flags have performance regressions compared with non-refit engines when convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.

  • Weight streaming mainly supports GEMM-based networks like Transformers for now. Convolution-based networks may have only a few weights that can be streamed.

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 10.1 will be retained until 5/2025.

  • APIs deprecated in TensorRT 10.0 will be retained until 3/2025.

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • Deprecated NVIDIA Volta support (GPUs with compute capability 7.0) starting with TensorRT 10.0. Volta support may be removed after 9/2024.

  • Version 1 of the ROIAlign plugin (ROIAlign_TRT), which implemented IPluginV2DynamicExt, is deprecated. It is superseded by version 2, which implements IPluginV3.

  • The TensorRT standard plugin shared library (libnvinfer_plugin.so / nvinfer_plugin.dll) only exports initLibNvInferPlugins. No symbols in the nvinfer1::plugin namespace are exported anymore.

  • Deprecated IParser::supportsModel; it is replaced by IParser::supportsModelV2, IParser::getNbSubgraphs, IParser::isSubgraphSupported, and IParser::getSubgraphNodes (a sketch of the new query flow follows this list).

  • Deprecated the V1 weight streaming APIs setWeightStreamingBudget, getWeightStreamingBudget, and getMinimumWeightStreamingBudget. They are superseded by the V2 weight streaming APIs.

  • Deprecated INT8 implicit quantization and calibrator APIs, including dynamicRangeIsSet, CalibrationAlgoType, IInt8Calibrator, IInt8EntropyCalibrator, IInt8EntropyCalibrator2, IInt8MinMaxCalibrator, setInt8Calibrator, getInt8Calibrator, setCalibrationProfile, getCalibrationProfile, setDynamicRange, getDynamicRangeMin, getDynamicRangeMax, and getTensorsWithDynamicRange. Implicit quantization may not give optimal performance and accuracy; use INT8 explicit quantization instead (a short explicit-quantization sketch follows this list).
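
A minimal sketch of the new subgraph-query flow referenced from the parser item above. The snake_case names supports_model_v2, num_subgraphs, is_subgraph_supported, and get_subgraph_nodes are assumed Python counterparts of the C++ methods listed there, and model.onnx is a placeholder path.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network()
    parser = trt.OnnxParser(network, logger)

    with open("model.onnx", "rb") as f:  # placeholder model path
        model = f.read()

    # Assumed Python counterparts of IParser::supportsModelV2 and the subgraph queries.
    if not parser.supports_model_v2(model):
        for i in range(parser.num_subgraphs):
            print(f"subgraph {i}: supported={parser.is_subgraph_supported(i)}, "
                  f"nodes={parser.get_subgraph_nodes(i)}")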
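
And a short illustration of the recommended explicit-quantization path: insert an IQuantizeLayer/IDequantizeLayer pair with a user-chosen scale instead of relying on the deprecated calibrator APIs. The tensor shape and scale value below are illustrative only.

    import numpy as np
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network()

    inp = network.add_input("input", trt.float32, (1, 8, 32, 32))  # illustrative shape
    # Per-tensor scale supplied by the user (illustrative value, for example from your
    # own offline calibration).
    scale = network.add_constant((1,), np.array([0.05], dtype=np.float32)).get_output(0)

    q = network.add_quantize(inp, scale)                  # FP32 -> INT8
    dq = network.add_dequantize(q.get_output(0), scale)   # INT8 -> FP32
    # Layers consuming dq's output can now be lowered to INT8 without any calibrator APIs.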

Fixed Issues#

  • The sampleNonZeroPlugin sample failed to build when cross-compiling for L4T. The workaround was to continue building the other samples by modifying samples/Makefile and removing the line containing sampleNonZeroPlugin. This issue has been fixed.

  • The sampleNonZeroPlugin sample did not guarantee CUDA minor version compatibility. That is, if built against a newer CUDA Toolkit release, it may not function properly on older drivers, even within the same major CUDA release family. This issue has been fixed.

  • IPluginRegistry::deregisterLibrary() did not work with plugin shared libraries that define the getPluginCreators() entry point. IPluginRegistry::loadLibrary() was not impacted. The workaround was to deregister the plugins contained in such a library by manually querying the library for getPluginCreators() and invoking IPluginRegistry::deregisterCreator() for each creator retrieved. This issue has been fixed.

  • On A30, some fused MHA (multi-head attention) performance was not optimized yet. This issue has been fixed.

  • When building the TensorRT backend of ONNX Runtime, the prebuilt parser could not be used. This issue has been fixed.

  • When using the Polygraphy engine_from_network API, if both refittable and strip_plan were enabled in the create_config, the final engine weights were not stripped. The workaround was to include only strip_plan in the create_config. This issue has been fixed.

  • TensorRT did not support attention operations for tensors larger than the int32_t maximum. Plugins could be used to work around this issue. This issue has been fixed.

  • The API documentation incorrectly stated that casting to the INT8 format is possible, but this path is not supported; use a QuantizeLinear node instead. This issue has been fixed in the API documentation.

  • When using refit on multi-head attention or if/while loops with explicit quantization, the refit process could have been slow due to the implementation’s memcpyDeviceToHost for the Q/DQ scales. This issue has been fixed.

  • There was an up to 9% performance regression for StableDiffusion VAE networks on A16 and A40 compared to TensorRT 9.2. The workaround was to disable the kNATIVE_INSTANCENORM flag in the ONNX parser or add the --pluginInstanceNorm flag to trtexec. This issue has been fixed.

  • There was a small chance that TensorRT would hang when running on H100 with the r550 CUDA driver when CUDA graphs were used. The workaround was to use the r535 CUDA driver instead or to avoid using CUDA graphs. This issue has been fixed.

  • There was a known issue on H100 that may have led to a GPU hang when running TensorRT with high persistentCache usage. The workaround was to limit the usage to 40% of L2 cache size. This issue has been fixed.

  • There was a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA). This issue has been fixed.

  • There were some issues when running TensorRT-LLM with TensorRT 10.0 with StronglyTyped mode enabled. The workaround was to disable StronglyTyped mode. This issue has been fixed.

  • Running sync/race check with newer Compute Sanitizer on L4T may have hit a hang issue. The workaround was to try an older version of Compute Sanitizer. This issue has been fixed.

  • Hardware forward compatibility (HFC) was broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. The workaround was to only use FP32 mode on L4T Concord or turn off HFC. This issue has been fixed.

  • LSTM networks could fail to build with timing cache enabled. This was observed on one GPU platform and only when building with a cache that had pre-existing entries. The error signature contained:

    Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:operator():1502]
        Internal bug. Please report with reproduction steps.
    

    The workaround was to disable the timing cache or start a fresh one. This issue has been fixed.

Known Issues#

Functional

  • The tensorrt Python metapackage does not pin the version for the Python module dependency tensorrt-cu12. For example, using pip install tensorrt==10.0.1 will install tensorrt-cu12==10.1.0 rather than tensorrt-cu12==10.0.1 as expected. The workaround is to instead specify the package name including the CUDA version, such as pip install tensorrt-cu12==10.0.1. This issue will be fixed in TensorRT 10.2.

  • The Python sample yolo_v3_onnx does not support Python 3.12. Support will be added in 10.2.

  • If TensorRT 8.6 or 9.x was installed using the Python Package Index (PyPI), you cannot upgrade TensorRT to 10.x using PyPI. You must first uninstall TensorRT using pip uninstall tensorrt tensorrt-libs tensorrt-bindings, then reinstall TensorRT using pip install tensorrt. This will remove the previous TensorRT version and install the latest TensorRT 10.x. This step is required because the suffix -cuXX was added to the Python package names, which prevents the upgrade from working properly.

  • Allocated GPU memory during autotuning might not be freed correctly if allocation failed due to inadequate resources, causing build time memory usage to be larger than that of inference time.

  • CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.

  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.

  • Multi-Head Attention fusion might not occur, which can affect performance, if the number of heads is small.

  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.

  • An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.

  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.

    {
        Memory leak errors with dlopen.
        Memcheck:Leak
        match-leak-kinds: definite
        ...
        fun:*dlopen*
        ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes, which quantize the failing layers.

  • Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.

  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.

  • Exclusive padding with kAVERAGE pooling is not supported.

  • The Valgrind tool found a memory leak on L4T with CUDA 12.4 due to a known driver issue. This is expected to be fixed in CUDA 12.6.

  • Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.

  • For the samples python/efficientdet and python/tensorflow_object_detection_api, the ONNX version needs to be manually downgraded to 1.14 in their respective requirements.txt files for the samples to function correctly.

  • There is a known accuracy issue when binding an INT4 tensor as a network output. To work around this, add an IDequantizeLayer before the output.

  • There is a known accuracy issue when the network contains two consecutive GEMV operations (that is, MatrixMultiply with gemmM or gemmN == 1). To work around this issue, try padding the MatrixMultiply input to have dimensions greater than 1.

  • Engine building with weight streaming enabled will fail when the model size is larger than the free device memory size. This issue will be fixed in the next version.

Performance

  • There is an up to 10% performance regression for ConvNext on NVIDIA Orin compared to TensorRT 9.3.

  • There are known performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.

  • There is an up to 4x performance regression for networks containing GridSample ops compared to TensorRT 9.2.

  • Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.

  • Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.

  • There are performance gaps for StableDiffusion networks between Windows and Linux platforms.

  • Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.

  • Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.

  • Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.

  • Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.

  • Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.

  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.

  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.

  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.

  • Up to 5% performance drop for networks using sparsity in FP16 precision.

  • Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.

  • Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this (see the sketch after this list).

  • In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.

  • Weight streaming performance may decrease when you create execution contexts with multiple optimization profiles using external device memory and call setDeviceMemory/setDeviceMemoryV2 before setOptimizationProfileAsync. This issue will be fixed in the next version.
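
A minimal sketch of the BERT-on-L4 builder-config workaround mentioned above (enable FP16, disable INT8); the surrounding builder setup is ordinary boilerplate, and the flag changes are the only relevant part.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Workaround: prefer FP16 kernels and drop INT8 for the affected BERT engines.
    config.set_flag(trt.BuilderFlag.FP16)
    config.clear_flag(trt.BuilderFlag.INT8)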