TensorRT 10.3.0 Release Notes#

These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.

Announcements#

Cross-Platform Engine Support (Experimental): Introduced experimental support for building TensorRT engines on one platform and running them on another through the new setRuntimePlatform API. Currently supports building on Linux x86_64 and running on Windows x86_64. This feature is expected to become production-ready in the next release.

NVIDIA Volta Deprecation: NVIDIA Volta support (GPUs with compute capability 7.0) is deprecated starting with TensorRT 10.0 and will be removed in TensorRT 10.5. Plan migration to supported GPU architectures.

Plugin Deprecations: Version 1 of the ScatterElements plugin has been deprecated in favor of version 2, which implements the IPluginV3 interface. For details, refer to the Deprecated and Removed Features section.

Key Features and Enhancements#

Cross-Platform Compatibility

  • Runtime Platform API (Preview): Added a new setRuntimePlatform API in IBuilderConfig to enable cross-platform engine deployment. This experimental feature allows building engines on Linux x86_64 platforms and executing them on Windows x86_64 platforms, simplifying multi-platform workflows.

    Note

    This feature is experimental and expected to become production-ready in the next TensorRT release.
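
    As a rough illustration, the following C++ sketch shows one way the new API might be used. It is a minimal sketch, not an official recipe: the RuntimePlatform::kWINDOWS_AMD64 enumerator and the surrounding builder boilerplate are assumptions to verify against the TensorRT 10.3 C++ API reference.

        #include "NvInfer.h"

        // Build a serialized engine on Linux x86_64 that targets Windows x86_64 at runtime.
        nvinfer1::IHostMemory* buildForWindows(nvinfer1::IBuilder& builder,
                                               nvinfer1::INetworkDefinition& network)
        {
            nvinfer1::IBuilderConfig* config = builder.createBuilderConfig();
            // Request the experimental cross-platform runtime target.
            config->setRuntimePlatform(nvinfer1::RuntimePlatform::kWINDOWS_AMD64);
            // Serialize the plan; deserialize it later with the Windows x86_64 TensorRT runtime.
            nvinfer1::IHostMemory* plan = builder.buildSerializedNetwork(network, *config);
            delete config;
            return plan;
        }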

Performance Optimizations

  • GEMM Build Time Improvements: Significantly improved engine build time for networks with GEMMs that have large weight constants, reducing deployment time for transformer-based models and LLMs.

  • Normalization Fusion: Enhanced performance of Normalization layer and FP8 quantization fusion, improving efficiency for models using normalization operations.

Hardware and Precision Support

  • FP8 Convolution on Ada Lovelace: Added FP8 convolution support for NVIDIA Ada Lovelace GPUs (SM 89) with the same capabilities and limitations as NVIDIA Hopper, expanding FP8 deployment options.

Plugin Enhancements

  • Aliased Plugin I/O: Added capability to alias input-output pairs for TensorRT plugins implementing IPluginV3. Plugins must implement the IPluginV3OneBuildV2 build capability interface, and PreviewFeature::kALIASED_PLUGIN_IO_10_03 must be enabled to use this memory-efficient feature.
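
    As shown in the minimal C++ sketch below (assuming the setPreviewFeature API on IBuilderConfig), the preview feature is enabled on the builder configuration before the engine is built.

        #include "NvInfer.h"

        // Opt in to aliased plugin I/O before building an engine that contains
        // IPluginV3 plugins declaring aliased input/output pairs.
        void enableAliasedPluginIO(nvinfer1::IBuilderConfig& config)
        {
            // The plugins themselves must also implement the IPluginV3OneBuildV2 build capability.
            config.setPreviewFeature(nvinfer1::PreviewFeature::kALIASED_PLUGIN_IO_10_03, true);
        }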

Bug Fixes and Performance

  • Hopper GPU Stability: Fixed an intermittent hanging issue when running multiple execution contexts in parallel on Hopper GPUs, improving multi-stream reliability.

  • FP8 Convolution Accuracy: Fixed an accuracy issue when running ResNet18/ResNet50 with FP8 convolutions, ensuring correct inference results.

  • TopK Stability: Fixed a divide-by-zero error in TopK operations when K equals 0 and the reduction dimension was large.

  • Diffusion Networks: Resolved an issue where engine building might fail for Diffusion networks without the --stronglyTyped option.

  • Vision Model Performance: Fixed an up to 10% performance regression for ConvNext on NVIDIA Orin and an up to 4x regression for networks containing GridSample operations.

  • Library Size Optimization: Fixed a 10 MB increase in the libnvinfer_lean.so library size.

  • Python 3.12 Support: Enabled Python 3.12 support for multiple samples including detectron2, efficientdet, efficientnet, and others.

  • Slice Layer Enhancement: Fixed ISliceLayer to properly handle constant tensor inputs with kCLAMP or kFILL modes.

  • Static Linking: Resolved linker relocation issues when compiling samples with static linking.

API Enhancements

Compatibility#

Limitations#

  • There are no optimized FP8 Convolutions for Group Convolutions and Depthwise Convolutions. Therefore, INT8 is still recommended for ConvNets containing these convolution ops.

  • The FP8 convolutions only support input/output channels that are multiples of 16. Otherwise, TensorRT will fall back to non-FP8 convolutions.

  • The accumulation dtype for the batched GEMMs in the FP8 MHA must be FP32.

    • This can be achieved by adding Cast (to FP32) ops before the batched GEMM and Cast (to FP16) ops after the batched GEMM.

    • Alternatively, you can convert your ONNX model using TensorRT Model Optimizer, which adds the Cast ops automatically.

  • There cannot be any pointwise operations between the first batched GEMM and the softmax inside FP8 MHAs (for example, having an attention mask). This will be improved in future TensorRT releases.

  • The FP8 MHA fusions only support head sizes that are multiples of 16. If the MHA has a head size that is not a multiple of 16, do not add Q/DQ ops in the MHA so that it falls back to the FP16 MHA for better performance.

  • On QNX, networks that are segmented into a large number of DLA loadables can fail during inference.

  • The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless you manually merge the transposes in the model definition in advance.

  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.

  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops (for example, opset 17 for LayerNormalization or opset 18 for GroupNormalization). Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.

  • Weight streaming mainly supports GEMM-based networks like Transformers for now. Convolution-based networks may have only a few weights that can be streamed.

  • The INT8 implicit quantization and calibrator APIs, including dynamicRangeIsSet, CalibrationAlgoType, IInt8Calibrator, IInt8EntropyCalibrator, IInt8EntropyCalibrator2, IInt8MinMaxCalibrator, setInt8Calibrator, getInt8Calibrator, setCalibrationProfile, getCalibrationProfile, setDynamicRange, getDynamicRangeMin, getDynamicRangeMax, and getTensorsWithDynamicRange, are deprecated; they may not give optimal performance and accuracy. Use INT8 explicit quantization instead (a minimal sketch follows this list).

  • When two convolutions with INT8-QDQ and residual add share the same weight, constant weight fusion does not occur. Make a copy of the shared weight for better performance.
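
A minimal C++ sketch of the explicit-quantization alternative referenced above, assuming the addQuantize/addDequantize APIs on INetworkDefinition and using a placeholder per-tensor scale (real scales come from an offline calibration or quantization toolkit):

    #include "NvInfer.h"

    // Insert a per-tensor INT8 Q/DQ pair in front of a layer instead of relying on the
    // deprecated implicit-quantization calibrators.
    nvinfer1::ITensor* addInt8QDQ(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input)
    {
        // The scale must be a build-time constant and must outlive engine building.
        static float scaleValue = 0.05f;  // placeholder per-tensor scale
        nvinfer1::Dims scalarDims{};
        scalarDims.nbDims = 0;  // 0-D constant for per-tensor quantization
        nvinfer1::Weights scaleWeights{nvinfer1::DataType::kFLOAT, &scaleValue, 1};
        nvinfer1::ITensor* scale = network.addConstant(scalarDims, scaleWeights)->getOutput(0);

        // Quantize to INT8 and immediately dequantize; TensorRT fuses the Q/DQ pair with the consumer layer.
        nvinfer1::IQuantizeLayer* q = network.addQuantize(input, *scale);
        nvinfer1::IDequantizeLayer* dq = network.addDequantize(*q->getOutput(0), *scale);
        return dq->getOutput(0);
    }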

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 10.3 will be retained until 8/2025.

  • APIs deprecated in TensorRT 10.2 will be retained until 7/2025.

  • APIs deprecated in TensorRT 10.1 will be retained until 5/2025.

  • APIs deprecated in TensorRT 10.0 will be retained until 3/2025.

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • Deprecated NVIDIA Volta support (GPUs with compute capability 7.0) starting with TensorRT 10.0. Volta support will be removed in TensorRT 10.5.

  • Deprecated version 1 of ScatterElements plugin. It is superseded by version 2, which implements the IPluginV3 interface.

Fixed Issues#

  • Fixed an intermittent hanging issue when running multiple execution contexts in parallel on Hopper GPUs.

  • There was a known accuracy issue when running ResNet18/ResNet50 with FP8 convolutions. This issue has been fixed.

  • There could be a divide-by-zero error in TopK when K equals 0 and the reduction dimension was large. This issue has been fixed.

  • There was a known issue that engine building might fail for Diffusion networks. The workaround was to enable the --stronglyTyped option.

  • There was an up to 10% performance regression for ConvNext on NVIDIA Orin compared to TensorRT 9.3.

  • There was an up to 4x performance regression for networks containing GridSample ops compared to TensorRT 9.2.

  • The size of the libnvinfer_lean.so library had increased by 10 MB. This issue has been fixed.

  • Python 3.12 support is enabled for samples including detectron2, efficientdet, efficientnet, engine_refit_onnx_bidaf, introductory_parser_samples, network_api_pytorch_mnist, onnx_custom_plugin, onnx_packnet, sample_weight_stripping, simple_progress_monitor, tensorflow_object_detection_api, and yolo_v3_onnx.

  • When compiling samples with static linking, if the error message /usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax was shown, the workaround was to add -Wl,--no-relax to the linking steps in samples/Makefile.config. This issue has been fixed.

  • nvinfer1::ISliceLayer with modes nvinfer1::SampleMode::kCLAMP or nvinfer1::SampleMode::kFILL (ONNX equivalent being a Pad op with either constant or edge mode) could break during engine compilation if the slice input was a constant tensor. This issue has been fixed so slice can handle constant inputs.
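
    A minimal C++ sketch of the now-working pattern, assuming the addSlice/addConstant network APIs and placeholder shapes and values:

        #include "NvInfer.h"

        // A slice in kCLAMP mode reading a constant tensor, the pattern that previously
        // failed during engine compilation. Out-of-bounds reads are clamped to the edge,
        // which is how the ONNX parser maps a Pad op in "edge" mode.
        nvinfer1::ITensor* clampSliceOfConstant(nvinfer1::INetworkDefinition& network)
        {
            // Constant 1x4 input; the weights memory must stay valid until the engine is built.
            static float values[4] = {1.f, 2.f, 3.f, 4.f};
            nvinfer1::Weights w{nvinfer1::DataType::kFLOAT, values, 4};
            nvinfer1::ITensor* constant = network.addConstant(nvinfer1::Dims2{1, 4}, w)->getOutput(0);

            // Request two elements past the end of the second axis (size 6 from a length-4 input);
            // kCLAMP replicates the last value instead of reading out of bounds.
            nvinfer1::ISliceLayer* slice = network.addSlice(*constant, nvinfer1::Dims2{0, 0},
                                                            nvinfer1::Dims2{1, 6}, nvinfer1::Dims2{1, 1});
            slice->setMode(nvinfer1::SampleMode::kCLAMP);
            return slice->getOutput(0);
        }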

Known Issues#

Functional

  • There is a known engine build failure if FP8-Q/DQ ops are added before a convolution op whose input/output channels are not multiples of 16. Remove the FP8-Q/DQ ops for these convolutions to work around this issue.

  • The Python samples non_zero_plugin and python_plugin do not support Python 3.12. Support will be added in 10.4; the issue is already fixed in the OSS 10.3 release.

  • If TensorRT 8.6 or 9.x was installed using the Python Package Index (PyPI), you cannot upgrade TensorRT to 10.x using PyPI. You must first uninstall TensorRT using pip uninstall tensorrt tensorrt-libs tensorrt-bindings, then reinstall TensorRT using pip install tensorrt. This will remove the previous TensorRT version and install the latest TensorRT 10.x. This step is required because the suffix -cuXX was added to the Python package names, which prevents the upgrade from working properly.

  • CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.

  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.

  • Multi-Head Attention fusion might not occur when the number of heads is small, which can affect performance.

  • An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.

  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress these issues, provide a Valgrind suppression file with the following contents and add the option --keep-debuginfo=yes to the Valgrind command line.

    {
        Memory leak errors with dlopen.
        Memcheck:Leak
        match-leak-kinds: definite
        ...
        fun:*dlopen*
        ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.

  • Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.

  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.

  • Exclusive padding with kAVERAGE pooling is not supported.

  • The Valgrind tool found a memory leak on L4T with CUDA 12.4 due to a known driver issue. This is expected to be fixed in CUDA 12.6.

  • Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.

  • A known accuracy issue exists when the network contains two consecutive GEMV operations (MatrixMultiply with gemmM or gemmN == 1). To work around this issue, try padding the MatrixMultiply input to have dimensions greater than 1.

  • A known accuracy issue exists when binding an INT4 tensor as a network output. To work around this, add an IDequantizeLayer before the output.

  • With CUDA 12.5 on Windows, fcPlugin (CustomFCPluginDynamic) may result in CUDA errors on certain GPUs.

  • The libnvonnxparser_static.a static library included within AArch64 tar packages, which includes the SBSA and JetPack releases, is incorrectly constructed and contains a mix of x86_64 and AArch64 object files. This prevents the ONNX parser static library from being used on non-x86 platforms. This issue will be fixed in the next release.

  • Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refit engines when convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.

  • A known accuracy issue exists in network patterns fc-xelu-bias and conv-xelu-bias (when bias operation comes after xelu).

  • The size of the compilation cache may increase or decrease slightly for the same network.

  • When installing TensorRT using the Python wheels hosted on PyPI, the loader cannot find the lean library while deserializing version-compatible engines. You can work around this by manually adding the wheel installation directory to your loader path. For example, on Linux, you might run the following:

    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/python3.10/dist-packages/tensorrt_lean_libs/
    

Performance

  • The BERT demo is unsupported in Python 3.11 environments. This issue will be fixed in the TensorRT 10.5 release.

  • Up to 45% build time regression for mamba_370m in FP16 precision and OOTB mode on NVIDIA Ada Lovelace GPUs compared to TensorRT 10.2.

  • Up to 15% CPU memory usage regression for mbart-cnn/mamba-370m in FP16 precision and OOTB mode on NVIDIA Ada Lovelace GPUs compared to TensorRT 10.2.

  • Up to 6% performance regression for BERT/Megatron networks with INT8 QDQ compared to TensorRT 10.2 on Ampere and Hopper GPUs.

  • Up to 6% performance regression for BERT/Megatron networks in FP16 precision compared to TensorRT 10.2 for BS1 and Seq128 on H100 GPUs.

  • Up to 6% performance regression for Bidirectional LSTM in FP16 precision on H100 GPUs compared to TensorRT 10.2.

  • Up to 8% performance regression for PilotNet networks in FP16 precision on Orin and H100 GPUs compared to TensorRT 10.2, and up to 560% regression in TF32/FP32 precisions.

  • Up to 300% performance regression for networks containing RandomUniform ops compared to TensorRT 8.2.

  • Up to 25% performance regression when running TensorRT-LLM without the attention plugin. The current recommendation is to always enable the attention plugin when using TensorRT-LLM.

  • There are known performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.

  • Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.

  • Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.

  • Up to 40-second increase in engine build time for BART networks on NVIDIA Hopper GPUs.

  • Up to 20-second increase in engine build time for some large language models (LLMs) on NVIDIA Ampere GPUs.

  • Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.

  • Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.

  • Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.

  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.

  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.

  • Up to 5% performance drop for networks using sparsity in FP16 precision.

  • Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.

  • Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this.

  • In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.