TensorRT 10.2.0 Release Notes#

These Release Notes apply to x86 Linux and Windows users, and ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux. This release includes several fixes from the previous TensorRT releases and additional changes.

Announcements#

NVIDIA Volta Deprecation: NVIDIA Volta support (GPUs with compute capability 7.0) is deprecated starting with TensorRT 10.0 and will be removed in TensorRT 10.5. Plan migration to supported GPU architectures.

Key Features and Enhancements#

Quantization and Precision Enhancements

  • FP8 Convolution Support: Added support for standard (non-grouped, non-depthwise) FP8 convolutions on Hopper GPUs, expanding precision options for inference optimization.

  • Extended FP8 Multi-Head Attention (MHA) Fusion: Added support for FP8 MHA fusion for sequence lengths greater than 512 on Hopper GPUs, enabling efficient processing of longer sequences in transformer models.

Performance Optimizations

  • Stable Diffusion Improvements: Improved InstanceNorm and GroupNorm fusions for Stable Diffusion models, enhancing generative AI performance.

  • Memory Optimization: Improved DRAM utilization for LayerNorm, pointwise operations, and data movement kernels (for example, Concats, Slices, Reshapes, Transposes) on GPUs with HBM memory, reducing memory bottlenecks.

API Enhancements

  • Fine-Grained Refit Control: Added new APIs in INetworkDefinition for fine-grained control of refittable weights:

    • markWeightsRefittable to mark weights as refittable

    • unmarkWeightsRefittable to unmark weights as refittable

    • areWeightsMarkedRefittable to query if a weight is marked as refittable

    This fine-grained refit control is only valid when the new kREFIT_INDIVIDUAL builder flag is used during engine build. It also works with kSTRIP_PLAN, enabling the construction of a weight-stripped engine that can be updated from a fine-tuned checkpoint when all weights are marked as refittable in the fine-tuned checkpoint. A usage sketch follows this list.

  • API Change Tracking: To view API changes between releases, refer to the TensorRT GitHub repository and use the compare tool.
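
The following is a minimal C++ sketch of the fine-grained refit workflow described above. It assumes a builder, network, and config created elsewhere (for example, with the network populated by the ONNX parser); the weights name "fc1.weight" is hypothetical and must match a weights name present in your network.

    #include <cassert>
    #include "NvInfer.h"

    // Build an engine with fine-grained refit enabled. Assumes the builder,
    // network, and config were created and populated elsewhere.
    nvinfer1::IHostMemory* buildWithFineGrainedRefit(
        nvinfer1::IBuilder& builder,
        nvinfer1::INetworkDefinition& network,
        nvinfer1::IBuilderConfig& config)
    {
        char const* weightsName = "fc1.weight";  // hypothetical weights name

        // Mark only the weights that should remain refittable after build.
        network.markWeightsRefittable(weightsName);
        assert(network.areWeightsMarkedRefittable(weightsName));

        // Fine-grained refit is only honored with kREFIT_INDIVIDUAL.
        config.setFlag(nvinfer1::BuilderFlag::kREFIT_INDIVIDUAL);

        // Optionally combine with kSTRIP_PLAN to build a weight-stripped
        // engine that can later be refit from a fine-tuned checkpoint:
        // config.setFlag(nvinfer1::BuilderFlag::kSTRIP_PLAN);

        return builder.buildSerializedNetwork(network, config);
    }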

Compatibility#

Limitations#

  • There are no optimized FP8 Convolutions for Group Convolutions and Depthwise Convolutions. Therefore, INT8 is still recommended for ConvNets containing these convolution ops.

  • The accumulation dtype for the batched GEMMs in the FP8 MHA must be FP32.

    • This can be achieved by adding Cast (to FP32) ops before the batched GEMM and Cast (to FP16) ops after the batched GEMM (see the sketch at the end of this section).

    • Alternatively, you can convert your ONNX model using TensorRT Model Optimizer, which adds the Cast ops automatically.

  • FP8 MHAs cannot contain pointwise operations between the first batched GEMM and the softmax, such as the addition of an attention mask. This will be improved in future TensorRT releases.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.

  • The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless the user merges the transposes manually in the model definition in advance.

  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.

  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops, such as opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.

  • The kREFIT and kREFIT_IDENTICAL builder flags have performance regressions compared with non-refittable engines when convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.

  • Weight streaming mainly supports GEMM-based networks like Transformers for now. Convolution-based networks may have only a few weights that can be streamed.

  • The deprecated INT8 implicit quantization and calibrator APIs, including dynamicRangeIsSet, CalibrationAlgoType, IInt8Calibrator, IInt8EntropyCalibrator, IInt8EntropyCalibrator2, IInt8MinMaxCalibrator, setInt8Calibrator, getInt8Calibrator, setCalibrationProfile, getCalibrationProfile, setDynamicRange, getDynamicRangeMin, getDynamicRangeMax, and getTensorsWithDynamicRange, may not give optimal performance and accuracy. As a workaround, use INT8 explicit quantization instead.
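
As a sketch of the Cast-based workaround for FP32 accumulation mentioned above, the following C++ snippet wraps a batched GEMM in Cast layers using the INetworkDefinition API. The operand tensors a and b and the transpose settings are illustrative, not taken from a real attention implementation.

    #include "NvInfer.h"

    // Surround a batched GEMM with Casts so it accumulates in FP32, then
    // cast the result back to FP16 for the rest of the attention block.
    nvinfer1::ITensor* fp32AccumulatedGemm(nvinfer1::INetworkDefinition& network,
                                           nvinfer1::ITensor& a,
                                           nvinfer1::ITensor& b)
    {
        using namespace nvinfer1;

        // Cast (to FP32) before the batched GEMM.
        ITensor* a32 = network.addCast(a, DataType::kFLOAT)->getOutput(0);
        ITensor* b32 = network.addCast(b, DataType::kFLOAT)->getOutput(0);

        IMatrixMultiplyLayer* gemm = network.addMatrixMultiply(
            *a32, MatrixOperation::kNONE, *b32, MatrixOperation::kNONE);

        // Cast (to FP16) after the batched GEMM.
        return network.addCast(*gemm->getOutput(0), DataType::kHALF)->getOutput(0);
    }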

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 10.2 will be retained until 7/2025.

  • APIs deprecated in TensorRT 10.1 will be retained until 5/2025.

  • APIs deprecated in TensorRT 10.0 will be retained until 3/2025.

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • Deprecated NVIDIA Volta support (GPUs with compute capability 7.0) starting with TensorRT 10.0. Volta support will be removed in TensorRT 10.5.

Fixed Issues#

  • Fixed an issue where engine building with weight streaming enabled would fail when the model size exceeded the available device memory.

  • Fixed an issue of decreased weight streaming performance when creating execution contexts with multiple optimization profiles using external device memory and calling setDeviceMemory/setDeviceMemoryV2 before setOptimizationProfileAsync.

  • Fixed an issue with header files in the include directory for Windows being encoded as UTF-16 instead of UTF-8.

  • In the previous TensorRT 10.0 and 10.1 releases, the tensorrt Python metapackage did not pin the version of its tensorrt-cu12 dependency, which caused the latest TensorRT version to always be installed. This issue has been fixed.

  • Fixed an issue with large models where TensorRT did not free memory after an OOM error, causing tactics that should fit in memory to also fail.

  • Engine building should now work for networks containing boolean tensors with implicitly data-dependent shapes.

  • The ONNX version in requirements.txt for the python/efficientdet and python/tensorflow_object_detection_api samples was incompatible with the samples; the previous workaround was to pin the ONNX version to 1.14.0. This issue has been fixed.

Known Issues#

Functional

  • nvinfer1::ISliceLayer with modes nvinfer1::SampleMode::kCLAMP or nvinfer1::SampleMode::kFILL (the ONNX equivalent being a Pad op with either constant or edge mode) may break during engine compilation if the slice input is a constant tensor. To work around this issue, use TensorRT 10.1. This will be addressed in future TensorRT releases.

  • There is a known accuracy issue when running ResNet18/ResNet50 with FP8 convolutions. This will be fixed in the next TensorRT version.

  • The Python samples detectron2, efficientdet, efficientnet, engine_refit_onnx_bidaf, introductory_parser_samples, network_api_pytorch_mnist, onnx_custom_plugin, onnx_packnet, sample_weight_stripping, simple_progress_monitor, tensorflow_object_detection_api, and yolo_v3_onnx do not support Python 3.12. Support will be added in TensorRT 10.3; the issue is already fixed in the OSS 10.2 release.

  • If TensorRT 8.6 or 9.x was installed using the Python Package Index (PyPI), you cannot upgrade TensorRT to 10.x using PyPI. You must first uninstall TensorRT using pip uninstall tensorrt tensorrt-libs tensorrt-bindings, then reinstall TensorRT using pip install tensorrt. This will remove the previous TensorRT version and install the latest TensorRT 10.x. This step is required because the suffix -cuXX was added to the Python package names, which prevents the upgrade from working properly.

  • CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.

  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.

  • Multi-Head Attention fusion might not occur, which can affect performance, if the number of heads is small.

  • An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.

  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress these errors, provide a Valgrind suppression file with the following contents and add the option --keep-debuginfo=yes to the Valgrind command line.

    {
        Memory leak errors with dlopen.
        Memcheck:Leak
        match-leak-kinds: definite
        ...
        fun:*dlopen*
        ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
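
    For example, assuming the suppressions above are saved to a hypothetical file named trt.supp, the tool could be invoked as: valgrind --leak-check=full --keep-debuginfo=yes --suppressions=trt.supp <application>.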
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a "could not find any implementation" error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.

  • Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.

  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.

  • Exclusive padding with kAVERAGE pooling is not supported.

  • The Valgrind tool found a memory leak on L4T with CUDA 12.4 due to a known driver issue. This is expected to be fixed in CUDA 12.6.

  • Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.

  • A known accuracy issue exists when the network contains two consecutive GEMV operations (MatrixMultiply with gemmM or gemmN == 1). To work around this issue, try padding the MatrixMultiply input to have dimensions greater than 1.

  • The size of the libnvinfer_lean.so library has increased by 10 MB. This issue will be resolved in TensorRT 10.3.

  • When compiling samples with static linking, if the error message /usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax is shown, then add -Wl,--no-relax to the linking steps in samples/Makefile.config.

  • There is a known accuracy issue when binding an INT4 tensor as a network output. To work around this, add an IDequantizeLayer before the output (see the sketch after this list).

  • With CUDA 12.5 on Windows, fcPlugin (CustomFCPluginDynamic) may result in CUDA errors on certain GPUs.

  • There may be divide-by-zero errors for TopK when K equals 0 and the reduction dimension is large.
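
As a sketch of the INT4 output workaround referenced above, the following C++ snippet inserts an IDequantizeLayer before marking the tensor as a network output. The tensors int4Tensor and scale are assumed to already exist in the network; all names are illustrative.

    #include "NvInfer.h"

    // Dequantize an INT4 tensor to FP16 before marking it as a network
    // output, avoiding the known INT4 output accuracy issue.
    void markDequantizedOutput(nvinfer1::INetworkDefinition& network,
                               nvinfer1::ITensor& int4Tensor,
                               nvinfer1::ITensor& scale)
    {
        nvinfer1::IDequantizeLayer* dq =
            network.addDequantize(int4Tensor, scale, nvinfer1::DataType::kHALF);
        network.markOutput(*dq->getOutput(0));
    }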

Performance

  • There is an up to 20% CPU memory usage increase compared to TensorRT 10.1 when building engines on A10 GPUs.

  • There is an up to 25% performance regression when running TensorRT-LLM without the attention plugin. The current recommendation is to always enable the attention plugin when using TensorRT-LLM.

  • There is an up to 24% performance regression for EfficientNets, StableDiffusion CLIP, and StableDiffusion UNet on RTX 3070 GPUs on Windows.

  • There is an up to 10% performance regression for ConvNext on NVIDIA Orin compared to TensorRT 9.3.

  • There are known performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.

  • There is an up to 4x performance regression for networks containing GridSample ops compared to TensorRT 9.2.

  • Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.

  • Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.

  • Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.

  • Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.

  • Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.

  • Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.

  • Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.

  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.

  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.

  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.

  • Up to 5% performance drop for networks using sparsity in FP16 precision.

  • Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.

  • Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this (see the sketch after this list).

  • In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.
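
As a sketch of the builder-config workaround referenced above, assuming config points to the IBuilderConfig used to build the affected BERT engine:

    // Enable FP16 and disable INT8 to avoid the BERT INT8 regression on L4 GPUs.
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
    config->clearFlag(nvinfer1::BuilderFlag::kINT8);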