TensorRT 10.12.0 Release Notes#

These Release Notes apply to x86 Linux and Windows users, and ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux. This release includes several fixes from the previous TensorRT releases and additional changes.

Announcements#

Breaking packaging changes that may require updates to your build and deployment scripts:

  • Static libraries on Linux (libnvinfer_static.a, libnvonnxparser_static.a, etc.) are deprecated starting with TensorRT 10.11 and will be removed in TensorRT 11.0. Migrate to shared libraries.

Weak typing APIs deprecated: APIs related to weak typing have been deprecated. TensorRT will use strong typing exclusively in the future. Users should convert their engines to use strong typing. Refer to Strong Typing vs Weak Typing for migration guidance.
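
A minimal Python sketch of opting in to strong typing today (model construction omitted; this assumes the standard tensorrt bindings):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # In a strongly typed network, tensor data types come from the network
    # definition itself (for example, cast or Q/DQ layers) rather than from
    # per-layer precision hints or builder precision flags.
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
    )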

Plugin deprecations: Multiple plugin versions have been deprecated in this release. Refer to the Deprecated and Removed Features section for the complete list and migration alternatives.

Key Features and Enhancements#

Quantization Enhancements

  • MXFP8 Quantization Support: Added support for MXFP8 quantization, which performs block quantization by quantizing across 32 high-precision elements to produce 32 quantized output values and one E8M0 scaling factor.
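
    As a rough illustration of the block-quantization arithmetic described above (an illustrative NumPy sketch only, not TensorRT's internal implementation; the helper name and the E4M3 range constant are assumptions):

      import numpy as np

      def mxfp8_block_quantize(x, block=32, fp8_max=448.0):
          """Each group of 32 high-precision values shares one power-of-two
          (E8M0-style) scaling factor; the scaled values are stored as FP8."""
          x = np.asarray(x, dtype=np.float32).reshape(-1, block)
          amax = np.abs(x).max(axis=1, keepdims=True)
          # Choose a power-of-two scale so the scaled block fits the FP8 E4M3 range.
          exponent = np.ceil(np.log2(np.maximum(amax, np.finfo(np.float32).tiny) / fp8_max))
          scale = np.exp2(exponent)
          quantized = np.clip(x / scale, -fp8_max, fp8_max)
          return quantized, scale  # 32 quantized values and 1 scale per block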

Developer Tools and Debugging

  • Enhanced Debug Tensor Feature: Extended the debug tensor feature to allow marking all unfused tensors as debug tensors without preventing fusion optimizations, making it easier to mark tensors of interest. trtexec uses this feature to support dumping intermediate tensors in summary, NumPy, string, and raw data formats (see the sketch after this list).

  • Distributive Independence Determinism: Introduced the distributive independence feature to support determinism across the distributive axis of the output tensor. If some inputs are identical across the distributive axis, the corresponding outputs are guaranteed to be identical. Refer to the Distributive Independence Determinism section for the definition of the distributive axis for different layers.
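
    The debug tensor sketch referenced above, in Python (a hedged sketch assuming the existing mark_debug/IDebugListener APIs; verify the exact signatures against the Python API reference):

      import tensorrt as trt

      class PrintingListener(trt.IDebugListener):
          # Invoked for each marked debug tensor when its value becomes available.
          def process_debug_tensor(self, addr, location, type, shape, name, stream):
              print(f"debug tensor {name}: dtype={type}, shape={shape}")
              return True

      # At build time, mark the tensor of interest (network/tensor come from your build code):
      #   network.mark_debug(tensor)
      # At run time, attach the listener and enable the tensor's debug state:
      #   context.set_debug_listener(PrintingListener())
      #   context.set_tensor_debug_state(tensor.name, True)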

Samples

  • Refactored Python Samples: Introduced two refactored Python samples with cleaner code structure and comprehensive documentation:

    • 1_run_onnx_with_tensorrt: Demonstrates ONNX model conversion with performance comparison

    • 2_construct_network_with_layer_apis: Demonstrates network construction using TensorRT Layer APIs for LSTM networks

Bug Fixes and Performance

  • DLA Accuracy Fix: Fixed an issue where using a batch size of 4096 could have caused accuracy degradation on DLA for some networks.

  • Loop Output Handling: Fixed an issue where networks with ILoopOutputLayers that defined scan outputs but were not counted loops were incorrectly rejected even when their concatenation length values were set.

  • Swin Transformer Stability: Fixed an intermittent crash issue when running Swin Transformer models.

  • Quantized MatMul Fusion: Fixed an issue where engines could not build for quantized MatMul layers that could be horizontally fused when the quantization scales were per-tensor.

  • Blackwell GPU Accuracy: Fixed accuracy issues on Conv layers in the SDXL network on NVIDIA B200 and certain networks on NVIDIA HGX H20.

  • Deterministic Output: Fixed an issue where different batch sizes could generate different outputs on H20 and RTX5080 GPUs for OSS demoBERT FP16 inference. This can now be resolved using the --distributive_independence flag.
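
    For example, a trtexec invocation using the flag mentioned above (model path hypothetical):

      trtexec --onnx=demo_bert.onnx --fp16 --distributive_independence --saveEngine=demo_bert.engine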

API Enhancements

Compatibility#

Limitations#

  • When a Shuffle is added to the output, an additional dimension is appended as the inner-most dimension. For the NCHW32 format, the index of the C dimension then changes; TensorRT does not know which dimension is the C dimension and can only infer it using heuristics.

  • Unfused tensor dump may fail in trtexec when the tensor name is long and filename length exceeds the OS limitation; this will be fixed in a future TensorRT release.

  • In some rare cases, FP8 MHA on SM90 might have accuracy issues with sequence length < 256.

  • There are no optimized FP8 Convolutions for Group Convolutions and Depthwise Convolutions; therefore, INT8 is still recommended for ConvNets containing these convolution ops.

  • The FP8 Convolutions only support input/output channels that are multiples of 16; otherwise, TensorRT will fall back to non-FP8 convolutions.

  • The FP8 Convolutions do not support kernel sizes larger than 32 (for example, 7x7 convolutions); FP16 or FP32 fallback kernels will be used with suboptimal performance. Therefore, do not add FP8 Q/DQ ops before Convolutions with large kernel sizes for better performance.

  • There cannot be any pointwise operations between the first batched GEMM and the softmax inside FP8 MHAs (for example, having an attention mask). This will be improved in future TensorRT releases.

  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops (for example, opset 17 for LayerNormalization or opset 18 for GroupNormalization). Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers; see the export sketch after this list.

  • When building the nonZeroPlugin sample on Windows, you might need to modify the CUDA version specified in the BuildCustomizations paths in the vcxproj file to match the installed version of CUDA.

  • The weights used in INT4 weights-only quantization (WoQ) cannot be refitted.

  • The high-precision weights used in FP4 double quantization are not refittable.

  • Python samples do not support Python 3.13; only the Python bindings currently support Python 3.13.

  • Loops with scan outputs (ILoopOutputLayer with LoopOutput property being either LoopOutput::kCONCATENATE or LoopOutput::kREVERSE) must have the number of iterations set, that is, must have an ITripLimitLayer with TripLimit::kCOUNT. This requirement has always been present, but is now explicitly enforced instead of quietly having undefined behavior.
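
    A minimal Python sketch of a counted loop with a scan output that satisfies this requirement (shapes and names are illustrative):

      import numpy as np
      import tensorrt as trt

      logger = trt.Logger(trt.Logger.WARNING)
      builder = trt.Builder(logger)
      network = builder.create_network(
          1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
      )

      x = network.add_input("x", trt.float32, (1, 8))
      count = network.add_constant((), trt.Weights(np.array([4], dtype=np.int32))).get_output(0)

      loop = network.add_loop()
      loop.add_trip_limit(count, trt.TripLimit.COUNT)  # makes this a counted loop
      rec = loop.add_recurrence(x)
      body = network.add_elementwise(rec.get_output(0), rec.get_output(0),
                                     trt.ElementWiseOperation.SUM)
      rec.set_input(1, body.get_output(0))

      # The kCONCATENATE scan output is valid because the trip limit above is kCOUNT.
      scan = loop.add_loop_output(body.get_output(0), trt.LoopOutput.CONCATENATE, 0)
      scan.set_input(1, count)
      network.mark_output(scan.get_output(0))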
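
    Separately, for the normalization-layer guidance above, a hedged PyTorch export sketch targeting opset 17 (the model is a toy example):

      import torch
      import torch.nn as nn

      # With opset 17+, LayerNorm is exported as the ONNX LayerNormalization
      # function op instead of being decomposed into primitive ops.
      model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16)).eval()
      example_input = torch.randn(1, 16)

      torch.onnx.export(model, example_input, "model_opset17.onnx", opset_version=17)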

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 10.12 will be retained until 6/2026.

  • APIs deprecated in TensorRT 10.11 will be retained until 5/2026.

  • APIs deprecated in TensorRT 10.10 will be retained until 4/2026.

  • APIs deprecated in TensorRT 10.9 will be retained until 3/2026.

  • APIs deprecated in TensorRT 10.8 will be retained until 2/2026.

  • APIs deprecated in TensorRT 10.7 will be retained until 12/2025.

  • APIs deprecated in TensorRT 10.6 will be retained until 11/2025.

  • APIs deprecated in TensorRT 10.5 will be retained until 10/2025.

  • APIs deprecated in TensorRT 10.4 will be retained until 9/2025.

  • APIs deprecated in TensorRT 10.3 will be retained until 8/2025.

  • APIs deprecated in TensorRT 10.2 will be retained until 7/2025.

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • Deprecated APIs related to weak typing. In the future, TensorRT will use strong typing exclusively. We recommend that users convert their engines to use strong typing (refer to Strong Typing vs Weak Typing). The following APIs have been deprecated:

    Table 1 Deprecated APIs#

    C++ API                                          | Python API
    ILayer::setPrecision                             | ILayer.precision
    ILayer::precisionIsSet                           | ILayer.precision_is_set
    ILayer::resetPrecision                           | ILayer.reset_precision
    ILayer::setOutputType                            | ILayer.set_output_type
    ILayer::outputTypeIsSet                          | ILayer.output_type_is_set
    ILayer::resetOutputType                          | ILayer.reset_output_type
    ITensor::setType                                 | ITensor.dtype
    BuilderFlag::kOBEY_PRECISION_CONSTRAINTS         | BuilderFlag.OBEY_PRECISION_CONSTRAINTS
    BuilderFlag::kPREFER_PRECISION_CONSTRAINTS       | BuilderFlag.PREFER_PRECISION_CONSTRAINTS
    BuilderFlag::kSTRICT_TYPES                       | BuilderFlag.STRICT_TYPES
    BuilderFlag::kINT4                               | BuilderFlag.INT4
    BuilderFlag::kFP4                                | BuilderFlag.FP4
    BuilderFlag::kINT8                               | BuilderFlag.INT8
    BuilderFlag::kFP8                                | BuilderFlag.FP8
    BuilderFlag::kFP16                               | BuilderFlag.FP16
    BuilderFlag::kBF16                               | BuilderFlag.BF16
    NetworkDefinitionCreationFlag::kSTRONGLY_TYPED   | NetworkDefinitionCreationFlag.STRONGLY_TYPED
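
    As an illustration of the migration (a hedged sketch; the helper name to_fp16 is a placeholder, not a TensorRT API), per-layer precision hints are replaced by explicit data types in the network:

      import tensorrt as trt

      # Weak typing (deprecated): precision is expressed as hints/flags, e.g.
      #   config.set_flag(trt.BuilderFlag.FP16)
      #   layer.precision = trt.float16
      #   layer.set_output_type(0, trt.float16)
      #
      # Strong typing: the data type is part of the network itself. Assuming
      # `network` was created with NetworkDefinitionCreationFlag.STRONGLY_TYPED
      # and `x` is an FP32 network tensor, an explicit cast pins the type:
      def to_fp16(network: trt.INetworkDefinition, x: trt.ITensor) -> trt.ITensor:
          cast = network.add_cast(x, trt.float16)  # explicit dtype replaces precision hints
          return cast.get_output(0)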

  • Deprecated the listed versions of the following plugins:

    • DecodeBbox3DPlugin (version 1)

    • DetectionLayer_TRT (version 1)

    • EfficientNMS_TRT (version 1)

    • FlattenConcat_TRT (version 1)

    • GenerateDetection_TRT (version 1)

    • GridAnchor_TRT (version 1)

    • GroupNormalizationPlugin (version 1)

    • InstanceNormalization_TRT (version 2)

    • ModulatedDeformConv2d (version 1)

    • MultilevelCropAndResize_TRT (version 1)

    • MultilevelProposeROI_TRT (version 1)

    • RPROI_TRT (version 1)

    • PillarScatterPlugin (version 1)

    • PriorBox_TRT (version 1)

    • ProposalLayer_TRT (version 1)

    • ProposalDynamic (version 1)

    • Region_TRT (version 1)

    • Reorg_TRT (version 2)

    • ResizeNearest_TRT (version 1)

    • ScatterND (version 1)

    • VoxelGeneratorPlugin (version 1)

    For some plugins, alternatives are available via native layers, while others are being deprecated without direct alternatives because they primarily target outdated use cases. Refer to each plugin’s README.md file, located in the respective source folder for that plugin, for specific information about the deprecation.
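
    For instance, use cases served by the ScatterND plugin can typically be expressed with the native scatter layer; a hedged Python sketch (shapes and names are illustrative):

      import tensorrt as trt

      logger = trt.Logger(trt.Logger.WARNING)
      builder = trt.Builder(logger)
      network = builder.create_network(
          1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
      )

      data = network.add_input("data", trt.float32, (4, 4))
      indices = network.add_input("indices", trt.int32, (2, 2))
      updates = network.add_input("updates", trt.float32, (2,))

      # Native ONNX-ScatterND-style scatter, replacing the deprecated plugin.
      scatter = network.add_scatter(data, indices, updates, trt.ScatterMode.ND)
      network.mark_output(scatter.get_output(0))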

  • The TensorRT static libraries are deprecated on Linux starting with TensorRT 10.11. If you are using the static libraries to build your application, migrate to building with the shared libraries (see the linking sketch after this list). The following library files will be removed in TensorRT 11.0.

    • libnvinfer_static.a

    • libnvinfer_plugin_static.a

    • libnvinfer_lean_static.a

    • libnvinfer_dispatch_static.a

    • libnvinfer_vc_plugin_static.a

    • libnvonnxparser_static.a

    • libonnx_proto.a
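
    The linking sketch referenced above (library and application names are illustrative; adjust to your build system):

      # Before (static, deprecated): linking against libnvinfer_static.a, libnvonnxparser_static.a, ...
      # After (shared):
      g++ app.o -o app -lnvinfer -lnvinfer_plugin -lnvonnxparser -lcudart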

Fixed Issues#

  • Fixed an issue where using a batch size of 4096 could have caused accuracy degradation on DLA for some networks.

  • Fixed an issue where networks with ILoopOutputLayers that defined scan outputs (LoopOutput::kCONCATENATE or LoopOutput::kREVERSE) but were not counted loops (no ITripLimitLayer with TripLimit::kCOUNT) would be incorrectly rejected even when the ILoopOutputLayers had their second inputs (concatenation length value) set.

  • Fixed an intermittent crash issue when running Swin Transformer models.

  • Fixed an issue where engines could not build for quantized MatMul layers that could be horizontally fused when the quantization scales were per-tensor.

  • Fixed an accuracy issue on Conv layers in the SDXL network on NVIDIA B200.

  • Fixed an issue where different batch sizes could generate different outputs when running OSS demoBERT FP16 inference on H20 and RTX5080 GPUs. This can now be resolved by generating an engine with the --distributive_independence flag.

  • Fixed an accuracy issue when running certain networks on NVIDIA HGX H20.

Known Issues#

Functional

  • There is a known failure with an error message showing Failed to find fallback kernel when building engines with Conv layers in INT8 precision on Blackwell GPUs. If you encounter this, set the max_num_tactics field to 999 in the builder config or add the --maxTactics=999 flag to the trtexec command to work around it (a short sketch appears at the end of this Functional list). This will be fixed in the next release.

  • The FLUX Transformer model may produce NaN outputs when run with 2048x2048 spatial dimensions. This can be worked around by using different spatial dimensions.

  • Support for B100 and B200 on Windows is considered experimental. Some networks may fail to run due to missing kernels for this GPU and OS combination. We plan to improve this support in a future release; for now, it remains experimental.

  • Inputs to the IRecurrenceLayer must always have the same shape. This means that ONNX models with loops whose recurrence inputs change shapes will be rejected.

  • CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.

  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.

  • Multi-Head Attention fusion might not occur when the number of heads is small, which can affect performance.

  • An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.

  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress these issues, provide a Valgrind suppression file with the following contents and add the option --keep-debuginfo=yes to the Valgrind command line when running the memory leak check.

    {
        Memory leak errors with dlopen
        Memcheck:Leak
        match-leak-kinds: definite
        ...
        fun:*dlopen*
        ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
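
    For example, the tool can then be invoked as follows (suppression file name hypothetical):

      valgrind --keep-debuginfo=yes --suppressions=trt.supp ./my_tensorrt_app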
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a could not find any implementation error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.

  • Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.

  • Exclusive padding with kAVERAGE pooling is not supported.

  • Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.

  • The inplace_add mini-sample of the quickly_deployable_plugins Python sample may produce incorrect outputs on Windows. This will be fixed in a future release.

  • When linking with libcudart_static.a using a RedHat gcc-toolset-11 or earlier compiler, you may encounter an issue where exception handling does not work: when an exception is thrown, the catch is ignored and an abort is raised, terminating the program. This may be related to a linker bug that causes the eh_frame_hdr ELF segment to be empty. You can work around this issue by using a newer linker, such as the one from gcc-toolset-13.

  • TensorRT may exit if inputs with invalid values are provided to the RoiAlign plugin (ROIAlign_TRT), especially if there is inconsistency in the indices specified in the batch_indices input and the actual batch size used.

  • In the Validate against Ground Truth section of the efficientnet samples, the link to download Caffe’s ILSVRC2012 auxiliary package is unstable. Therefore, the download might fail intermittently.

  • The ONNX specification of the NonMaxSuppression operation requires the iou_threshold parameter to be in the range [0.0-1.0]. However, TensorRT does not validate this parameter and will accept values outside of this range, in which case the engine will continue executing as if the value were clamped to this range.

  • PluginV2 in a loop or conditional scope is not supported. Upgrade to the PluginV3 interface as a workaround. This will impact some TensorRT-LLM models with GEMM plugins in a conditional scope.

  • There is a known host memory leak issue when building TensorRT engines on NVIDIA Blackwell GPUs.

  • DynamicQuantize does not support use cases where the batch dimension exceeds INT32_MAX.
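
The Blackwell INT8 fallback-kernel workaround mentioned earlier in this list, as a sketch (the max_num_tactics field name is taken from that note; verify against the API reference):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()
    config.max_num_tactics = 999  # workaround from the release note above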

Performance

  • Up to 7% performance regression compared to TensorRT 10.11 for VAE networks in INT8 precision.

  • Up to 20% performance regression compared to TensorRT 10.11 on Blackwell GPUs for some ConvNets.

  • Up to 6% performance regression compared to TensorRT 10.9 for ConvNext in INT8 precision on Hopper and Ampere GPUs.

  • CPU peak memory usage regression with the roberta_base engine on Ampere GPUs compared to TensorRT 10.7.

  • Up to 10% performance regression for Megatron networks in FP32 precision compared to TensorRT 10.8 for BS4.

  • Up to 100 MB context memory size regression compared to TensorRT 8.6 on Hopper GPUs for CRNN (Convolutional Recurrent Neural Network) models. Inference performance is not affected.

  • Up to 9% inference performance regression for the StableDiffusion v2.0/2.1 VAE network in FP16 precision on Hopper GPUs compared to TensorRT 10.6 in a CUDA 11.8 environment. This issue can be fixed by upgrading CUDA to 12.6.

  • Up to 60% performance regression compared to TensorRT 8.6 on Ampere GPUs for group convolutions with N channels per group, where N is not a power of 2. This can be worked around by padding N to the next power of 2.

  • Up to 22% context memory size regression for HiFi-GAN networks in INT8 precision compared to TensorRT 10.5 on Ampere GPUs.

  • Up to 7% performance regression for Megatron networks in FP16 precision compared to TensorRT 10.6 for BS1 and Seq128 on H100 GPUs.

  • Up to 10% performance regression for BERT networks exported from TensorFlow2 in FP16 precision compared to TensorRT 10.4 for BS1 and Seq128 on A16 GPUs.

  • Up to 16% regression in context memory usage for StableDiffusion XL VAE network in FP8 precision on H100 GPUs compared to TensorRT 10.3 due to a necessary functional fix.

  • Up to 15% regression in context memory usage for networks containing InstanceNorm and Activation ops compared to TensorRT 10.0.

  • Up to 15% CPU memory usage regression for mbart-cnn/mamba-370m in FP16 precision and OOTB mode on NVIDIA Ada Lovelace GPUs compared to TensorRT 10.2.

  • Up to 6% performance regression for BERT/Megatron networks in FP16 precision compared to TensorRT 10.2 for BS1 and Seq128 on H100 GPUs.

  • Up to 6% performance regression for Bidirectional LSTM in FP16 precision on H100 GPUs compared to TensorRT 10.2.

  • Performance gaps between engines built with REFIT enabled and engines built with REFIT disabled.

  • Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.

  • Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.

  • Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.

  • Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.

  • Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.

  • Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.

  • Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.

  • Convolution on a tensor with an implicitly data-dependent shape may run slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.

  • Up to 5% performance drop for networks using sparsity in FP16 precision.

  • Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.

  • Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this.

  • In explicitly quantized networks, a group convolution with a Q/DQ pair before but no Q/DQ pair after runs with INT8-IN-FP32-OUT mixed precision. However, NVIDIA Hopper may fall back to FP32-IN-FP32-OUT if the input channel count is small.

  • Engines built with kREFIT or kREFIT_IDENTICAL have performance regressions compared with non-refit engines when convolution layers are present within a branch or loop and the precision is FP16/INT8. This issue will be addressed in future releases.