TensorRT 10.0.1 Release Notes#

These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.

Announcements#

glibc Reversion: For TensorRT 10.0.0 EA, the minimum glibc version for the Linux x86 build was 2.28. This toolchain change was reverted for TensorRT 10.0.1 GA and is now compatible with glibc 2.17, which was the minimum glibc version supported by TensorRT 8.6, restoring compatibility with older Linux distributions.

Platform Support Changes:

  • RedHat/CentOS 7.x: No longer officially supported starting with TensorRT 10.0.

  • RedHat/Rocky Linux 9.x: Now supported starting with TensorRT 10.0.

Python Version Support:

  • Python 3.6 and 3.7: Support has been dropped starting with TensorRT 10.0.

  • Python 3.12: Support has been added starting with TensorRT 10.0.

Breaking ABI Change on Windows: TensorRT 10.0 GA broke ABI compatibility relative to TensorRT 10.0 EA on Windows by adding the TensorRT major version to the DLL filename (nvinfer.dll → nvinfer_10.dll). This allows applications to link against different TensorRT major versions simultaneously.

NVIDIA Volta Deprecation: NVIDIA Volta support (GPUs with compute capability 7.0) is deprecated starting with TensorRT 10.0 and may be removed after September 2024. Plan migration to supported GPU architectures.

Parser and Framework Removals: ICaffeParser, IUffParser, and the libnvparsers library have been removed. The uff, graphsurgeon, and related packages are no longer included in TensorRT packages. Use the ONNX parser for model import.

Plugin Deprecations: IPluginV2DynamicExt, IPluginV2IOExt, and IPluginCreator have been deprecated. Use IPluginV3 and IPluginCreatorV3One instead for new plugin development.

Key Features and Enhancements#

Weight Streaming

  • Large Model Support: Added a new kWEIGHT_STREAMING flag to the builder and streaming budget APIs in the runtime to enable running strongly typed models larger than device memory. For example, a strongly typed model with 32 GB of weights can run on a device with less than 32 GB of VRAM, enabling deployment of LLMs on resource-constrained GPUs.
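
    A minimal sketch of the build-time and runtime steps, assuming the TensorRT 10 C++ names NetworkDefinitionCreationFlag::kSTRONGLY_TYPED, BuilderFlag::kWEIGHT_STREAMING, ICudaEngine::getStreamableWeightsSize, and ICudaEngine::setWeightStreamingBudget (verify exact signatures against NvInferRuntime.h); builder, config, and engine creation is omitted:

        // Build time: the network must be strongly typed and weight streaming must be requested.
        auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(
            1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kSTRONGLY_TYPED)));
        config->setFlag(nvinfer1::BuilderFlag::kWEIGHT_STREAMING);

        // Run time: cap the device memory used for weights; the rest is streamed from host memory.
        int64_t const streamable = engine->getStreamableWeightsSize();
        engine->setWeightStreamingBudget(streamable / 4);   // keep roughly a quarter of the weights resident
        auto context = std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext());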

Quantization Enhancements

  • INT4 Weight Only Quantization (WoQ): Added support for weight compression using INT4 on Hopper GPUs, enabling significant memory savings for large models. Note that WoQ currently performs extra copies that increase latency and will be further optimized in future releases.

  • Block Quantization: Added Block Quantization mode, allowing setting scales with high granularity (supported by INT4 WoQ only), improving quantization flexibility and accuracy.

Plugin System V3

  • IPluginV3 Framework: A new generation of TensorRT custom layers is now available with plugins implementing IPluginV3 and plugin creators implementing IPluginCreatorV3One. New features include:

    • Data-dependent output shapes

    • Shape tensor inputs

    • Custom tactics

    • Timing caching

  • Plugin Registry Enhancements: Added a key-value store to the plugin registry for registration and lookup of user-defined resources.

Memory and Resource Management

  • Runtime Allocation Strategies: createExecutionContext now accepts an argument specifying the allocation strategy (kSTATIC, kON_PROFILE_CHANGE, and kUSER_MANAGED) for execution context device memory. For user-managed allocation, a new API, updateDeviceMemorySizeForShapes, queries the required size based on actual input shapes, as shown in the sketch after this list.

  • Shared Memory Control: Added the kTACTIC_SHARED_MEMORY flag for control over the overall shared memory budget used for TensorRT backend CUDA kernels. This is useful when TensorRT must share GPUs with other applications. By default, the value is set to device max capability.
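
    The following sketch combines the two controls above, assuming kTACTIC_SHARED_MEMORY is exposed as a memory pool limit (IBuilderConfig::setMemoryPoolLimit with MemoryPoolType::kTACTIC_SHARED_MEMORY) and that the user-managed path uses ExecutionContextAllocationStrategy::kUSER_MANAGED, IExecutionContext::updateDeviceMemorySizeForShapes, and IExecutionContext::setDeviceMemory; the input name and shape are illustrative:

        // Builder side: cap the shared memory budget available to TensorRT tactics (48 KiB here).
        config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kTACTIC_SHARED_MEMORY, 48 << 10);

        // Runtime side: the application owns the execution-context device memory.
        auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
            engine->createExecutionContext(nvinfer1::ExecutionContextAllocationStrategy::kUSER_MANAGED));
        context->setInputShape("input", nvinfer1::Dims4{1, 3, 224, 224});   // illustrative tensor name/shape
        auto const scratchBytes = context->updateDeviceMemorySizeForShapes();
        void* scratch{nullptr};
        cudaMalloc(&scratch, scratchBytes);
        context->setDeviceMemory(scratch);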

Refitting and Weight Management

  • REFIT_IDENTICAL Flag: The new REFIT_IDENTICAL flag instructs the TensorRT builder to optimize under the assumption that the engine will be refitted with weights identical to those provided at build time. Using this flag with kSTRIP_PLAN minimizes plan size in deployment scenarios where, for example, the plan is shipped alongside an ONNX model containing the weights (see the sketch after this list).

  • QAT Transformer Support: QAT transformer networks now work with refit, improving flexibility for quantization-aware training workflows.
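
    For the REFIT_IDENTICAL/kSTRIP_PLAN workflow above, a sketch of the deployment flow, assuming the ONNX parser refitter entry points in NvOnnxParser.h (nvonnxparser::createParserRefitter, IParserRefitter::refitFromFile) and nvinfer1::createInferRefitter; the model path is illustrative:

        // Build time: strip the weights from the plan and assume identical weights at refit time.
        config->setFlag(nvinfer1::BuilderFlag::kSTRIP_PLAN);
        config->setFlag(nvinfer1::BuilderFlag::kREFIT_IDENTICAL);

        // Deployment: restore the weights from the ONNX model shipped alongside the plan.
        auto refitter = std::unique_ptr<nvinfer1::IRefitter>(nvinfer1::createInferRefitter(*engine, logger));
        auto onnxRefitter = std::unique_ptr<nvonnxparser::IParserRefitter>(
            nvonnxparser::createParserRefitter(*refitter, logger));
        onnxRefitter->refitFromFile("model.onnx");   // illustrative path
        refitter->refitCudaEngine();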

Debugging and Inspection

  • Debug Tensors: Added an API to mark tensors as debug tensors at build time. At runtime, each time the value of the tensor is written, a user-defined callback function is invoked with the value, type, and dimensions, enabling detailed debugging of inference execution (see the sketch after this list).

  • Layer Indexing: Added indexing for layer information (--dumpLayer) and profiling information (--dumpProfile). Layer names reported by IEngineInspector now match the layer names reported by IProfiler.
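
    For the debug tensor item above, a hypothetical sketch of the callback, assuming the IDebugListener interface (with its processDebugTensor method) from NvInferRuntime.h together with INetworkDefinition::markDebugTensor and IExecutionContext::setDebugListener; confirm the exact callback signature against your headers:

        class PrintingDebugListener : public nvinfer1::IDebugListener
        {
        public:
            // Invoked each time a marked debug tensor is written during inference.
            bool processDebugTensor(void const* addr, nvinfer1::TensorLocation location, nvinfer1::DataType type,
                nvinfer1::Dims const& shape, char const* name, cudaStream_t stream) override
            {
                std::cout << "debug tensor " << name << " written, nbDims=" << shape.nbDims << "\n";
                return true;   // see the IDebugListener documentation for the meaning of the return value
            }
        };

        // Build time: mark the tensor of interest (someTensor is an ITensor* from the network, illustrative).
        network->markDebugTensor(*someTensor);

        // Run time: attach the listener to the execution context.
        PrintingDebugListener listener;
        context->setDebugListener(&listener);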

ONNX Parser Improvements

  • Enhanced Error Reporting: The ONNX parser returns the list of all nodes that can be statically determined as unsupported when the call to parse() fails. The error reporting contains node name, node type, reason for failure, and the local function stack if the node is located in an ONNX local function. The number of errors can be queried with getNbErrors(), and information about individual errors can be obtained from getError().
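
    A sketch of iterating over the reported errors with getNbErrors() and getError(), assuming the nvonnxparser::IParser API from NvOnnxParser.h (createParser, parseFromFile, and the IParserError accessors desc() and node()); the model path is illustrative:

        auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
        if (!parser->parseFromFile("model.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
        {
            for (int32_t i = 0; i < parser->getNbErrors(); ++i)
            {
                auto const* err = parser->getError(i);
                // desc() contains the reason for failure; node() is the index of the offending ONNX node.
                std::cerr << "ONNX node " << err->node() << ": " << err->desc() << "\n";
            }
        }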

Developer Tools

Samples and Tools

  • Weight Stripping Sample: Added a new Python sample sample_weight_stripping to showcase building and refitting weight-stripped engines from ONNX models.

Packaging Improvements

  • Simplified Python Installation: The tensorrt Debian and RPM meta-packages now install the TensorRT Python binding packages (python3-libnvinfer, python3-libnvinfer-lean, and python3-libnvinfer-dispatch) automatically. Previously, installing the python3-libnvinfer-dev(el) package was required as well to support both C++ and Python.

Bug Fixes and Performance

  • Windows Plugin Library: Fixed an issue where the nvinfer_plugin.lib library was incorrectly distributed as a static linking library starting with TensorRT 9.0. TensorRT 10.0 reverts this library to a dynamic linking library matching the behavior of TensorRT 8.6.

  • BERT Performance: Fixed an up to 9% performance drop for BERT networks with gelu_erf activation in BF16 precision on NVIDIA Ampere GPUs.

  • ViT Performance: Fixed an up to 11% performance drop for ViT networks in TF32 precision on NVIDIA Ampere GPUs.

  • Temporal Fusion Transformers: Fixed an up to 23% performance regression for Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and Ampere GPUs.

  • Builder Optimization: Fixed an issue where a higher builder optimization level did not always give better performance when compared to a lower builder optimization level (up to 27% performance difference).

  • SegResNet and Stable Diffusion: Fixed an up to 15% performance regression for SegResNet and Stable Diffusion VAE in FP16 precision.

  • Windows Stability: Fixed temporary DLL file cleanup on Windows when running in version compatibility (VC) mode. Fixed crashes when building transformer-based networks on Windows 10 and H100.

  • Compute Sanitizer Compatibility: Fixed a known issue with the compute sanitizer in CUDA Toolkit 12.3 that might cause the target application to crash (fixed in CUDA Toolkit 12.4).

  • Multi-Head Attention (MHA) Refit: Fixed Multi-Head Attention (MHA) fusion to work with refit enabled.

  • Memory Management: Fixed an up to 144 MB peak GPU memory usage increase when building engines for ResNet-50 in INT8 precision on L4T Orin. Fixed hardware compatible engines built with CUDA versions older than 11.5 to no longer crash during inference on GPUs with lower compute capability.

  • Q/DQ Operations: Fixed numerical overflow issues when using FP16 scales for Q/DQ ops.

  • Large Tensor Support: Fixed UNets with tensors containing >2^31 elements to no longer fail during engine building.

  • TensorRT-LLM INT8: Fixed engine build failures in TensorRT-LLM with INT8 kv-cache due to insufficient custom scales.

  • ONNX Parser Refitter: Fixed the ONNX Parser Refitter to properly refit weights defined in nested ONNX structures such as If, Loop, or Scan operations.

  • MHA Output Matching: Fixed output mismatches in the _gemm_mha_v2 operation when building engines with FP16 precision.

  • Context Memory Stability: Fixed up to 14% context memory usage fluctuations for 3DUnet networks due to different tactics being selected.

Breaking API Changes#

Attention

  • TensorRT 10.0 GA broke ABI compatibility relative to TensorRT 10.0 EA on Windows by adding the TensorRT major version to the DLL filename. TensorRT 10.0 EA and prior TensorRT releases have historically named the DLL file nvinfer.dll, while 10.0 GA renamed the DLL file to nvinfer_10.dll. This same naming pattern was also applied to the other TensorRT DLL files in the zip package. We strive not to break backward compatibility between releases with the same major version, but this change will allow applications to link against different TensorRT major versions at the same time.

  • In TensorRT 9.0, due to the introduction of INT64 as a supported data type, ONNX models with INT64 I/O require INT64 bindings. Prior to this release, such models required INT32 bindings.

  • Release 10.0 GA enforces the restriction that NvInferRuntimeBase.h should not be directly included. The restriction was merely documented when 8.6 introduced the header.

  • In TensorRT 9.0, we removed ICaffeParser, IUffParser, and related classes and functions. The following APIs are removed:

    • nvcaffeparser1::IBlobNameToTensor

    • nvcaffeparser1::IBinaryProtoBlob

    • nvcaffeparser1::IPluginFactoryV2

    • nvcaffeparser1::ICaffeParser

    • nvcaffeparser1::createCaffeParser

    • nvcaffeparser1::shutdownProtobufLibrary

    • createNvCaffeParser_INTERNAL

    • nvinfer1::utils::reshapeWeights

    • nvinfer1::utils::reorderSubBuffers

    • nvinfer1::utils::transposeSubBuffers

    • nvuffparser::UffInputOrder

    • nvuffparser::FieldType

    • nvuffparser::FieldMap

    • nvuffparser::FieldCollection

    • nvuffparser::IUffParser

    • nvuffparser::createUffParser

    • nvuffparser::shutdownProtobufLibrary

    • createNvUffParser_INTERNAL

  • With removal of ICaffeParser and IUffParsers, the libnvparsers library is removed.

  • uff, graphsurgeon, and related packages are removed from TensorRT packages.

  • TacticSource::kCUDNN and TacticSource::kCUBLAS are disabled by default. The cudnnContext* and cublasContext* parameters of the nvinfer1::IPluginV2Ext::attachToContext function are set to nullptr when the corresponding TacticSource flags are unset.

  • IPluginCreatorInterface has been added as a base class to IPluginCreator.

  • Overloads have been added to the methods IPluginRegistry::deregisterCreator and IPluginRegistry::registerCreator that take in IPluginCreatorInterface references.

Compatibility#

Limitations#

  • There are two modes of DLA softmax, and the mode is chosen automatically based on the shape of the input tensor:

    • the first mode triggers when all non-batch, non-axis dimensions are 1, and

    • the second mode triggers in other cases if valid. This mode is supported only on DLA 3.9.0 and later, and it involves approximations that may introduce small errors. Also, batch sizes greater than 1 are supported only for DLA 3.9.0 and later.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.

  • The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless the user merges the transposes manually in the model definition in advance.

  • In explicitly quantized networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper it may fall back to FP32-IN-FP32-OUT if the input channel count is small.

  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.

  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset containing the corresponding function ops, such as opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.

  • QuantizeLayer and DequantizeLayer only support FP32 scale and data, even when using ONNX opset 19. If the input is not FP32, you must add a Cast to FP32 on the input to QuantizeLayer, and a Cast from FP32 at the output of DequantizeLayer.

  • EngineInspector::getLayerInformation may return incomplete JSON data for some engines produced by TensorRT 9.0. When this happens, TensorRT Engine Explorer cannot be used to analyze the engine or generate a graph of the engine layers.

  • The kREFIT and kREFIT_IDENTICAL builder flags have performance regressions compared with non-refit engines when convolution layers are present within a branch or loop and the precision is FP16/INT8.

  • The new kTACTIC_SHARED_MEMORY flag cannot restrict shared memory usage for depthwise convolution, depthwise-separable convolution, and certain corner-case fused convolution-activation kernels. You need to run Nsight to verify the shared memory usage of the resulting engine.

  • Shape tensor inputs will not be added to TensorRT plugins implementing IPluginV3 by the TensorRT ONNX parser; all inputs are passed as regular device inputs. This is in contrast to the addPluginV3 API, which allows shape tensor inputs to be specified for the plugin.

  • Weight streaming currently does not work with CUDA Graph.

  • Multiple contexts for one engine with Weight Streaming enabled cannot run in parallel on the device and will be serialized automatically.

  • Weight streaming mainly supports GEMM-based networks like Transformers for now. Convolution-based networks may have only a few weights that can be streamed.

  • IPluginRegistry::loadLibrary() and IPluginRegistry::deregisterLibrary() functionality is not supported for plugin shared libraries containing V3 plugin creators (IPluginCreatorV3One).

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 10.0 will be retained until 3/2025.

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • Deprecated NVIDIA Volta support (GPUs with compute capability 7.0) starting with TensorRT 10.0. Volta support may be removed after 9/2024.

  • Deprecated the kWEIGHTLESS builder flag. Superseded by the kSTRIP_PLAN builder flag. kSTRIP_PLAN works with either the kREFIT flag or the new kREFIT_IDENTICAL flag, defaulting to the latter if neither is set.

  • In 10.0, we removed APIs that had been deprecated in 9.3 and earlier releases; all of the removed APIs were deprecated before March 2023. Version compatibility is expected to work between the 8.6, 9.x, and 10.0 versions; note, however, that version compatibility is not supported for implicit batch mode, which was removed in 10.0. If you are unfamiliar with these changes, refer to our sample code for clarification. In light of the changes to the API in TensorRT 10.0, we’ve prepared an API Migration Guide to highlight the API modifications.

  • Implicit batch support has been removed; networks are now always treated as having an explicit batch dimension.

  • Deprecated TacticSource::kCUDNN and TacticSource::kCUBLAS flags.

  • Deprecated IPluginV2DynamicExt. Use IPluginV3 instead.

  • IPluginCreator::getTensorRTVersion() has been removed.

  • Deprecated IPluginV2IOExt. Use IPluginV3 instead.

  • Deprecated IPluginCreator. There is no alternative factory class for IPluginV2-derivative plugin base classes, as they are all deprecated as well. Implement IPluginV3 and its corresponding factory class IPluginCreatorV3One.

  • Deprecated the following APIs in IPluginRegistry:

    • IPluginRegistry::registerCreator(IPluginCreator&). Use its overload IPluginRegistry::registerCreator(IPluginCreatorInterface&) instead.

    • IPluginRegistry::deregisterCreator(IPluginCreator const&). Use its overload IPluginRegistry::deregisterCreator(IPluginCreatorInterface const&) instead.

    • IPluginRegistry::getPluginCreator. Use IPluginRegistry::getCreator instead.

    • IPluginRegistry::getPluginCreatorList. Use IPluginRegistry::getAllCreators instead.

Fixed Issues#

  • The nvinfer_plugin.lib library within the Windows package was incorrectly distributed as a static linking library starting with TensorRT 9.0. TensorRT 10.0 reverts this library to a dynamic linking library matching the behavior of TensorRT 8.6.

  • There was an up to 9% performance drop for BERT networks with gelu_erf activation in BF16 precision compared to TensorRT 9.1 on NVIDIA Ampere GPUs.

  • There was an up to 11% performance drop for ViT networks in TF32 precision compared to TensorRT 9.0 on NVIDIA Ampere GPUs.

  • There was an up to 23% performance regression compared to TensorRT 8.5 on Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and NVIDIA Ampere GPUs.

  • A higher builder optimization level did not always give better performance when compared to a lower builder optimization level; this could happen on all platforms, with differences of up to 27%. The workaround was to build the engine using a lower builder optimization level.

  • If an ONNX model contained a Range operator and its limit input was a data-dependent tensor, engine building would likely fail.

  • There was an up to 15% performance regression for SegResNet and StableDiffusion VAE in FP16 precision compared to TensorRT 9.3.

  • TensorRT did not clean up temporary DLL files automatically on Windows when running in version compatibility (VC) mode; the TensorRT library was internally holding open file references when the application finished.

  • TensorRT may have crashed when building transformer-based networks on Windows 10 and H100.

  • There was a performance drop for the QDQ-Gemm pattern on RTX-Titan in weightless mode.

  • There was a known issue with the compute sanitizer in CUDA Toolkit 12.3 that might cause the target application to crash. This has been fixed in CUDA Toolkit 12.4.

  • Indexing for layer information (--dumpLayer) and its profiling information (--dumpProfile) has been added; the layer names reported by IEngineInspector now match the layer names reported by IProfiler.

  • Multihead attention fusion now works with refit enabled.

  • There was an up to 144 MB peak GPU memory usage increase compared to TensorRT 8.6 when building engines for ResNet-50 in INT8 precision on the L4T Orin platform.

  • Hardware compatible engines built with CUDA versions older than 11.5 will no longer crash during inference when run on a GPU with a compute capability lower than that of the GPU where the engine was built.

  • Some networks failed at the engine building phase on Windows and H100 but could execute on Linux. The root cause was a builder issue where fusion compilation failed.

  • Using FP16 scales for Q/DQ ops may have resulted in numerical overflow. The workaround was to use FP32 scales for Q/DQ ops instead. This issue has been fixed.

  • UNets with tensors containing >2^31 elements may have failed during the engine building step.

  • Running TensorRT-LLM with TensorRT 10.0 with INT8 kv-cache would result in engine build failure due to insufficient custom scales. The workaround was to enable StronglyTyped mode. This issue has been fixed.

  • There were up to 21% peak GPU memory usage fluctuations when building the engine for the same network back to back due to different tactics being selected.

  • The ONNX Parser Refitter could not refit weights defined in nested ONNX structures such as If, Loop, or Scan operations. This issue has been fixed.

  • When the _gemm_mha_v2 operation was used, the outputs mismatched the output of PyTorch or the CPU executor (onnxRT). This problem showed up only when building engines with FP16 precision, as _gemm_mha_v2 has an implementation only for FP16. This issue has been fixed.

  • Compute Sanitizer from CUDA Toolkit 12.0/12.1 may report a false alarm about invalid memory access in generatedNativePointwise kernels. This issue was fixed in CUDA Toolkit 12.2.

  • There were up to 14% context memory usage fluctuations compared to TensorRT 9.1 when building the engine for 3DUnet networks due to different tactics being selected. This issue has been fixed.

Known Issues#

Functional

  • When using the Polygraphy engine_from_network API, if both refittable and strip_plan are enabled in the create_config, the final engine weights are not stripped. To work around this, only include strip_plan in the create_config.

  • TensorRT does not support attention operations for tensors larger than the int32_t maximum. Plugins can be used to work around this issue. The issue will be fixed in a future release.

  • The API docs incorrectly state that Cast to the INT8 format is possible, but this path is not supported. Use a QuantizeLinear node instead.

  • Allocated GPU memory during autotuning might not be freed correctly if allocation failed due to inadequate resources, causing build time memory usage to be larger than that of inference time.

  • When using refit on multi-head attention or if/while loops with explicit quantization, the refit process might be slow due to the implementation’s memcpyDeviceToHost for the Q/DQ scales. This issue will be addressed in a future release.

  • There are some issues when running TensorRT-LLM with TensorRT 10.0 with the StronglyTyped mode enabled. This can be worked around by disabling the StronglyTyped mode.

  • CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.

  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.

  • Multi-Head Attention fusion might not happen and affect performance if the number of heads is small.

  • Hardware forward compatibility (HFC) is broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. A workaround is to only use FP32 mode on L4T Concord or turn off HFC.

  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.

  • An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.

  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. To suppress them, provide a Valgrind suppression file with the following contents and add the option --keep-debuginfo=yes to the Valgrind command line.

    {
        Memory leak errors with dlopen.
        Memcheck:Leak
        match-leak-kinds: definite
        ...
        fun:*dlopen*
        ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a "could not find any implementation" error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.

  • Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.

  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.

  • Exclusive padding with kAVERAGE pooling is not supported.

  • Running sync/race check with newer Compute Sanitizer on L4T may hit a hang issue. The workaround is to try an older version of Compute Sanitizer.

  • The Valgrind tool found a memory leak on L4T with CUDA 12.4 due to a known driver issue. This is expected to be fixed in CUDA 12.6.

  • Asynchronous CUDA calls are not supported in the user-defined processDebugTensor function for the debug tensor feature due to a bug in Windows 10.

  • The sample sampleNonZeroPlugin fails to build when cross compiling for L4T. You can workaround this issue and continue building the other samples by modifying samples/Makefile and removing the line containing sampleNonZeroPlugin. This issue will be fixed in the next release.

  • The sample sampleNonZeroPlugin does not guarantee CUDA minor version compatibility. That is, if built against a newer CUDA Toolkit release, it may not function properly on older drivers, even within the same major CUDA release family.

  • LSTM networks could fail to build with the timing cache enabled. This was observed on one GPU platform and only when building with a cache that had pre-existing entries. The error signature would contain:

    Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:operator():1502]
        Internal bug. Please report with reproduction steps.
    

    The workaround was to disable the timing cache or start a fresh one. This issue has been fixed.

  • IPluginRegistry::deregisterLibrary() will not work with plugin shared libraries with the defined entry point getPluginCreators(). IPluginRegistry::loadLibrary() is not impacted. To deregister plugins contained in such a library, manually query the library for getPluginCreators(), and invoke IPluginRegistry::deregisterCreator() for each creator retrieved.

Performance

  • There is an up to 9% performance regression for StableDiffusion VAE networks on A16 and A40 compared to TensorRT 9.2. This can be worked around by disabling the kNATIVE_INSTANCENORM flag in the ONNX parser or adding the --pluginInstanceNorm flag to trtexec.

  • There is an up to 4x performance regression for networks containing GridSample ops compared to TensorRT 9.2.

  • Up to 60 MB engine size fluctuations for the BERT-Large INT8-QDQ model on Orin due to unstable tactic selection.

  • There is a small chance that TensorRT will hang when running on H100 with the r550 CUDA driver when CUDA graphs are used. A workaround is to use the r535 CUDA driver instead or to avoid using CUDA graphs.

  • Up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 precision compared to TensorRT 9.3.

  • There are performance gaps for StableDiffusion networks between Windows and Linux platforms.

  • On A30, some fused MHA (Multi-Head Attention) performance is not optimized yet. This will be improved upon in future TensorRT versions.

  • Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.

  • Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.

  • Up to 2.5x build time increase compared to TensorRT 9.0 for certain BERT-like models due to additional tactics available for evaluation.

  • Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.

  • Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.

  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.

  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.

  • There is a known issue on H100 that may lead to GPU hang when running TensorRT with high persistentCache usage. Limit the usage to 40% of L2 cache size as a workaround.

  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).

  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.

  • Up to 5% performance drop for networks using sparsity in FP16 precision.

  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.

  • Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.

  • Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this.