TensorRT 10.0.0 Early Access (EA) Release Notes#

These Release Notes apply to x86 Linux and Windows users, ARM-based CPU cores for Server Base System Architecture (SBSA) users on Linux, and JetPack users. This release includes several fixes from the previous TensorRT releases and additional changes.

Announcements#

Attention

This is an Early Access (EA) release. Some features and APIs may change in the General Availability (GA) release.

glibc Version Change (Temporary): For TensorRT 10.0.0 EA, the minimum glibc version for the Linux x86 build is 2.28, making it compatible with RedHat 8.x (and derivatives), newer RedHat distributions, and Ubuntu 20.04 and newer. This toolchain change will be reverted for TensorRT 10.0 GA and will be compatible with glibc 2.17, which was the minimum glibc version supported by TensorRT 8.6.

Platform Support Changes:

  • RedHat/CentOS 7.x: No longer officially supported starting with TensorRT 10.0.

  • RedHat/Rocky Linux 9.x: Now supported starting with TensorRT 10.0.

Python Version Support:

  • Python 3.6 and 3.7: Support has been dropped starting with TensorRT 10.0.

  • Python 3.12: Support has been added starting with TensorRT 10.0.

Breaking ABI Change (Coming in GA): TensorRT 10.0 GA will break ABI compatibility relative to TensorRT 10.0 EA on Windows by adding the TensorRT major version to the DLL filename (nvinfer.dll becomes nvinfer_10.dll). This allows applications to link against different TensorRT major versions simultaneously.

Parser and Framework Removals: ICaffeParser, IUffParser, and the libnvparsers library have been removed. The uff, graphsurgeon, and related packages are no longer included in TensorRT packages. Use the ONNX parser for model import.

Plugin Deprecations: IPluginV2DynamicExt, IPluginV2IOExt, and IPluginCreator have been deprecated. Use IPluginV3 and IPluginCreatorV3One instead for new plugin development.

Key Features and Enhancements#

Weight Streaming (Preview)

  • Large Model Support: Added a new kWEIGHT_STREAMING flag to the builder and streaming budget APIs in the runtime to enable running strongly typed models larger than device memory. For example, a strongly typed model with 32 GB of weights can run on a device with less than 32 GB of VRAM, enabling deployment of LLMs on resource-constrained GPUs.

    Note

    This is a preview feature. Weight streaming currently does not work with CUDA Graph, and multiple contexts for one engine with weight streaming enabled cannot run in parallel on devices (they will be serialized automatically).
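
A minimal sketch of this workflow, assuming the kWEIGHT_STREAMING builder flag and the runtime weight streaming budget API described above; the 4 GB budget is illustrative only.

    // Minimal sketch: enable weight streaming at build time, then cap the resident
    // weight budget at runtime. The 4 GB budget is illustrative only.
    #include <memory>
    #include "NvInfer.h"

    void buildAndRunWithWeightStreaming(nvinfer1::IBuilder& builder,
                                        nvinfer1::INetworkDefinition& network, // must be strongly typed
                                        nvinfer1::IRuntime& runtime)
    {
        auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder.createBuilderConfig());
        config->setFlag(nvinfer1::BuilderFlag::kWEIGHT_STREAMING);

        auto plan = std::unique_ptr<nvinfer1::IHostMemory>(
            builder.buildSerializedNetwork(network, *config));
        auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
            runtime.deserializeCudaEngine(plan->data(), plan->size()));

        // Keep at most ~4 GB of weights resident on the device; the remainder is
        // streamed from host memory during inference.
        engine->setWeightStreamingBudget(4LL << 30);

        auto context = std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext());
        // ... set tensor addresses and enqueue as usual, keeping in mind the CUDA Graph
        // and multi-context limitations called out in the note above ...
    }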

Quantization Enhancements

  • INT4 Weight Only Quantization (WoQ): Added support for weight compression using INT4 on Hopper GPUs, enabling significant memory savings for large models. Note that WoQ currently performs extra copies that increase latency and will be further optimized in future releases.

  • Block Quantization: Added Block Quantization mode, allowing setting scales with high granularity (supported by INT4 WoQ only), improving quantization flexibility and accuracy.
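
As a rough outline of how INT4 weight-only quantization with block scales can be expressed in the builder API: store the weights as an INT4 constant, dequantize them back to a higher precision with a 2D (blocked) scale tensor, and feed the result into the GEMM. The shapes, block layout, and the three-argument addDequantize overload used below are assumptions for illustration, not a verified recipe.

    // Rough sketch: INT4 weight constant -> Dequantize (block scales) -> MatMul with
    // higher-precision activations. Shapes and scale layout are illustrative only.
    #include "NvInfer.h"

    nvinfer1::ILayer* addInt4WoqMatMul(nvinfer1::INetworkDefinition& network,
                                       nvinfer1::ITensor& activations,  // e.g. FP16 [M, K]
                                       nvinfer1::Weights int4Weights,   // packed INT4 weights [K, N]
                                       nvinfer1::Weights blockScales)   // one scale per weight block
    {
        using namespace nvinfer1;

        auto* w = network.addConstant(Dims2{4096, 4096}, int4Weights)->getOutput(0);
        auto* s = network.addConstant(Dims2{32, 4096}, blockScales)->getOutput(0); // block size 128 along K

        // Dequantize the INT4 weights to FP16 before the GEMM (weight-only quantization).
        auto* dq = network.addDequantize(*w, *s, DataType::kHALF);

        return network.addMatrixMultiply(activations, MatrixOperation::kNONE,
                                         *dq->getOutput(0), MatrixOperation::kNONE);
    }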

Plugin System V3

  • IPluginV3 Framework: A new generation of TensorRT custom layers is now available with plugins implementing IPluginV3 and plugin creators implementing IPluginCreatorV3One. New features include data-dependent output shapes, shape tensor inputs, custom tactics, and timing caching (see the creation sketch after this list).

  • Plugin Registry Enhancements: Added a key-value store to the plugin registry for registration and lookup of user-defined resources.
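
As a sketch of the creator workflow for the new interfaces, the following looks up an IPluginCreatorV3One in the registry, instantiates an IPluginV3, and adds it to a network; the plugin name, version, and empty field collection are hypothetical.

    // Minimal sketch: look up a V3 creator, create an IPluginV3, and add it to a network.
    #include <cstdint>
    #include "NvInfer.h"

    nvinfer1::IPluginV3Layer* addMyV3Plugin(nvinfer1::INetworkDefinition& network,
                                            nvinfer1::ITensor* const* inputs, int32_t nbInputs)
    {
        using namespace nvinfer1;

        // getCreator returns an IPluginCreatorInterface*; we assume the creator registered
        // under this (hypothetical) name and version implements IPluginCreatorV3One.
        auto* creator = static_cast<IPluginCreatorV3One*>(
            getPluginRegistry()->getCreator("MyPlugin", "1", ""));

        PluginFieldCollection const fc{0, nullptr};  // no creation-time fields in this sketch
        IPluginV3* plugin = creator->createPlugin("my_plugin_instance", &fc, TensorRTPhase::kBUILD);

        // No shape tensor inputs are passed in this sketch (see addPluginV3 for that option).
        return network.addPluginV3(inputs, nbInputs, nullptr, 0, *plugin);
    }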

Memory and Resource Management

  • Runtime Allocation Strategies: createExecutionContext now accepts an argument specifying the allocation strategy (kSTATIC, kON_PROFILE_CHANGE, and kUSER_MANAGED) of execution context device memory. For user-managed allocation, a new API updateDeviceMemorySizeForShapes queries the required size based on actual input shapes.

  • Shared Memory Control: Added the kTACTIC_SHARED_MEMORY flag for control over the overall shared memory budget used by TensorRT backend CUDA kernels. This is useful when TensorRT must share the GPU with other applications. By default, the value is set to the device's maximum capability.
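
A rough sketch of the user-managed allocation flow, plus capping the tactic shared-memory budget at build time; the input name, shape, and 48 KB budget are illustrative, and the shared-memory budget is assumed here to be exposed as a memory-pool limit.

    // Minimal sketch of kUSER_MANAGED device memory and a tactic shared-memory cap.
    #include <cuda_runtime_api.h>
    #include <memory>
    #include "NvInfer.h"

    void runWithUserManagedMemory(nvinfer1::ICudaEngine& engine)
    {
        using namespace nvinfer1;

        // Ask TensorRT not to allocate execution-context device memory on our behalf.
        auto context = std::unique_ptr<IExecutionContext>(
            engine.createExecutionContext(ExecutionContextAllocationStrategy::kUSER_MANAGED));

        context->setInputShape("input", Dims4{1, 3, 224, 224});  // hypothetical binding and shape

        // Query the size actually required for the shapes set above, then provide it.
        auto const scratchBytes = context->updateDeviceMemorySizeForShapes();
        void* scratch{nullptr};
        cudaMalloc(&scratch, scratchBytes);
        context->setDeviceMemory(scratch);

        // ... set tensor addresses, enqueue, synchronize, then cudaFree(scratch) ...
    }

    // Separately, at build time, the overall tactic shared-memory budget can be capped.
    void capTacticSharedMemory(nvinfer1::IBuilderConfig& config)
    {
        config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kTACTIC_SHARED_MEMORY, 48u << 10);  // 48 KB
    }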

Refitting and Weight Management

  • REFIT_IDENTICAL Flag: The new REFIT_IDENTICAL flag instructs the TensorRT builder to optimize under the assumption that the engine will be refitted with weights identical to those provided at build time. Using this flag with kSTRIP_PLAN minimizes plan size in deployment scenarios where, for example, the plan is shipped alongside an ONNX model containing the weights (see the sketch after this list).

  • QAT Transformer Support: QAT transformer networks now work with refit, improving flexibility for quantization-aware training workflows.
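
A minimal sketch of the build-then-refit flow under these flags; the weight name is hypothetical, and in the deployment scenario above the refit weights would come from the accompanying ONNX model.

    // Minimal sketch: build a weight-stripped plan assuming identical refit weights,
    // then refit at deployment time.
    #include <memory>
    #include "NvInfer.h"

    nvinfer1::IHostMemory* buildStrippedPlan(nvinfer1::IBuilder& builder,
                                             nvinfer1::INetworkDefinition& network)
    {
        auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder.createBuilderConfig());
        config->setFlag(nvinfer1::BuilderFlag::kSTRIP_PLAN);
        config->setFlag(nvinfer1::BuilderFlag::kREFIT_IDENTICAL);  // refit only with build-time weights
        return builder.buildSerializedNetwork(network, *config);
    }

    void refitWithOriginalWeights(nvinfer1::ICudaEngine& engine, nvinfer1::ILogger& logger,
                                  nvinfer1::Weights const& convWeights)
    {
        auto refitter = std::unique_ptr<nvinfer1::IRefitter>(nvinfer1::createInferRefitter(engine, logger));
        refitter->setNamedWeights("conv1.weight", convWeights);  // hypothetical weight name
        refitter->refitCudaEngine();
    }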

Debugging and Inspection

  • Debug Tensors: Added an API to mark tensors as debug tensors at build time. At runtime, each time the value of such a tensor is written, a user-defined callback function is invoked with the value, type, and dimensions, enabling detailed debugging of inference execution (a listener sketch follows the note below).

    Note

    Indexing for layer information (--dumpLayer) and profiling information (--dumpProfile) will be added in the GA release.
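
A sketch of what a debug-tensor listener could look like; the callback signature shown here is an assumption based on the description above, and the marked tensor is hypothetical.

    // Sketch only: listener that logs every write to a marked debug tensor.
    #include <iostream>
    #include "NvInfer.h"

    class PrintingDebugListener : public nvinfer1::IDebugListener
    {
    public:
        bool processDebugTensor(void const* addr, nvinfer1::TensorLocation location,
                                nvinfer1::DataType type, nvinfer1::Dims const& shape,
                                char const* name, cudaStream_t stream) override
        {
            // Invoked each time the value of a marked debug tensor is written.
            std::cout << "debug tensor " << name << " written, rank " << shape.nbDims << std::endl;
            return true;
        }
    };

    // Build time:  network.markDebug(*someTensor);
    // Run time:    context.setDebugListener(&listener);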

ONNX Parser Improvements

  • Enhanced Error Reporting: The ONNX parser returns the list of all nodes that can be statically determined as unsupported when the call to parse() fails. The error reporting contains node name, node type, reason for failure, and the local function stack if the node is located in an ONNX local function. The number of errors can be queried with getNbErrors(), and information about individual errors can be obtained from getError().
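
For example, a failed parse can be reported along these lines (only the getNbErrors, getError, desc, and node accessors are used here):

    // Minimal sketch: enumerate all statically detected unsupported nodes after a failed parse.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include "NvOnnxParser.h"

    bool parseOrReport(nvonnxparser::IParser& parser, void const* onnxData, std::size_t size)
    {
        if (parser.parse(onnxData, size))
        {
            return true;
        }
        for (int32_t i = 0; i < parser.getNbErrors(); ++i)
        {
            auto const* err = parser.getError(i);
            std::cerr << "unsupported node index " << err->node() << ": " << err->desc() << "\n";
        }
        return false;
    }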

Developer Tools

Samples and Tools

  • Weight Stripping Sample: Added a new Python sample sample_weight_stripping to showcase building and refitting weight-stripped engines from ONNX models.

Packaging Improvements

  • Simplified Python Installation: The tensorrt Debian and RPM meta-packages now install the TensorRT Python binding packages (python3-libnvinfer, python3-libnvinfer-lean, and python3-libnvinfer-dispatch) automatically. Previously, installing the python3-libnvinfer-dev(el) package was required as well to support both C++ and Python.

Bug Fixes and Performance

  • Windows Plugin Library: Fixed an issue where the nvinfer_plugin.lib library was incorrectly distributed as a static linking library starting with TensorRT 9.0. TensorRT 10.0 reverts this library to a dynamic linking library matching the behavior of TensorRT 8.6.

  • BERT Performance: Fixed an up to 9% performance drop for BERT networks with gelu_erf activation in BF16 precision on NVIDIA Ampere GPUs.

  • ViT Performance: Fixed an up to 11% performance drop for ViT networks in TF32 precision on NVIDIA Ampere GPUs.

  • Temporal Fusion Transformers: Fixed an up to 23% performance regression for Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and Ampere GPUs.

  • Builder Optimization: Fixed an issue where a higher builder optimization level did not always give better performance when compared to a lower builder optimization level (up to 27% performance difference).

  • ONNX Range Operator: Fixed an issue where engine building would likely fail if an ONNX model contained a Range operator with a data-dependent limit input.

Breaking API Changes#

Attention

  • TensorRT 10.0 GA will break ABI compatibility relative to TensorRT 10.0 EA on Windows by adding the TensorRT major version to the DLL filename. TensorRT 10.0 EA and prior TensorRT releases have historically named the DLL file nvinfer.dll, while 10.0 GA will rename the DLL file to nvinfer_10.dll. This same naming pattern will also apply to the other TensorRT DLL files in the zip package. We strive not to break backward compatibility between releases with the same major version, but this change will allow applications to link against different TensorRT major versions at the same time.

  • In TensorRT 9.0, due to the introduction of INT64 as a supported data type, ONNX models with INT64 I/O require INT64 bindings. Prior to this release, such models required INT32 bindings.

  • In TensorRT 9.0, ICaffeParser, IUffParser, and related classes and functions were removed. The following APIs are removed:

    • nvcaffeparser1::IBlobNameToTensor

    • nvcaffeparser1::IBinaryProtoBlob

    • nvcaffeparser1::IPluginFactoryV2

    • nvcaffeparser1::ICaffeParser

    • nvcaffeparser1::createCaffeParser

    • nvcaffeparser1::shutdownProtobufLibrary

    • createNvCaffeParser_INTERNAL

    • nvinfer1::utils::reshapeWeights

    • nvinfer1::utils::reorderSubBuffers

    • nvinfer1::utils::transposeSubBuffers

    • nvuffparser::UffInputOrder

    • nvuffparser::FieldType

    • nvuffparser::FieldMap

    • nvuffparser::FieldCollection

    • nvuffparser::IUffParser

    • nvuffparser::createUffParser

    • nvuffparser::shutdownProtobufLibrary

    • createNvUffParser_INTERNAL

  • With the removal of ICaffeParser and IUffParser, the libnvparsers library is removed.

  • The uff, graphsurgeon, and related packages are removed from TensorRT packages.

  • TacticSource::kCUDNN and TacticSource::kCUBLAS are disabled by default. The cudnnContext* and cublasContext* parameters of the nvinfer1::IPluginV2Ext::attachToContext function are set to nullptr when the corresponding TacticSource flags are unset.

  • IPluginCreatorInterface has been added as a base class to IPluginCreator.

  • Overloads have been added to the methods IPluginRegistry::deregisterCreator and IPluginRegistry::registerCreator that take in IPluginCreatorInterface references.

Compatibility#

Limitations#

  • There are two modes of DLA softmax; the mode is chosen automatically based on the shape of the input tensor:

    • the first mode triggers when all nonbatch, non-axis dimensions are 1, and

    • the second mode triggers in other cases if valid. This mode is supported only on DLA 3.9.0 and later. It involves approximations that may result in small errors. Also, batch sizes greater than 1 are supported only for DLA 3.9.0 and later.

  • On QNX, networks that are segmented into a large number of DLA loadables may fail during inference.

  • The DLA compiler can remove identity transposes but cannot fuse multiple adjacent transpose layers into a single transpose layer (likewise for reshaping). For example, given a TensorRT IShuffleLayer consisting of two non-trivial transposes and an identity reshape in between, the shuffle layer is translated into two consecutive DLA transpose layers unless the user merges the transposes manually in the model definition in advance.

  • In explicitly quantized networks, a group convolution that has a Q/DQ pair before but no Q/DQ pair after is expected to run with INT8-IN-FP32-OUT mixed precision. However, on NVIDIA Hopper it may fall back to FP32-IN-FP32-OUT if the input channel count is small.

  • nvinfer1::UnaryOperation::kROUND or nvinfer1::UnaryOperation::kSIGN operations of IUnaryLayer are not supported in the implicit batch mode.

  • For networks containing normalization layers, particularly if deploying with mixed precision, target the latest ONNX opset that contains the corresponding function ops, such as opset 17 for LayerNormalization or opset 18 for GroupNormalization. Numerical accuracy using function ops is superior to the corresponding implementation with primitive ops for normalization layers.

  • QuantizeLayer and DequantizeLayer only support FP32 scale and data, even when using ONNX opset 19. If the input is not FP32, you must add a Cast to FP32 on the input to QuantizeLayer and a Cast from FP32 at the output of DequantizeLayer.

  • EngineInspector::getLayerInformation may return incomplete JSON data for some engines produced by TensorRT 9.0. When this happens, TensorRT Engine Explorer cannot be used to analyze the engine or generate a graph of the engine layers.

  • The kREFIT and kREFIT_IDENTICAL flags have performance regressions compared with non-refittable engines when convolution layers are present within a branch or loop and the precision is FP16/INT8.

  • The new kTACTIC_SHARED_MEMORY flag cannot restrict shared memory usage for depthwise convolution, depthwise separable convolution, and certain corner-case fused convolution-activation kernels. You need to run Nsight to verify the shared memory usage of the resulting engine.

  • Shape tensor inputs will not be added to TensorRT plugins implementing IPluginV3 by the TensorRT ONNX parser. All inputs will be passed as regular device inputs. This is in contrast to the addPluginV3 API which allows the specification of shape tensor inputs to be passed to the plugin.

  • Weight streaming currently does not work with CUDA Graph.

  • Multiple contexts for one engine with Weight Streaming enabled cannot run in parallel on devices and will be serialized automatically.

  • Weight streaming mainly supports GEMM-based networks like Transformers for now. Convolution-based networks may have only a few weights that can be streamed.

Deprecated API Lifetime#

  • APIs deprecated in TensorRT 10.0 will be retained until March 2025.

Deprecated and Removed Features#

  • For a complete list of deprecated C++ APIs, refer to the C++ API Deprecated List.

  • Deprecated the kWEIGHTLESS builder flag. Superseded by the kSTRIP_PLAN builder flag. kSTRIP_PLAN works with either the kREFIT flag or the new kREFIT_IDENTICAL flag, defaulting to the latter if neither is set.

  • In 10.0, we removed APIs that were deprecated in 9.3 and earlier releases; all of the removed APIs were deprecated before March 2023. Version compatibility is expected between the 8.6, 9.x, and 10.0 versions, except that version compatibility is not supported for implicit batch mode, which was removed in 10.0. If you are unfamiliar with these changes, refer to our sample code for clarification. In light of the API changes in TensorRT 10.0, we have prepared an API Migration Guide to highlight the modifications.

  • Implicit batch support has been removed; networks are always treated as having an explicit batch dimension.

  • Deprecated TacticSource::kCUDNN and TacticSource::kCUBLAS flags.

  • Deprecated IPluginV2DynamicExt. Use IPluginV3 instead.

  • IPluginCreator::getTensorRTVersion() has been removed.

  • Deprecated IPluginV2IOExt. Use IPluginV3 instead.

  • Deprecated IPluginCreator. There is no alternative factory class for the IPluginV2-derivative plugin base classes, as they are all deprecated as well. Implement IPluginV3 and its corresponding factory class IPluginCreatorV3One.

  • Deprecated the following APIs in IPluginRegistry:

    • IPluginRegistry::registerCreator(IPluginCreator&). Use its overload IPluginRegistry::registerCreator(IPluginCreatorInterface&) instead.

    • IPluginRegistry::deregisterCreator(IPluginCreator const&). Use its overload IPluginRegistry::deregisterCreator(IPluginCreatorInterface const&) instead.

    • IPluginRegistry::getPluginCreator. Use IPluginRegistry::getCreator instead.

    • IPluginRegistry::getPluginCreatorList. Use IPluginRegistry::getAllCreators instead.
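
For example, migrating a registration call to the non-deprecated overload might look like the following sketch (the empty namespace string is illustrative):

    // Minimal sketch: register a V3 creator through the IPluginCreatorInterface overload.
    #include "NvInfer.h"

    void registerV3Creator(nvinfer1::IPluginCreatorV3One& creator)
    {
        // IPluginCreatorV3One derives from IPluginCreatorInterface, so the new overload applies.
        getPluginRegistry()->registerCreator(creator, "");
    }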

Fixed Issues#

  • The nvinfer_plugin.lib library within the Windows package was incorrectly distributed as a static linking library starting with TensorRT 9.0. TensorRT 10.0 reverts this library to a dynamic linking library matching the behavior of TensorRT 8.6.

  • There was an up to 9% performance drop for BERT networks with gelu_erf activation in BF16 precision compared to TensorRT 9.1 on NVIDIA Ampere GPUs.

  • There was an up to 11% performance drop for ViT networks in TF32 precision compared to TensorRT 9.0 on NVIDIA Ampere GPUs.

  • There was an up to 23% performance regression compared to TensorRT 8.5 on Temporal Fusion Transformers in FP32 precision on NVIDIA Turing and NVIDIA Ampere GPUs.

  • A higher builder optimization level did not always give better performance than a lower builder optimization level; this could happen on all platforms, with up to a 27% performance difference. The workaround was to build the engine using a lower builder optimization level.

  • If an ONNX model contained a Range operator and its limit input was a data-dependent tensor, engine building would likely fail.

Known Issues#

Functional

  • Indexing for layer information (--dumpLayer) and its profiling information (--dumpProfile) will be added in the GA release. Currently, you may see duplicate layer names if the layer consists of identical components.

  • CUDA compute sanitizer may report racecheck hazards for some legacy kernels. However, related kernels do not have functional issues at runtime.

  • There is a known issue where the compute sanitizer in CUDA Toolkit 12.3 might cause the target application to crash.

  • The compute sanitizer initcheck tool may flag false positive Uninitialized __global__ memory read errors when running TensorRT applications on NVIDIA Hopper GPUs. These errors can be safely ignored and will be fixed in an upcoming CUDA release.

  • Multi-Head Attention fusion might not occur when the number of heads is small, which can affect performance.

  • Hardware forward compatibility (HFC) is broken on L4T Concord for ViT, Swin-Transformers, and BERT networks in FP16 mode. A workaround is to only use FP32 mode on L4T Concord or turn off HFC.

  • Compute Sanitizer from CUDA Toolkit 12.0/12.1 may report a false alarm about invalid memory access in generatedNativePointwise kernels.

  • If a network has a tensor of type bool with an implicitly data-dependent shape, engine building will likely fail.

  • An occurrence of use-after-free in NVRTC has been fixed in CUDA 12.1. When using NVRTC from CUDA 12.0 together with the TensorRT static library, you may encounter a crash in certain scenarios. Linking the NVRTC and PTXJIT compiler from CUDA 12.1 or newer will resolve this issue.

  • Although the version compatible runtime is optimized for efficiency, it may result in slower performance than the full runtime in certain use cases. Most networks can expect no more than a 10% slowdown when using a version-compatible engine compared to a version-locked engine. However, in some cases, a larger performance drop may occur. For example:

    • When running ResNet50_v2 with QAT, there may be up to an 11% decrease in performance.

    • When running DynUNet in FP16 precision, there may be up to a 32% decrease in performance.

  • There are known issues reported by the Valgrind memory leak check tool when detecting potential memory leaks from TensorRT applications. The recommendation to suppress the issues is to provide a Valgrind suppression file with the following contents when running the Valgrind memory leak check tool. Add the option --keep-debuginfo=yes to the Valgrind command line to suppress these errors.

    {
        Memory leak errors with dlopen.
        Memcheck:Leak
        match-leak-kinds: definite
        ...
        fun:*dlopen*
        ...
    }
    {
        Memory leak errors with nvrtc
        Memcheck:Leak
        match-leak-kinds: definite
        fun:malloc
        obj:*libnvrtc.so*
        ...
    }
    
  • SM 7.5 and earlier devices may not have INT8 implementations for all layers with Q/DQ nodes. In this case, you will encounter a "could not find any implementation" error while building your engine. To resolve this, remove the Q/DQ nodes that quantize the failing layers.

  • Installing the cuda-compat-11-4 package may interfere with CUDA-enhanced compatibility and cause TensorRT to fail even when the driver is r465. The workaround is to remove the cuda-compat-11-4 package or upgrade the driver to r470.

  • For some networks, using a batch size of 4096 may cause accuracy degradation on DLA.

  • Hardware compatible engines built with CUDA versions older than 11.5 may crash during inference when run on a GPU with a compute capability lower than that of the GPU where the engine was built. A workaround is to build an engine on the GPU with the lowest compute capability.

  • For broadcasting elementwise layers running on DLA with GPU fallback enabled with one NxCxHxW input and one Nx1x1x1 input, there is a known accuracy issue if at least one of the inputs is consumed in kDLA_LINEAR format. It is recommended to explicitly set the input formats of such elementwise layers to different tensor formats.

  • The ONNX Parser Refitter cannot refit weights defined in nested ONNX structures such as If, Loop, or Scan operations. In these cases it’s recommended to perform the refit directly through the TensorRT APIs.

  • BERT-like networks with QAT may not build engines successfully with refit on.

  • The OnnxParserRefitter Python API documentation is missing. Refer to Refitting a Weight-Stripped Engine Directly from ONNX on how to use this class in Python.

  • Exclusive padding with kAVERAGE pooling is not supported.

  • If the _gemm_mha_v2 operation is used, the outputs will not match the output of PyTorch or the CPU executor. This problem may show up only when building engines with FP16 precision, as _gemm_mha_v2 has an implementation only for FP16.

  • The layer names reported by IEngineInspector may not match the layer names reported by IProfiler.

  • TensorRT does not clean up temporary DLL files automatically on Windows when running in version compatible (--vc) mode.

  • TensorRT may crash when building transformer based networks on Windows 10 and H100.

  • Running sync/race check with newer Compute Sanitizer on L4T may hit a hang issue. The workaround is to try an older version of Compute Sanitizer.

  • There is an accuracy drop running DINO-FAN-base models compared to TensorRT 8.6.1.6.

  • The Valgrind tool found a memory leak on L4T with CUDA 12.4 due to a known driver issue. This is expected to be fixed in CUDA 12.6.

  • Some networks fail at the engine building phase on Windows + H100, but can execute on Linux. The root cause is a builder issue, where fusion compilation fails.

  • While running some networks on Windows in version compatible mode (--vc), you may see an error: Unable to remove temporary DLL file when an application, such as trtexec, is finished. The TensorRT library internally is still holding open file references when the application finishes. This issue does not have an impact on model performance and it should be fixed in the next release.

Performance

  • Using FP16 scales for Q/DQ ops may result in numerical overflow. If this happens, use FP32 scales for Q/DQ ops instead.

  • There is an up to 15% performance regression for SegResNet and StableDiffusion VAE in FP16 precision compared to TensorRT 9.3.

  • There is an up to 16% performance regression for BasicUNet, DynUNet, and HighResNet in INT8 and FP16 precision compared to TensorRT 9.3.

  • There is an up to 144 MB peak GPU memory usage increase compared to TensorRT 8.6 when building engines for ResNet-50 in INT8 precision on the L4T Orin platform.

  • There is a performance drop for the QDQ-GEMM pattern on RTX Titan in weightless mode.

  • There are performance gaps for StableDiffusion networks between Windows and Linux platforms.

  • UNet models with tensors containing >2^31 elements may fail during the engine building step.

  • On A30, some fused MHA (Multi-Head Attention) performance is not optimized yet. This will be improved upon in future TensorRT versions.

  • There is up to a 100% engine size increase for Transformer networks on Windows in FP16 precision.

  • Enabling refit breaks multihead attention fusions.

  • Running TensorRT-LLM with TensorRT 10.0 and an INT8 KV cache results in an engine build failure due to insufficient custom scales. The workaround is to enable strongly typed mode.

  • Up to 40-second increase in engine building for BART networks on NVIDIA Hopper GPUs.

  • There is an accuracy drop running OSS HuggingFace Demo gptj-6b model when batch size > 1.

  • There are up to 14% context memory usage fluctuations compared to TensorRT 9.1 when building the engine for 3DUnet networks due to different tactics being selected.

  • Up to 20-second increase in engine building for some large language models (LLMs) on NVIDIA Ampere GPUs.

  • There are up to 21% peak GPU memory usage fluctuations when building the engine for the same network back to back due to different tactics being selected.

  • Up to 2.5x build time increase compared to TensorRT 9.0 for certain Bert-like models due to additional tactics available for evaluation.

  • Up to 13% performance drop for the CortanaASR model on NVIDIA Ampere GPUs compared to TensorRT 8.5.

  • Up to 18% performance drop for the ShuffleNet model on A30/A40 compared to TensorRT 8.5.1.

  • Convolution on a tensor with an implicitly data-dependent shape may run significantly slower than on other tensors of the same size. Refer to the Glossary for the definition of implicitly data-dependent shapes.

  • For some Transformer models, including ViT, Swin-Transformer, and DETR, there is a performance drop in INT8 precision (including both explicit and implicit quantization) compared to FP16 precision.

  • There is a known issue on H100 that may lead to GPU hang when running TensorRT with high persistentCache usage. Limit the usage to 40% of L2 cache size as a workaround.

  • There is a known performance issue when running instance normalization layers on Arm Server Base System Architecture (SBSA).

  • There is a performance drop when offloading a SoftMax layer to DLA on NVIDIA Orin as compared to when running the layer on a GPU, with a larger drop for larger batch sizes. As an example, FP16 AlexNet with batch size 16 shows a 32% drop when the network runs on DLA as compared to when the last SoftMax layer runs on a GPU.

  • There is a known issue with DLA clocks that requires users to reboot the system after changing the nvpmodel power mode or otherwise experience a performance drop. Refer to the L4T board support package Release Notes for details.

  • Up to 5% performance drop for networks using sparsity in FP16 precision.

  • H100 performance for some LSTMs in FP16 precision is not fully optimized. This will be improved in future TensorRT versions.

  • Up to 6% performance regression compared to TensorRT 8.5 on OpenRoadNet in FP16 precision on NVIDIA A10 GPUs.

  • Up to 70% performance regression compared to TensorRT 8.6 on BERT networks in INT8 precision with FP16 disabled on L4 GPUs. Enable FP16 and disable INT8 in the builder config to work around this.