TensorRT-RTX 1.2 Release Notes

These release notes describe NVIDIA TensorRT-RTX 1.2 and apply to x86-64 Linux and Windows users. Use them to identify the supported software components and versions in this release, review compatibility and support information, and locate the per-component release notes for new features, fixed issues, and known limitations.

Release Highlights

  • CUDA Graphs Support — Built-in CUDA Graphs with automatic dynamic shape support, enabling a one-line change to accelerate inference workflows by reducing GPU kernel launch overhead.

  • User Memory Allocation — New kREQUIRE_USER_ALLOCATION builder flag and IExecutionContext::isStreamCapturable() API for CUDA stream capture workflows.

  • CUDA 13.0 Support — Compatible with NVIDIA CUDA 13.0 Toolkit.

  • Library Reorganization — DLL libraries moved from the lib subdirectory to the bin subdirectory.

Announcements

This release includes support for the CUDA 13.0 Toolkit, available on both Linux and Windows. Users should download the CUDA 12.9 and CUDA 13.0 TensorRT-RTX builds separately as needed.

Key Features and Enhancements

CUDA Graphs Support

This release introduces built-in CUDA Graphs integration:

  • Built-in CUDA Graphs with automatic dynamic shape support are now available and can be enabled with a one-line change to existing workflows. CUDA Graphs reduce GPU kernel launch overhead at runtime, which accelerates inference.

Memory Management

This release adds APIs for application-managed memory and stream-capture compatibility:

  • A new kREQUIRE_USER_ALLOCATION builder flag requires engines to use application-provided memory where possible, rather than runtime-allocated memory. Setting this flag is required when using CUDA stream capture. However, stream capture is not possible for all models, particularly those that use data-dependent dynamic shapes or certain on-device control flows. A new IExecutionContext::isStreamCapturable() API reports whether stream capture is possible in the current execution context.
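
Because stream capture is not possible for every model, an application will typically want a fallback path. The following is a minimal sketch of that control flow only, not of the TensorRT-RTX API itself. The method name is_stream_capturable is a hypothetical Python-style spelling of IExecutionContext::isStreamCapturable(), the stream argument is an assumption about its signature, and FakeContext is a stand-in so the logic can run without a GPU.

```python
from typing import Any, Callable


def run_inference(context: Any, stream: Any,
                  captured_path: Callable[[], str],
                  eager_path: Callable[[], str]) -> str:
    """Take the stream-capture path only when the context reports it is possible."""
    # ASSUMPTION: `is_stream_capturable` is a hypothetical spelling of
    # IExecutionContext::isStreamCapturable(); consult the API reference
    # for the real name and signature.
    if context.is_stream_capturable(stream):
        return captured_path()
    # Data-dependent shapes or on-device control flow: run without capture.
    return eager_path()


class FakeContext:
    """Stand-in execution context used only to exercise the control flow."""

    def __init__(self, capturable: bool) -> None:
        self._capturable = capturable

    def is_stream_capturable(self, stream: Any) -> bool:
        return self._capturable
```

In a real application, captured_path would wrap the CUDA stream capture and graph launch, and eager_path would enqueue inference directly on the stream.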

Distribution

This release reorganizes the library layout:

  • The DLL libraries have moved from the lib subdirectory to the bin subdirectory.
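
Applications that load the DLLs from Python may need to adjust their search path after this move. The sketch below is one possible approach, not an official recipe: it looks for DLLs under bin first and falls back to lib for older installations. The function names and the install_root parameter are hypothetical; os.add_dll_directory is the standard Python call for extending the DLL search path on Windows.

```python
import os
from pathlib import Path
from typing import Optional


def find_dll_dir(install_root: Path) -> Optional[Path]:
    """Return the first subdirectory that contains DLLs, preferring bin over lib."""
    for name in ("bin", "lib"):  # 1.2 layout first, older layout as a fallback
        candidate = install_root / name
        if candidate.is_dir() and any(candidate.glob("*.dll")):
            return candidate
    return None


def register_dll_dir(install_root: Path) -> Optional[Path]:
    """Add the detected DLL directory to the search path when running on Windows."""
    dll_dir = find_dll_dir(install_root)
    # os.add_dll_directory exists only on Windows and needs an absolute path.
    if dll_dir is not None and hasattr(os, "add_dll_directory"):
        os.add_dll_directory(str(dll_dir.resolve()))
    return dll_dir
```

Calling register_dll_dir with the TensorRT-RTX installation directory before importing any module that depends on the DLLs should cover both the old and the new layout.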

Compatibility

For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT-RTX Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT-RTX 1.2.

Limitations

Warning

  • Timing cache deprecated: Using ITimingCache forces a GPU query during AOT compilation and has no effect on the built engine. The APIs are deprecated in this release and will be removed in a future update.

  • Engines are not forward-compatible: TensorRT-RTX engines must be rebuilt when upgrading the TensorRT-RTX runtime library.

  • Using a timing cache through ITimingCache and related APIs forces the ahead-of-time (AOT) compilation step to query your system for a GPU, which prevents CPU-only AOT compilation. This has been true since the initial release (version 1.0). The timing cache has no effect on the built engine in TensorRT-RTX. The timing cache APIs are deprecated in 1.2 and will be removed in a future update; they remain in place for now to avoid breaking source or binary compatibility. Applications should stop using timing cache APIs in preparation for their removal.

  • If the cache file grows too large (larger than 100 MB), serializing it to disk and deserializing it from disk may incur more overhead. If this negatively affects performance, delete the cache file and create a new one.

  • TensorRT-RTX engines are not forward-compatible with other versions of the TensorRT-RTX runtime. Ensure that any TensorRT-RTX engine you produce is run with the runtime from the same version of TensorRT-RTX that was used to generate it.

  • TensorRT-RTX supports Turing (CUDA compute capability 7.5), but an engine built with the default compute capability settings supports only Ampere and later GPUs, and therefore excludes Turing. Build a separate engine specifically for Turing to achieve the best performance. A single engine that supports both Turing and later GPUs will run less efficiently on Ampere and later GPUs because of technical limitations of the engine format.
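
The version and Turing limitations above suggest that an application shipping several engines needs to pick the right one at load time. The following is a minimal sketch under stated assumptions: the EngineRecord type, its fields, and the idea of the application recording the TensorRT-RTX version next to each engine are hypothetical bookkeeping, not part of TensorRT-RTX. The compute capability values are 7.5 for Turing, as stated above, and 8.0 as the lowest Ampere value.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

TURING_CC = (7, 5)      # compute capability of Turing
AMPERE_MIN_CC = (8, 0)  # lowest compute capability of Ampere


@dataclass(frozen=True)
class EngineRecord:
    """Application-side metadata stored next to an engine file (hypothetical)."""
    path: str
    built_with: str    # TensorRT-RTX version that produced the engine
    for_turing: bool   # True for the separate Turing build


def choose_engine(records: Iterable[EngineRecord],
                  runtime_version: str,
                  compute_capability: Tuple[int, int]) -> Optional[str]:
    """Return a usable engine path, or None if the engine must be rebuilt."""
    wants_turing = compute_capability == TURING_CC
    if not wants_turing and compute_capability < AMPERE_MIN_CC:
        return None  # neither Turing nor Ampere-or-later
    for record in records:
        # Engines are only valid with the runtime version that built them.
        if record.built_with == runtime_version and record.for_turing == wants_turing:
            return record.path
    return None
```

A None result is the signal to rebuild the engine with the installed runtime rather than attempt to load a mismatched file.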

Deprecated API Lifetime

  • APIs deprecated in TensorRT-RTX 1.2 will be retained until 10/2026.

  • APIs deprecated in TensorRT-RTX 1.1 will be retained until 8/2026.

  • APIs deprecated in TensorRT-RTX 1.0 will be retained until 6/2026.

Deprecated and Removed Features

The following features have been deprecated or removed in TensorRT-RTX 1.2:

  • ITimingCache APIs are deprecated in this release. Refer to the Limitations section for more information.

Fixed Issues

  • Fixed an issue where building weight-stripped engines would fail for ONNX models containing quantization and dequantization (Q/DQ) nodes.

  • The tensorrt_rtx executable on Windows now has long-path support enabled natively.

  • Reduced build time for the weight-stripped AOT engine used for LLMs.

  • Improved performance of repeated inference for models with dynamic shapes via smarter caching mechanisms.

  • Substantially reduced VRAM usage for models where the order of execution affects tensor liveness analysis.

  • Fixed a performance issue in the Runtime Cache where deserialized runtime caches leaked GPU memory until program teardown. Users can now free the runtime cache’s memory at any time.

  • Reduced build time for the weight-stripped AOT engine used for diffusion models.

Known Issues

Functional

  • NonMaxSuppression, NonZero, and Multinomial layers are not supported.

  • On Windows, only the WDDM driver is supported. The TCC driver for Windows (refer to Tesla Compute Cluster (TCC)) is unsupported and may fail with the following error:

    [E] Error[1]: [defaultAllocator.cpp::nvinfer1::internal::DefaultAllocator::allocateAsync::48] Error Code 1: Cuda Runtime (operation not supported)
    

    For instructions on changing the driver mode, refer to the Nsight Visual Studio Edition documentation.

  • When using TensorRT-RTX with the PyCUDA library in Python, use import pycuda.autoprimaryctx instead of import pycuda.autoinit in order to avoid device conflicts.

  • Depthwise convolutions and deconvolutions for BF16 precision are not supported.

  • Convolutions and deconvolutions that combine non-unit strides with non-unit dilations are not supported in any precision. Convolutions and deconvolutions that use only non-unit strides or only non-unit dilations are supported.

  • On Windows, the following symbols in tensorrt_rtx_1_2.dll should not be used and will be removed in the future:

    • ?disableInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • ?enableInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • ?getInternalBuildFlags@nvinfer1@@YA_KAEBVINetworkDefinition@1@@Z

    • ?setDebugOutput@nvinfer1@@YAXAEAVIExecutionContext@1@PEAVIDebugOutput@1@@Z

    • ?setInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • nvinfer1DisableInternalBuildFlags

    • nvinfer1EnableInternalBuildFlags

    • nvinfer1GetInternalBuildFlags

    • nvinfer1SetInternalBuildFlags

Performance

  • A subtle data race can occasionally impede CUDA graph capture for dynamic-shape workloads under default settings. TensorRT-RTX remains functional when the data race occurs, but performance may be degraded compared to the best CUDA-graph-enabled performance. This will be fixed in the next release. As a short-term mitigation, execute inference with eager dynamic-shapes kernel specialization.

  • Use of the CPU-only Ahead-of-Time (AOT) feature can lead to reduced performance for some models, particularly those with multi-head attention (MHA), because CPU-only AOT uses conservative shared memory limits. Affected applications will achieve the best performance if they instead perform AOT compilation on-device, targeted to the specific end-user machine. To do this, pass the --useGPU flag to the tensorrt_rtx binary or, when using the APIs, set the compute capabilities to contain only kCURRENT using IBuilderConfig::setComputeCapability(). Measure performance with both approaches to determine which is best for your application. We plan to resolve this performance discrepancy in a future release.

  • We have prioritized optimizing the performance for 16-bit floating point types, and such models will frequently achieve throughput using TensorRT-RTX that is very close to that achieved with TensorRT. Models that heavily use 32-bit floating point types will still see improvement, but performance will tend to not be as strong as that achieved using TensorRT. Expect performance across many models and data types to improve in future versions of TensorRT-RTX.

  • Background kernel compilations, triggered as part of the dynamic shapes specialization strategy, are opportunistic and are not guaranteed to complete before network execution finishes. If your dynamic-shapes workload contains a fixed set of shapes, consider using the eager specialization strategy along with the runtime cache to load and store kernels quickly for best performance.

  • When running in CUDA in Graphics (CiG) mode, some models show significantly reduced performance compared to non-CiG mode, because the kernels that fit within the CiG shared memory limitations are suboptimal.

  • Convolutions and deconvolutions with large filter sizes may have degraded performance. We plan to improve performance in such cases in a future release.