TensorRT-RTX 1.4 Release Notes

These release notes describe NVIDIA TensorRT-RTX 1.4 and apply to x86-64 Linux and Windows users. Use them to identify the supported software components and versions in this release, review compatibility and support information, and locate the per-component release notes for new features, fixed issues, and known limitations.

Release Highlights

  • CUDA 13.2 Support — Compatible with NVIDIA CUDA 13.2 Toolkit, with continued support for CUDA 12.9 Update 1.

  • GPU Latency Optimizations — Optimized 1D convolution and GEMV kernels, a new Windows backend for batch-size-1 convolutions, improved JIT compilation heuristics and speed, and enhanced multi-head attention (MHA) performance.

  • PyPI Availability — Install TensorRT-RTX Python bindings and shared library with a single pip install tensorrt-rtx command.

  • API Capture and Replay — New debugging feature that records TensorRT-RTX API calls during engine building and replays them for issue reproduction without requiring the original application or model source code (Linux only).

  • Compute-in-Graphics (CiG) Improvements — Resolved performance issues and segmentation faults on Blackwell GPUs, and improved multi-head attention kernel shared memory handling.

Key Features and Enhancements

CUDA 13.2 Support

This release adds support for an additional CUDA toolkit version:

  • TensorRT-RTX now supports NVIDIA CUDA Toolkit 13.2.

Distribution

This release expands installation options:

  • TensorRT-RTX is now available on PyPI, enabling Python users to install the TensorRT-RTX Python bindings and shared library with a single pip install tensorrt-rtx command. See PyPI Installation for details.
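After running the pip command above, a quick way to confirm that the package landed in the active environment is to query the installed distribution metadata. The following is a minimal sketch using only the Python standard library; the distribution name tensorrt-rtx is taken from the install command, and the helper name installed_version is purely illustrative.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "tensorrt-rtx") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("tensorrt-rtx is not installed; run: pip install tensorrt-rtx")
else:
    print(f"tensorrt-rtx {version} is installed")
```

This check only inspects package metadata, so it works on machines without a GPU and does not load the shared library.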

GPU Latency Optimizations

This release improves GPU latency with the following optimizations:

  • 1D convolution kernel performance has been improved.

  • GEMV kernel performance has been improved.

  • A new backend for convolution operators with a batch size of 1 is available on Windows.

  • JIT compilation heuristics for optimized kernel selection have been improved.

  • Kernel fusion coverage and graph optimization have been expanded.

  • JIT compilation time has been reduced.

  • Multi-head attention (MHA) performance has been improved when the CPU-only Ahead-of-Time feature is active.

Developer Tools

This release adds new developer tooling:

  • A new API Capture and Replay feature records TensorRT-RTX API calls during engine building and replays them for debugging and issue reproduction, without requiring the original application or model source code. This feature is currently restricted to Linux.

Compatibility

For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT-RTX Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT-RTX 1.4.

Limitations

Warning

  • Timing cache deprecated: Using ITimingCache forces a GPU query during AOT compilation and has no effect on the built engine. The APIs were deprecated in 1.2 and will be removed in a future update.

  • Engines are not forward-compatible: TensorRT-RTX engines must be rebuilt when upgrading the TensorRT-RTX runtime library.

  • Using a timing cache via ITimingCache and related APIs forces the ahead-of-time (AOT) compilation step to query your system for a GPU, which prevents CPU-only AOT compilation. This has been the case since the initial release (version 1.0). The timing cache has no effect on the built engine in TensorRT-RTX. The timing cache APIs were deprecated in 1.2 and will be removed in a future update; they remain in place for now to avoid breaking source or binary compatibility. Applications should stop using them in preparation for their removal.

  • If the cache file grows too large (larger than 100 MB), serializing it to disk and deserializing it from disk may incur additional overhead. If this negatively affects performance, delete the cache file and create a new one.

  • TensorRT-RTX engines are not forward-compatible with other versions of the TensorRT-RTX runtime. Run every TensorRT-RTX engine with the runtime from the same TensorRT-RTX version that was used to generate it.

  • TensorRT-RTX supports Turing (CUDA compute capability 7.5), but building with the default compute capability settings produces an engine that supports Ampere and later GPUs only, which excludes Turing. For the best performance, build a separate engine specifically for Turing. A single engine that supports both Turing and later GPUs will run inference more slowly on Ampere and later GPUs because of technical limitations of the engine format.
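Two of the limitations above lend themselves to small application-side helpers: resetting an oversized cache file, and keeping engines separated by runtime version and by Turing versus Ampere-and-later targets. The sketch below is one possible approach, not part of TensorRT-RTX. The function names, the file naming scheme, and the reading of "100 MB" as 100 * 1024 * 1024 bytes are all assumptions made for illustration.

```python
import os

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
CACHE_SIZE_LIMIT_BYTES = 100 * 1024 * 1024


def reset_cache_if_oversized(cache_path, limit=CACHE_SIZE_LIMIT_BYTES):
    """Delete the cache file if it exceeds the limit; return True if deleted.

    The application is expected to create a fresh cache on its next run.
    """
    if os.path.isfile(cache_path) and os.path.getsize(cache_path) > limit:
        os.remove(cache_path)
        return True
    return False


def engine_file_name(model_name, runtime_version, compute_capability):
    """Build a hypothetical engine file name keyed by runtime version and GPU family.

    compute_capability is a (major, minor) tuple, for example (7, 5) for Turing.
    Keying on the runtime version forces a rebuild after an upgrade, and the
    variant suffix keeps the Turing engine apart from the default one.
    """
    cc = tuple(compute_capability)
    if cc == (7, 5):
        variant = "turing"
    elif cc >= (8, 0):
        variant = "ampere_plus"
    else:
        raise ValueError(f"unsupported compute capability: {cc}")
    return f"{model_name}-rtx{runtime_version}-{variant}.engine"
```

An application could call engine_file_name("unet", "1.4", (7, 5)) to locate the Turing-specific engine and rebuild it whenever the expected file is missing.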

Deprecated API Lifetime

  • APIs deprecated in TensorRT-RTX 1.4 will be retained until 3/2027.

  • APIs deprecated in TensorRT-RTX 1.3 will be retained until 12/2026.

  • APIs deprecated in TensorRT-RTX 1.2 will be retained until 10/2026.

  • APIs deprecated in TensorRT-RTX 1.1 will be retained until 8/2026.

  • APIs deprecated in TensorRT-RTX 1.0 will be retained until 6/2026.
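For teams that track deprecations in build scripts, the schedule above can be encoded as a lookup table. This is a minimal sketch: the table mirrors the dates listed above, while the helper name is_still_retained and the choice to treat the named month as inclusive are assumptions.

```python
from datetime import date

# Retention deadlines as (year, month), keyed by the TensorRT-RTX version
# in which an API was deprecated. Values mirror the list above.
RETAINED_UNTIL = {
    "1.4": (2027, 3),
    "1.3": (2026, 12),
    "1.2": (2026, 10),
    "1.1": (2026, 8),
    "1.0": (2026, 6),
}


def is_still_retained(deprecated_in, on_date):
    """Return True if APIs deprecated in `deprecated_in` are retained on `on_date`.

    Assumption: the named month is treated as inclusive.
    """
    year, month = RETAINED_UNTIL[deprecated_in]
    return (on_date.year, on_date.month) <= (year, month)


# The timing cache APIs were deprecated in 1.2.
print(is_still_retained("1.2", date(2026, 9, 1)))   # True
print(is_still_retained("1.2", date(2026, 11, 1)))  # False
```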

Fixed Issues

  • Fixed an intermittent issue in the TensorRT-RTX Flux.1 demo standalone Python script where errors could occur during runtime cache serialization when --dynamic-shape and --enable-runtime-cache options were used jointly.

  • Multi-head attention (MHA) kernels now correctly respect the shared memory limit when running in CiG mode.

  • Fixed a segmentation fault that could occur when running in CiG mode on Blackwell GPUs.

  • Fixed a crash that could occur when using the TensorRT-RTX ONNX Runtime Execution Provider with WebNN.

  • Enabled running multiple inference contexts in parallel with CUDA graph capture. Set a unique stream per context using the enqueueV3 API.

  • Fixed accuracy issues affecting certain audio and video processing models.

  • Fixed an erroneous CUDA error message that could be displayed when running in CPU-only mode without a GPU present.

  • Fixed incorrect "Thread context not set" error logging during IRuntime::getEngineValidity() calls.

  • Fixed an issue where some models running in Compute-in-Graphics (CiG) mode showed significantly reduced performance compared to non-CiG mode. This was caused by suboptimal kernel selection that prioritized compatibility with shared memory limitations over performance for certain model architectures.

Known Issues

Functional

  • NonMaxSuppression, NonZero, and Multinomial layers are not supported.

  • Only the WDDM driver for Windows is supported. The TCC driver for Windows (refer to Tesla Compute Cluster (TCC)) is unsupported and may fail with the following error:

    [E] Error[1]: [defaultAllocator.cpp::nvinfer1::internal::DefaultAllocator::allocateAsync::48] Error Code 1: Cuda Runtime (operation not supported)
    

    For instructions on changing the driver mode, refer to the Nsight Visual Studio Edition documentation.

  • When using TensorRT-RTX with the PyCUDA library in Python, use import pycuda.autoprimaryctx instead of import pycuda.autoinit in order to avoid device conflicts.

  • Depthwise convolutions and deconvolutions for BF16 precision are not supported.

  • Convolutions and deconvolutions that combine non-unit strides with non-unit dilations are not supported in any precision. Convolutions and deconvolutions that use only non-unit strides, or only non-unit dilations, are supported.

  • Execution may fail for models that use a Convolution layer immediately followed by a GeLU operation. Affects all GPU architectures; a fix is planned for the next TensorRT-RTX update.

  • Running multiple inference contexts in parallel (for example, in a thread pool) may cause occasional crashes. Affects all GPU architectures; a fix is planned for the next TensorRT-RTX update.

  • Build may fail when a MatrixMultiply, Convolution, or Deconvolution op uses FP32 inputs and outputs and is followed by a cast to a lower-precision type (for example, FP16 or FP8). Affects all GPU architectures; a fix is expected in a future TensorRT-RTX update.

  • Engine context creation may fail with a MyelinCheckException for certain FP16 dynamic-shape ONNX models (for example, Stable Diffusion XL UNet and DaVinci Resolve SpeedWarp). The engine builds and deserializes successfully, but ICudaEngine::createExecutionContext throws an error during Myelin JIT initialization. This issue affects multiple GPU architectures and is not hardware-specific. A fix is expected in a future TensorRT-RTX update.

  • On Windows, the following symbols in tensorrt_rtx_1_4.dll should not be used and will be removed in the future:

    • ?disableInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • ?enableInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • ?getInternalBuildFlags@nvinfer1@@YA_KAEBVINetworkDefinition@1@@Z

    • ?setDebugOutput@nvinfer1@@YAXAEAVIExecutionContext@1@PEAVIDebugOutput@1@@Z

    • ?setInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • nvinfer1DisableInternalBuildFlags

    • nvinfer1EnableInternalBuildFlags

    • nvinfer1GetInternalBuildFlags

    • nvinfer1SetInternalBuildFlags

  • On Windows, runtime compilation may fail if the system temporary folder path is very long. If unexpected compilation errors occur on Windows, either shorten the TEMP environment variable (for example, C:\Temp) or enable long path support.
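For the temporary-folder issue in the last item above, an application can check the path length at startup and warn the user before runtime compilation is attempted. The sketch below is only a heuristic: the release notes do not state a numeric limit, so the threshold of 100 characters is a hypothetical value chosen to leave headroom under the classic Windows MAX_PATH limit of 260 characters, and the helper name is illustrative.

```python
import tempfile

# Hypothetical threshold: the release notes give no number. The classic
# Windows MAX_PATH limit is 260 characters, so leave headroom for the
# file names generated underneath the temporary folder.
TEMP_PATH_WARN_LENGTH = 100


def temp_dir_too_long(temp_dir=None, warn_length=TEMP_PATH_WARN_LENGTH):
    """Return True if the temporary folder path is longer than warn_length."""
    path = temp_dir if temp_dir is not None else tempfile.gettempdir()
    return len(path) > warn_length


if temp_dir_too_long():
    print("Temporary folder path is long; consider setting TEMP to a short "
          "path such as C:\\Temp or enabling long path support.")
```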

Performance

  • Using the CPU-only Ahead-of-Time (AOT) feature can reduce performance for some models. Affected applications achieve the best performance if they instead perform AOT compilation on-device, targeted to the specific end-user machine. To do this, pass the --useGPU flag to the tensorrt_rtx binary or, when using the APIs, call IBuilderConfig::setComputeCapability() so that the compute capabilities contain only kCURRENT. Measure performance with both approaches to determine which is best for your application. We plan to resolve this performance discrepancy in a future release.

  • We have prioritized optimizing the performance for 16-bit floating point types, and such models will frequently achieve throughput using TensorRT-RTX that is very close to that achieved with TensorRT. Models that heavily use 32-bit floating point types will still see improvement, but performance will tend to not be as strong as that achieved using TensorRT. Expect performance across many models and data types to improve in future versions of TensorRT-RTX.

  • Convolutions and deconvolutions with large filter sizes may have degraded performance. We plan to improve performance on such cases in a future release.
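To compare the CPU-only AOT and on-device AOT approaches described in the first item above, a simple latency harness is often enough. The following sketch is generic timing code, not a TensorRT-RTX API: each candidate is a zero-argument callable supplied by the application, and it is assumed to block until inference has finished (for example, by synchronizing the CUDA stream before returning). The function names are illustrative.

```python
import statistics
import time


def median_latency_ms(run_once, warmup=3, iterations=20):
    """Return the median wall-clock latency of run_once() in milliseconds."""
    for _ in range(warmup):
        run_once()  # Warm-up runs are discarded.
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def pick_faster(candidates, warmup=3, iterations=20):
    """Time each named callable and return (best_name, latencies_by_name)."""
    results = {
        name: median_latency_ms(fn, warmup=warmup, iterations=iterations)
        for name, fn in candidates.items()
    }
    best = min(results, key=results.get)
    return best, results
```

A caller would pass something like {"cpu_only_aot": run_a, "on_device_aot": run_b}, where run_a and run_b are hypothetical functions that each execute one inference with the corresponding engine.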