TensorRT-RTX 1.5 Release Notes

These Release Notes apply to x86 Linux and Windows users. This release fixes several issues found in previous TensorRT-RTX releases and includes additional changes.

Key Features and Enhancements

Operator Support

This release adds support for additional ONNX operators:

  • TensorRT-RTX now supports the RoiAlign ONNX operator.

Platform Support

This release expands platform availability:

  • An experimental build for DGX Spark and Linux SBSA is now available.

GPU Latency Optimizations

This release improves GPU latency with the following optimizations:

  • GEMV kernel performance has been improved for models compiled with dynamic input shapes.

  • Convolution performance has been improved on the sm_121 architecture (NVIDIA GB10 / CUDA compute capability 12.1).

  • CPU overhead between kernel launches has been reduced.

  • Kernel fusion coverage has been expanded for models compiled with dynamic input shapes.

  • Just-in-time kernel generation has been improved, expanding coverage for additional convolution variants and runtime fusion patterns.

Breaking ABI and API Changes

Warning

TensorRT-RTX 1.5 is not ABI-compatible with TensorRT-RTX 1.4. To upgrade, you must update your headers and rebuild your application against the 1.5 SDK.

The following changes apply to this release. ABI Changes affect the binary interface and require an application rebuild against the 1.5 SDK; API Changes break source-level compatibility and require code modifications before rebuilding.

ABI Changes

These changes affect the binary interface. Source code typically compiles unchanged after picking up the new headers, but applications must be rebuilt and re-linked against the 1.5 SDK.

Removed Destructor and Constructor Symbols

The following types no longer export destructor or constructor symbols:

  • nvinfer1::ILogger

  • nvinfer1::ILoggerFinder

  • nvinfer1::IRuntimeCache

  • nvinfer1::ITimingCache

  • nvinfer1::IProfiler

API Changes

These changes affect source-level compatibility. Applications using the affected APIs must update their source code before rebuilding against the 1.5 SDK.

Return Type Changes

  • nvinfer1::IBuilderConfig::setMaxAuxStreams(int32_t) — return type changed from void to bool.

Removed Previously Deprecated APIs

The following APIs were previously deprecated and have been removed in this release:

  • nvinfer1::NetworkDefinitionCreationFlag:

    • kEXPLICIT_BATCH

  • nvinfer1::IRuntime:

    • deserializeCudaEngine(IStreamReader&)

  • nvinfer1::IRefitter:

    • setDynamicRange(char const*, float, float)

    • getDynamicRangeMin(char const*)

    • getDynamicRangeMax(char const*)

    • getTensorsWithDynamicRange(int32_t, char const**)

  • nvinfer1::IOptimizationProfile:

    • setShapeValues(char const*, OptProfileSelector, int32_t const*, int32_t)

    • getShapeValues(char const*, OptProfileSelector)

  • nvinfer1::ICudaEngine:

    • getProfileTensorValues(char const*, int32_t, OptProfileSelector)

    • getDeviceMemorySizeForProfile(int32_t)

    • setWeightStreamingBudget(int64_t)

    • getWeightStreamingBudget()

    • getMinimumWeightStreamingBudget()

    • hasImplicitBatchDimension()

  • nvinfer1::IExecutionContext:

    • allInputShapesSpecified()

  • nvinfer1::INetworkDefinition:

    • addFill(Dims const&, FillOperation)

  • nvinfer1::IBuilder:

    • platformHasFastFp16()

    • platformHasFastInt8()

    • platformHasTf32()

  • nvinfer1::plugin:

    • CodeTypeSSD (class removed)

  • nvinfer1::IPluginRegistry:

    • registerCreator(IPluginCreator&, AsciiChar const* const)

    • getPluginCreatorList(int32_t* const)

    • getPluginCreator(AsciiChar const* const, AsciiChar const* const, AsciiChar const* const)

    • deregisterCreator(IPluginCreator const&)

  • nvinfer1::IPluginV2Ext:

    • isOutputBroadcastAcrossBatch(int32_t, bool const*, int32_t)

    • canBroadcastInputAcrossBatch(int32_t)

  • nvonnxparser::IParser:

    • supportsModel(void const*, size_t, SubGraphCollection_t&, char const*)

    • parseWithWeightDescriptors(void const*, size_t)

Compatibility

For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT-RTX Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT-RTX 1.5.

Limitations

Warning

  • Timing cache deprecated: Using ITimingCache forces a GPU query during ahead-of-time compilation and has no effect on the built engine. The APIs were deprecated in 1.2 and will be removed in a future update.

  • Engines are not forward-compatible: TensorRT-RTX engines must be rebuilt when upgrading the TensorRT-RTX runtime library.

  • Using a timing cache through ITimingCache and related APIs forces the ahead-of-time compilation step to query your system for a GPU (that is, prevents use of CPU-only ahead-of-time compilation). This has been true since the initial release (version 1.0). The timing cache has no effect on the built engine in TensorRT-RTX. The timing cache APIs were deprecated in 1.2 and will be removed in a future update. They remain in place to avoid breaking source or binary compatibility. Applications should stop using timing cache APIs in preparation for their removal.

  • If the cache size grows too large (larger than 100 MB), it might require more overhead to serialize and deserialize from disk. If it negatively affects performance, delete the cache file and recreate one.

  • TensorRT-RTX engines are not forward-compatible with other versions of the TensorRT-RTX runtime. Ensure that any TensorRT-RTX engines you produce are run using the runtime from the same version of TensorRT-RTX that was used to generate the engine.

  • While TensorRT-RTX supports Turing (CUDA compute capability 7.5), a TensorRT-RTX engine created with default Compute Capability settings will produce an engine with support for Ampere and later GPUs, therefore excluding Turing. It is recommended to build a separate engine specifically for Turing to achieve the best performance. Creating a single engine with support for Turing and later GPUs will lead to less performant inference on the Ampere and later GPUs due to technical limitations of the engine format.
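The file-handling advice in the list above (drop a cache file once it passes 100 MB, run an engine only with the runtime version that built it, and keep a separate engine for Turing) could be sketched roughly as follows. This is a minimal illustration, not part of the TensorRT-RTX API: the helper names, the file-naming scheme, and the reading of "100 MB" as 100 * 1024 * 1024 bytes are all assumptions.

```python
import os

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
CACHE_LIMIT_BYTES = 100 * 1024 * 1024


def prune_oversized_cache(cache_path, limit_bytes=CACHE_LIMIT_BYTES):
    """Delete the cache file if it is larger than limit_bytes.

    Returns True when a file was deleted, so the caller knows the cache
    must be recreated. Returns False if the file is missing or small enough.
    """
    if not os.path.isfile(cache_path):
        return False
    if os.path.getsize(cache_path) <= limit_bytes:
        return False
    os.remove(cache_path)
    return True


def engine_filename(model_name, runtime_version, compute_capability):
    """Build a hypothetical engine file name tied to one runtime version and GPU class.

    compute_capability is a (major, minor) tuple. Turing (7, 5) gets its own
    engine; any other value is assumed to be an Ampere-or-later GPU that can
    share the default engine.
    """
    is_turing = tuple(compute_capability) == (7, 5)
    gpu_class = "turing" if is_turing else "ampere_plus"
    return f"{model_name}-trtrtx{runtime_version}-{gpu_class}.engine"
```

Embedding the runtime version in the file name means an application that upgrades the runtime simply fails to find an engine under the new name and rebuilds it, rather than loading an incompatible one.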

Deprecated and Removed Features

Several previously deprecated APIs have been removed in this release. Refer to API Changes for the complete symbol-level list.

Fixed Issues

  • Fixed an internal compiler failure that prevented YOLO ONNX models (for example, YOLOv26m and YOLOv26m-seg) from building on Turing (sm_75) GPUs. (Reported in GitHub issue #33.)

  • Fixed an internal builder error, reported as "target_sm '101' does not have arch kind assigned", that could occur when targeting newer compute capabilities.

  • Fixed an execution failure for models that use a Convolution layer immediately followed by a GeLU operation. The issue affected all GPU architectures.

  • Fixed a MyelinCheckException thrown by ICudaEngine::createExecutionContext during JIT initialization for certain FP16 dynamic-shape ONNX models (for example, Stable Diffusion XL UNet and DaVinci Resolve SpeedWarp). The engine built and deserialized successfully, but execution-context creation failed; this affected multiple GPU architectures and was not hardware-specific.

  • Fixed an accuracy regression affecting some models built with dynamic input shapes, caused by eng8 matmul kernels not reloading correctly from the kernel cache. These kernels are now excluded as candidates when dynamic shapes are in use.

  • Enabled depthwise convolutions and deconvolutions in BF16 precision, removing a known limitation from TensorRT-RTX 1.4.

  • Enabled 3D deconvolutions with groups and padding, removing a known limitation from TensorRT-RTX 1.4.

  • Fixed a MyelinCheckException thrown by ICudaEngine::createExecutionContext during JIT initialization for certain models containing static FP8 convolutions.

  • Fixed a performance regression where specialized (tuned) cuDNN kernels were not selected from the kernel cache after an IExecutionContext was destroyed and recreated against the same engine. This affected applications that hold a persistent engine while creating short-lived execution contexts (for example, per-request inference servers).

  • Fixed a performance regression affecting certain 1D convolution layers with large kernel sizes.

  • Fixed an issue on Windows where runtime compilation could occasionally fail when the system temporary folder path exceeded the MAX_PATH limit (260 characters).

  • Fixed multi-threaded compilation crashes that could occur when building models containing convolutions.

Known Issues

Functional

  • NonMaxSuppression, NonZero, and Multinomial layers are not supported.

  • Only the WDDM driver for Windows is supported. The TCC driver for Windows (refer to Tesla Compute Cluster (TCC)) is unsupported and may fail with the following error:

    [E] Error[1]: [defaultAllocator.cpp::nvinfer1::internal::DefaultAllocator::allocateAsync::48] Error Code 1: Cuda Runtime (operation not supported)
    

    For instructions on changing the driver mode, refer to the Nsight Visual Studio Edition documentation.

  • When using TensorRT-RTX with the PyCUDA library in Python, use import pycuda.autoprimaryctx instead of import pycuda.autoinit to avoid device conflicts.

  • Convolutions and deconvolutions that use non-unit strides and non-unit dilations at the same time are not supported for all precisions. Convolutions and deconvolutions that use only non-unit strides, or only non-unit dilations, are supported.

  • The build may fail when a MatrixMultiply, Convolution, or Deconvolution op uses FP32 inputs and outputs and is followed by a cast to a lower-precision type (for example, FP16 or FP8). This affects all GPU architectures; a fix is expected in a future TensorRT-RTX update.

  • On Windows, the following symbols in tensorrt_rtx_1_5.dll should not be used and will be removed in the future:

    • ?disableInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • ?enableInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • ?getInternalBuildFlags@nvinfer1@@YA_KAEBVINetworkDefinition@1@@Z

    • ?setDebugOutput@nvinfer1@@YAXAEAVIExecutionContext@1@PEAVIDebugOutput@1@@Z

    • ?setInternalBuildFlags@nvinfer1@@YAXAEAVINetworkDefinition@1@_K@Z

    • nvinfer1DisableInternalBuildFlags

    • nvinfer1EnableInternalBuildFlags

    • nvinfer1GetInternalBuildFlags

    • nvinfer1SetInternalBuildFlags
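The stride and dilation restriction in the list above can be checked before a build is attempted. The following is a minimal sketch with a hypothetical helper name. It assumes the restriction applies whenever any stride and any dilation are both non-unit, even on different axes, and that it does not depend on precision; the release notes do not spell out either detail.

```python
def conv_config_supported(strides, dilations):
    """Return False when a (de)convolution mixes non-unit strides and dilations.

    strides and dilations are sequences of ints, one entry per spatial axis.
    Assumption: a non-unit stride on one axis combined with a non-unit
    dilation on another axis also counts as the unsupported combination.
    """
    has_non_unit_stride = any(s != 1 for s in strides)
    has_non_unit_dilation = any(d != 1 for d in dilations)
    return not (has_non_unit_stride and has_non_unit_dilation)
```

A model exporter could run this over every convolution node and report the offending layers up front instead of discovering them through a failed build.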

Performance

  • Use of the CPU-only ahead-of-time (AOT) feature can lead to reduced performance for some models. Affected applications achieve the best performance if they instead perform ahead-of-time compilation on-device, targeted to the specific end-user machine. This can be done with the --useGPU flag for the tensorrt_rtx binary, or when using the APIs by setting the compute capabilities to contain only kCURRENT using IBuilderConfig::setComputeCapability(). Measure performance with both approaches to determine which is best for your application. This performance discrepancy is targeted for resolution in a future release.

  • Performance optimization in this release prioritizes 16-bit floating point types; such models frequently achieve throughput using TensorRT-RTX that is very close to that achieved with TensorRT. Models that heavily use 32-bit floating point types still see improvement, but performance tends to be lower than that achieved using TensorRT. Performance across additional models and data types is expected to improve in future versions of TensorRT-RTX.

  • Convolutions and deconvolutions with large filter sizes may have degraded performance. Performance improvements for these cases are planned for a future release.
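The advice above to measure both the CPU-only AOT engine and the on-device engine can be followed with a small timing harness. This is a sketch under stated assumptions: the function names are hypothetical, and each callable passed in is assumed to run one complete inference and block until the GPU work has finished (for example, by synchronizing its stream), because the timer only measures host wall-clock time.

```python
import statistics
import time


def median_latency_ms(run_inference, warmup=3, iterations=20):
    """Return the median wall-clock latency of run_inference in milliseconds.

    Warm-up calls are executed first and not timed, so one-time costs such as
    just-in-time compilation do not skew the result.
    """
    for _ in range(warmup):
        run_inference()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fastest_variant(variants, warmup=3, iterations=20):
    """Time each labelled callable and return (best_label, latencies_by_label)."""
    latencies = {
        label: median_latency_ms(fn, warmup=warmup, iterations=iterations)
        for label, fn in variants.items()
    }
    best_label = min(latencies, key=latencies.get)
    return best_label, latencies
```

In use, `variants` might map labels such as "cpu_only_aot" and "on_device" to closures that each execute one engine; the median is used rather than the mean so that occasional scheduling spikes have less influence.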