TensorRT-RTX 1.1 Release Notes

These release notes describe NVIDIA TensorRT-RTX 1.1 and apply to x86-64 Linux and Windows users. Use them to identify the supported software components and versions in this release, review compatibility and support information, and locate the per-component release notes for new features, fixed issues, and known limitations.

Release Highlights

  • Engine Validity API: Added the IRuntime::getEngineValidity() API to programmatically check engine file compatibility without loading the entire file into memory.

  • Faster Compilation: Compilation is faster, averaging a 1.5x speedup across a variety of model architectures, with the largest gains for models with many memory-bound kernels.

Key Features and Enhancements

API Additions

This release introduces a new API for engine validation:

  • Added the IRuntime::getEngineValidity() API to programmatically check whether a TensorRT-RTX engine file is valid on the current system or needs to be rebuilt due to incompatibilities such as software version or compute capability. This API only checks the file header and therefore does not require loading the entire engine file into memory, which results in a more efficient check.
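
To illustrate why a header-only check is cheap, the following is a minimal conceptual sketch in Python. It does not call TensorRT-RTX and does not reflect the real engine file format: the header layout, the `MAGIC` value, and the names `write_demo_engine`, `check_header`, and `Validity` are all hypothetical. It only shows the general idea of reading a few fixed bytes and comparing the recorded software version and compute capability against the current system.

```python
import struct
from enum import Enum

# Hypothetical header layout, for illustration only (NOT the real engine format):
# 4-byte magic, then major version, minor version, and compute capability
# (each an unsigned 16-bit little-endian integer).
HEADER = struct.Struct("<4sHHH")
MAGIC = b"DEMO"


class Validity(Enum):
    VALID = "valid"
    INVALID = "invalid"


def write_demo_engine(path, version, compute_capability, payload_size):
    """Write a fake engine file: a small header followed by a large opaque payload."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(MAGIC, version[0], version[1], compute_capability))
        f.write(b"\0" * payload_size)


def check_header(path, runtime_version, device_compute_capability):
    """Read only the header bytes and report whether a rebuild would be needed."""
    with open(path, "rb") as f:
        raw = f.read(HEADER.size)  # the payload is never read into memory
    if len(raw) < HEADER.size:
        return Validity.INVALID  # truncated or empty file
    magic, major, minor, cc = HEADER.unpack(raw)
    if magic != MAGIC:
        return Validity.INVALID  # not a file produced by this sketch
    if (major, minor) != tuple(runtime_version):
        return Validity.INVALID  # built by a different software version
    if cc != device_compute_capability:
        return Validity.INVALID  # built for a different compute capability
    return Validity.VALID
```

In this sketch the cost of the check is independent of the payload size, which is the property the release note describes. For the actual return values and parameters of IRuntime::getEngineValidity(), consult the TensorRT-RTX API reference.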

Compilation Performance

This release improves engine build time:

  • Compilation is faster, particularly for models with many memory-bound kernels. On average, a 1.5x speedup is observed across a variety of model architectures.

Compatibility

For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT-RTX Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT-RTX 1.1.

Limitations

Warning

  • Timing cache deprecation planned: Using ITimingCache forces a GPU query during AOT compilation and has no effect on the built engine. The APIs will be deprecated in the next version of TensorRT-RTX and removed in a future update.

  • Engines are not forward-compatible: TensorRT-RTX engines must be rebuilt when upgrading the TensorRT-RTX runtime library.

  • Using a timing cache through ITimingCache and related APIs forces the ahead-of-time (AOT) compilation step to query your system for a GPU, which prevents CPU-only AOT compilation. This was also true in version 1.0. The timing cache has no effect on the built engine in TensorRT-RTX. We intend to deprecate the timing cache APIs in the next release and remove them in a future update; they remain in place for now to avoid breaking source or binary compatibility. Applications should start removing usage of timing cache APIs in preparation for their removal.

  • If the cache grows too large (larger than 100 MB), serializing it to disk and deserializing it from disk may incur more overhead. If this negatively affects performance, delete the cache file and create a new one.

  • TensorRT-RTX engines are not forward-compatible with other versions of the TensorRT-RTX runtime. Ensure that any TensorRT-RTX engine you produce is run with the runtime from the same version of TensorRT-RTX that was used to generate it.

  • TensorRT-RTX supports Turing (CUDA compute capability 7.5), but an engine built with the default compute capability settings supports only Ampere and later GPUs and therefore excludes Turing. To achieve the best performance, build a separate engine specifically for Turing. A single engine that supports Turing and later GPUs delivers lower inference performance on Ampere and later GPUs due to technical limitations of the engine format.
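
One way an application might account for these limitations when storing files on disk is sketched below. This is only an illustration under stated assumptions and uses no TensorRT-RTX API: the file naming scheme, the target labels ("turing", "ampere_plus"), and the helper names are hypothetical, and the 100 MB threshold is interpreted here as 100 * 1024 * 1024 bytes. Keying the engine path by runtime version means an upgrade naturally triggers a rebuild, and keeping a separate Turing target avoids sharing one engine across both GPU groups.

```python
import os

# Assumption: "100 MB" is interpreted as 100 MiB for this sketch.
MAX_CACHE_BYTES = 100 * 1024 * 1024


def engine_path(cache_dir, model_name, trt_rtx_version, target):
    """Build an engine file path keyed by runtime version and GPU target.

    `target` is a hypothetical application-defined label such as
    "turing" or "ampere_plus"; it is not a TensorRT-RTX identifier.
    """
    filename = f"{model_name}.{trt_rtx_version}.{target}.engine"
    return os.path.join(cache_dir, filename)


def needs_rebuild(path):
    """An engine must be (re)built when no file exists for this version and target."""
    return not os.path.exists(path)


def drop_oversized_cache(cache_path, max_bytes=MAX_CACHE_BYTES):
    """Delete the cache file if it exceeds max_bytes; return True if it was deleted."""
    try:
        size = os.path.getsize(cache_path)
    except OSError:
        return False  # no cache file yet, nothing to do
    if size > max_bytes:
        os.remove(cache_path)  # the application recreates the cache afterwards
        return True
    return False
```

Deleting the cache unconditionally above the threshold is the simplest policy; an application could instead delete it only after measuring that load and store times have become a problem.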

Deprecated and Removed Features

The following deprecations and removals apply to, or are planned as of, TensorRT-RTX 1.1:

  • ITimingCache APIs will be deprecated in the next version of TensorRT-RTX. Refer to the Limitations section for more information.

Fixed Issues

  • Fixed a crash when parsing custom operators from ONNX. Custom operators are not supported, but the software now reports an error instead of crashing.

  • Fixed an engine build failure caused by the kREFIT or kREFIT_IDENTICAL builder flags when normalization layers exist in the network definition.

  • Fixed an execution failure for MatrixMultiply layers with batch size greater than 1 when building an engine with compute capability SM75.

  • Fixed an AOT build failure for models containing hardware-specific data types when those data types are not supported on the target compute capabilities specified through the TensorRT-RTX API.

  • Fixed a JIT build failure for specific dynamic shape cases with very large filter sizes and custom paddings.

  • Fixed a segmentation fault that could occur in dynamic shape networks containing if/else conditional blocks that branch based on tensor shapes.

  • Fixed segmentation faults caused by repeated serialization and deserialization of engines.

  • Fixed a performance regression where, on some platforms, engines built with the kCURRENT compute capability performed worse than engines built with the default compute capability.

  • Fixed an issue where creating an execution context would create and immediately destroy many threads.

Known Issues

Functional

  • NonMaxSuppression, NonZero, and Multinomial layers are not supported.

  • Only the WDDM driver for Windows is supported. The TCC driver for Windows (refer to Tesla Compute Cluster (TCC)) is unsupported and may fail with the following error:

    [E] Error[1]: [defaultAllocator.cpp::nvinfer1::internal::DefaultAllocator::allocateAsync::48] Error Code 1: Cuda Runtime (operation not supported)
    

    For instructions on changing the driver mode, refer to the Nsight Visual Studio Edition documentation.

  • When using TensorRT-RTX with the PyCUDA library in Python, use import pycuda.autoprimaryctx instead of import pycuda.autoinit in order to avoid device conflicts.

  • Depthwise convolutions and deconvolutions for BF16 precision are not supported.

  • Convolutions and deconvolutions that combine non-unit strides with non-unit dilations are not supported in any precision. Convolutions and deconvolutions with only non-unit strides, or with only non-unit dilations, are supported.
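
The layer-level restrictions in the list above can be checked before attempting a build. The following is a minimal sketch of such a preflight check, written against a hypothetical, simplified layer description (`LayerInfo`) rather than the TensorRT-RTX network API. It assumes that a depthwise convolution is one whose group count equals its input channel count, and that the stride and dilation restriction applies only when both are non-unit in the same layer.

```python
from dataclasses import dataclass

UNSUPPORTED_LAYER_KINDS = {"NonMaxSuppression", "NonZero", "Multinomial"}
CONV_KINDS = {"Convolution", "Deconvolution"}


@dataclass
class LayerInfo:
    """Hypothetical, simplified layer description used only for this sketch."""
    name: str
    kind: str
    precision: str = "fp16"
    in_channels: int = 1
    groups: int = 1
    strides: tuple = (1, 1)
    dilations: tuple = (1, 1)


def known_issue(layer):
    """Return a message if the layer hits a listed functional issue, else None."""
    if layer.kind in UNSUPPORTED_LAYER_KINDS:
        return f"{layer.name}: {layer.kind} layers are not supported"
    if layer.kind in CONV_KINDS:
        # Assumption: depthwise means one group per input channel.
        depthwise = layer.groups > 1 and layer.groups == layer.in_channels
        if depthwise and layer.precision == "bf16":
            return f"{layer.name}: depthwise {layer.kind} in BF16 is not supported"
        strided = any(s != 1 for s in layer.strides)
        dilated = any(d != 1 for d in layer.dilations)
        if strided and dilated:
            return f"{layer.name}: non-unit strides combined with non-unit dilations"
    return None


def preflight(layers):
    """Collect messages for every layer that matches a known functional issue."""
    return [msg for msg in map(known_issue, layers) if msg is not None]
```

A caller would populate `LayerInfo` objects from its own model representation and report the messages returned by `preflight` before starting a build.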

Performance

  • Use of the CPU-only Ahead-of-Time (AOT) feature can reduce performance for some models, particularly those with multi-head attention (MHA), because CPU-only AOT uses conservative shared memory limits. Affected applications achieve the best performance by instead performing AOT compilation on-device, targeted to the specific end-user machine. To do this, pass the --useGPU flag to the tensorrt_rtx binary or, when using the APIs, set the compute capabilities to contain only kCURRENT using IBuilderConfig::setComputeCapability(). Measure performance with both approaches to determine which is best for your application. We plan to resolve this performance discrepancy in a future release.

  • We have prioritized optimizing performance for 16-bit floating point types, and such models frequently achieve throughput with TensorRT-RTX that is very close to that achieved with TensorRT. Models that heavily use 32-bit floating point types still see improvement, but their performance tends to be weaker than that achieved with TensorRT. Expect performance across many models and data types to improve in future versions of TensorRT-RTX.

  • Background kernel compilations, triggered as part of the dynamic shapes specialization strategy, are opportunistic and are not guaranteed to complete before network execution finishes. If your dynamic-shapes workload contains a fixed set of shapes, consider using the eager specialization strategy along with the runtime cache to load and store kernels quickly for best performance.

  • When running in CUDA in Graphics (CiG) mode, some models show significantly reduced performance compared to non-CiG mode because they fall back to less optimal kernels that fit within the CiG shared memory limitations.

  • Some models show increased VRAM usage compared to running inference for the same model in TensorRT 10.11.

  • Convolutions and deconvolutions with large filter sizes may have degraded performance.
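
Several items above recommend measuring alternatives, for example an engine built with CPU-only AOT against one built on-device. A minimal, library-agnostic timing harness for such a comparison might look like the sketch below. The helper names are hypothetical, and each callable passed in is assumed to run one complete inference and to block until the GPU work has finished; otherwise the measured time covers only the launch.

```python
import statistics
import time


def median_latency_ms(run_once, warmup=5, iterations=50):
    """Return the median wall-clock latency of run_once() in milliseconds.

    run_once is assumed to block until its inference has completed.
    """
    for _ in range(warmup):
        run_once()  # warm-up runs are not timed
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def pick_faster(candidates, **kwargs):
    """Time each named callable and return (best_name, {name: latency_ms})."""
    results = {
        name: median_latency_ms(fn, **kwargs) for name, fn in candidates.items()
    }
    best = min(results, key=results.get)
    return best, results
```

The median is used rather than the mean so that occasional slow iterations, such as one that coincides with a background kernel compilation, do not dominate the result.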