TensorRT-RTX 1.0 Release Notes

These release notes describe NVIDIA TensorRT-RTX 1.0 and apply to x86-64 Linux and Windows users. Use them to identify the supported software components and versions in this release, review compatibility and support information, and locate the per-component release notes for new features, fixed issues, and known limitations.

Release Highlights

  • Reduced Binary Size — Smaller download size and disk footprint for improved deployment in consumer applications.

  • Two-Phase Compilation — Hardware-agnostic ahead-of-time (AOT) and hardware-specific just-in-time (JIT) optimization phases for improved user experience.

  • System Resource Adaptivity — Improved adaptivity to real-system resources for background AI features.

  • Windows ML Support — Native acceleration support for Windows ML.

Key Features and Enhancements

Compact Footprint

This release introduces a smaller binary footprint:

  • Reduced binary size for improved download speed and disk footprint when included in consumer applications.

Two-Phase Compilation

This release introduces a two-phase compilation model:

  • Optimization is split into a hardware-agnostic ahead-of-time (AOT) phase and a hardware-specific just-in-time (JIT) phase to improve the user experience.

  • Development focused on improving portability and deployment while still delivering industry-leading performance.

System Resource Adaptivity

This release improves runtime responsiveness for background AI workloads:

  • Adaptivity to real-system resources has been improved for applications where AI features run in the background.

Platform Support

This release expands platform availability:

  • Added native acceleration support for Windows ML.

Compatibility

For comprehensive platform compatibility, hardware requirements, and feature availability information, refer to the TensorRT-RTX Support Matrix. The support matrix provides detailed information about supported operating systems, CUDA versions, GPU architectures, precision modes, compiler requirements, and ONNX operator support for TensorRT-RTX 1.0.

Limitations

Warning

  • LLMs not yet optimized: The 1.0 release has not been optimized for LLMs because it lacks INT4 quantization support. This will be addressed in a subsequent release.

  • Engines are not forward-compatible: TensorRT-RTX engines are not forward-compatible with other versions of the TensorRT-RTX runtime and must be rebuilt when you upgrade the runtime library. Run every engine with the runtime from the same TensorRT-RTX version that was used to generate it.

  • If the cache file grows too large (larger than 100 MB), serializing it to disk and deserializing it from disk may require more overhead. If this negatively affects performance, delete the cache file and create a new one.

  • While TensorRT-RTX supports Turing (CUDA compute capability 7.5), an engine built with the default compute capability settings supports only Ampere and later GPUs and therefore excludes Turing. To achieve the best performance, build a separate engine specifically for Turing. A single engine that supports Turing and later GPUs delivers less performant inference on Ampere and later GPUs because of technical limitations of the engine format.
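The cache-size, runtime-version, and Turing limitations above can be handled defensively in application code. The following is a minimal sketch, not part of TensorRT-RTX: the function names, the file-naming scheme, and the reading of 100 MB as 100 × 1024 × 1024 bytes are all assumptions made for illustration. It deletes an oversized cache file so that a fresh one can be created, and it keys engine file names by runtime version and GPU family so that an engine is never loaded by a different runtime version and Turing gets its own engine.

```python
from pathlib import Path
from typing import Tuple

# Threshold from the limitation above. Interpreting "100 MB" as binary
# megabytes is an assumption.
CACHE_LIMIT_BYTES = 100 * 1024 * 1024


def drop_oversized_cache(cache_path: Path, limit: int = CACHE_LIMIT_BYTES) -> bool:
    """Delete the cache file if it is larger than `limit` bytes.

    Returns True if the file was deleted, False otherwise. The caller is
    expected to let the application recreate the cache afterwards.
    """
    if cache_path.is_file() and cache_path.stat().st_size > limit:
        cache_path.unlink()
        return True
    return False


def engine_filename(model: str, runtime_version: str,
                    compute_capability: Tuple[int, int]) -> str:
    """Build a hypothetical engine file name keyed by version and GPU family.

    Including the runtime version in the name means that an upgraded runtime
    finds no matching file and the engine is rebuilt instead of reused.
    """
    if compute_capability >= (8, 0):
        # Ampere (compute capability 8.x) and later: default-built engine.
        family = "ampere_plus"
    elif compute_capability == (7, 5):
        # Turing: separate engine, as recommended above.
        family = "turing"
    else:
        raise ValueError(f"unsupported compute capability: {compute_capability}")
    return f"{model}.rtx{runtime_version}.{family}.engine"
```

For example, `engine_filename("resnet50", "1.0", (7, 5))` yields a Turing-specific name, while any compute capability of 8.0 or higher maps to the shared Ampere-and-later engine.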

Deprecated and Removed Features

The following features have been deprecated or removed in TensorRT-RTX 1.0:

  • Weakly-typed networks, a feature of TensorRT, are not supported in TensorRT-RTX. The APIs for weakly-typed networks are still included in TensorRT-RTX, but they are marked as deprecated and exist only to simplify porting existing code. These APIs do nothing in TensorRT-RTX and in some cases cause warnings to be logged to ILogger. Deprecated methods include, for example, ITensor::setType(), ILayer::setPrecision(), and ILayer::precisionIsSet().

  • The HardwareCompatibilityLevel builder config is not supported in TensorRT-RTX. By default, a TensorRT-RTX engine can run on any NVIDIA Ampere and later GPU.

Known Issues

Functional

  • NonMaxSuppression, NonZero, and Multinomial layers are not supported.

  • When the builder flag kREFIT or kREFIT_IDENTICAL is set, TensorRT-RTX may fail to build a network containing normalization layers.

  • When building an engine with compute capability SM75, MatrixMultiply layers with batch size greater than 1 might fail to execute due to an alignment error.

  • Only the WDDM driver for Windows is supported. The Tesla Compute Cluster (TCC) driver for Windows is unsupported and may fail with the following error:

    [E] Error[1]: [defaultAllocator.cpp::nvinfer1::internal::DefaultAllocator::allocateAsync::48] Error Code 1: Cuda Runtime (operation not supported)
    

    For instructions on changing the driver mode, refer to the Nsight Visual Studio Edition documentation.

  • When using TensorRT-RTX with the PyCUDA library in Python, use import pycuda.autoprimaryctx instead of import pycuda.autoinit in order to avoid device conflicts.

  • A model containing hardware-specific data types may be compiled successfully in the AOT build phase with the ComputeCapability flag but still fail to run on devices with the specified compute capability. For example, an FP8 model may be successfully compiled with ComputeCapability::kSM75 but will still fail to run because Turing GPUs do not support FP8.
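Because the AOT build phase may accept a model that the target device cannot run, an application can check data types against the target compute capability before building. The sketch below is illustrative only: the table and function are hypothetical, and the FP8 threshold of compute capability 8.9 is an assumption that should be verified against the TensorRT-RTX Support Matrix. The release notes themselves only state that Turing (7.5) does not support FP8.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed minimum compute capability per hardware-specific data type.
# Only the fact that Turing (7.5) lacks FP8 comes from the release notes;
# the (8, 9) value is an assumption to confirm in the Support Matrix.
MIN_COMPUTE_CAPABILITY: Dict[str, Tuple[int, int]] = {
    "FP8": (8, 9),
}


def unsupported_dtypes(model_dtypes: Iterable[str],
                       target_cc: Tuple[int, int]) -> List[str]:
    """Return the data types in the model that the target GPU cannot run.

    Data types missing from the table are treated as supported everywhere.
    """
    return sorted(
        dtype
        for dtype in set(model_dtypes)
        if target_cc < MIN_COMPUTE_CAPABILITY.get(dtype, (0, 0))
    )
```

Calling `unsupported_dtypes(["FP16", "FP8"], (7, 5))` flags FP8 for a Turing target, which lets the application reject the combination before compiling rather than failing at run time.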

Performance

  • Use of the CPU-only ahead-of-time (AOT) feature can reduce performance for some models, particularly those with multi-head attention (MHA), because CPU-only AOT uses conservative shared memory limits. Affected applications achieve the best performance if they instead perform AOT compilation on-device, targeted to the specific end-user machine. To do so, pass the --useGPU flag to the tensorrt_rtx binary or, when using the APIs, set the compute capabilities to contain only kCURRENT using IBuilderConfig::setComputeCapability(). Measure performance with both approaches to determine which is best for your application. We plan to resolve this performance discrepancy in a future release.

  • We have prioritized optimizing the performance for 16-bit floating point types, and such models will frequently achieve throughput using TensorRT-RTX that is very close to that achieved with TensorRT. Models that heavily use 32-bit floating point types will still see improvement, but performance will tend to not be as strong as that achieved using TensorRT. Expect performance across many models and data types to improve in future versions of TensorRT-RTX.
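To compare CPU-only AOT against on-device AOT as suggested above, a simple timing harness is enough. The following is a minimal sketch using only the Python standard library. The `run_once` callable is a hypothetical stand-in for one inference call with a given engine, and it is assumed to block until the GPU work has finished, since otherwise the timings would measure only launch overhead.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(run_once: Callable[[], None],
                      warmup: int = 3, iterations: int = 20) -> float:
    """Return the median wall-clock latency of `run_once` in milliseconds.

    `run_once` must synchronize with the GPU before returning (assumption).
    Warm-up calls are executed first and are not timed.
    """
    for _ in range(warmup):
        run_once()

    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Running the harness once with a callable bound to the CPU-only AOT engine and once with a callable bound to the on-device AOT engine gives two medians that can be compared directly. The median is used rather than the mean so that occasional outliers do not dominate the result.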