NVIDIA TensorRT for RTX Documentation

NVIDIA TensorRT for RTX builds on the proven performance of the NVIDIA TensorRT inference library and simplifies the deployment of AI models on NVIDIA RTX GPUs across desktops, laptops, and workstations. It introduces a Just-In-Time (JIT) optimizer in the runtime that compiles optimized inference engines directly on the end-user’s RTX-accelerated PC.

The two-phase compilation process (AOT + JIT) typically completes in under 30 seconds total, eliminating lengthy per-device pre-compilation steps and enabling rapid engine generation with improved application portability. TensorRT for RTX is a compact (under 200 MB) drop-in replacement for NVIDIA TensorRT that targets NVIDIA RTX GPUs from the NVIDIA Turing generation (compute capability 7.5) through the NVIDIA Blackwell generation (compute capability 12.0).
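
To illustrate the supported GPU range described above, the following minimal sketch shows one way an application could check a device's compute capability before selecting TensorRT-RTX. It is not part of the TensorRT-RTX API: `parse_compute_capability` and `is_in_range` are hypothetical helpers, and the caller supplies both bounds. The lower bound of 7.5 (Turing) comes from the text. The 12.0 upper bound used for RTX-class Blackwell in the demo is an assumption and should be confirmed against the Support Matrix.

```python
from typing import Tuple

# A compute capability is modeled as a (major, minor) pair, e.g. (7, 5).
ComputeCapability = Tuple[int, int]


def parse_compute_capability(text: str) -> ComputeCapability:
    """Parse a string such as "7.5" into a (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def is_in_range(
    cc: ComputeCapability,
    lowest: ComputeCapability,
    highest: ComputeCapability,
) -> bool:
    """Return True if lowest <= cc <= highest.

    Tuples compare by major version first, then by minor version.
    """
    return lowest <= cc <= highest


if __name__ == "__main__":
    lowest = parse_compute_capability("7.5")    # Turing, taken from the text
    highest = parse_compute_capability("12.0")  # assumed bound, verify first

    # Hypothetical sample devices: a Pascal-class and an Ampere-class GPU.
    for label, cc_text in [("Pascal-class", "6.1"), ("Ampere-class", "8.6")]:
        cc = parse_compute_capability(cc_text)
        print(label, cc_text, "in range:", is_in_range(cc, lowest, highest))
```

Tuples are used instead of floats because a capability written as 7.10 would parse to the float 7.1 and sort below 7.5, whereas (7, 10) correctly sorts above (7, 5).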

In the following documentation, TensorRT for RTX is referred to as TensorRT-RTX.

Quick Start

🆕 New to NVIDIA TensorRT-RTX? → Deploy Your First Model walks you through building an engine from an ONNX model and running inference in minutes.
⬆️ Upgrading from 1.4 or earlier? → Refer to What’s New in 1.5.
🔧 Need help with a specific task? → Jump to the Inference Library for API walkthroughs, dynamic shapes, CUDA graphs, and more.

🆕 What’s New in NVIDIA TensorRT-RTX 1.5

Release Highlights

DGX Spark / Linux SBSA (experimental) — New experimental build for NVIDIA DGX Spark (NVIDIA GB10, compute capability 12.1) and ARM64 Linux SBSA platforms (Ubuntu 22.04 / 24.04). Refer to Support Matrix and Prerequisites for the supported hardware and operating systems.
CUDA 13.3 Support — Compatible with NVIDIA CUDA 13.3, with continued support for CUDA 12.9 Update 1. Refer to Support Matrix for the full compiler and runtime version list.
Qwen3.5 Support — Qwen3.5 dense models are supported through Windows ML with the TensorRT-RTX execution provider. Refer to Architecture Overview for LLM and Windows ML integration details.
Operator Support — Added support for the RoiAlign ONNX operator. Refer to Operators for the full ONNX operator catalog.
GPU Latency Optimizations — Faster GEMV kernels for dynamic input shapes, reduced CPU overhead between kernel launches, expanded kernel fusion coverage for dynamic shapes, and improved just-in-time kernel generation for additional convolution variants and runtime fusion patterns. FP16 convolution with batch size 1 is accelerated by a new backend on NVIDIA Ampere and later architectures. Convolution performance is also improved on the NVIDIA GB10 (sm_121) architecture. Refer to GPU Latency Optimizations for the per-optimization list.
Stability and Accuracy Fixes — Resolved issues with YOLO ONNX model builds on Turing (sm_75), fixed FP16 dynamic-shape execution-context errors affecting models such as Stable Diffusion XL UNet and DaVinci Resolve SpeedWarp, fixed a dynamic-shape accuracy regression, enabled BF16 depthwise convolutions and deconvolutions, and enabled 3D deconvolutions with groups and padding. Refer to Fixed Issues for per-issue details.

View 1.5 Release Notes

Previous Releases

Note

For detailed changelogs, refer to the TensorRT-RTX GitHub Releases.