NVIDIA TensorRT for RTX Documentation
NVIDIA TensorRT for RTX builds on the proven performance of the NVIDIA TensorRT inference library and simplifies the deployment of AI models on NVIDIA RTX GPUs across desktops, laptops, and workstations. It introduces a Just-In-Time (JIT) optimizer in the runtime that compiles optimized inference engines directly on the end-user's RTX-accelerated PC.
The two-phase compilation process (AOT + JIT) typically completes in under 30 seconds total, eliminating lengthy per-device pre-compilation steps and enabling rapid engine generation with improved application portability. TensorRT for RTX is a compact, under 200 MB, drop-in replacement for NVIDIA TensorRT targeting NVIDIA RTX GPUs from NVIDIA Turing (compute capability 7.5) through NVIDIA Blackwell (compute capability 12.0) generations.
In the following documentation, TensorRT for RTX is referred to as TensorRT-RTX.
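As a rough illustration of the supported GPU range described above, the following sketch maps a compute capability string to an RTX generation. The table is an assumption made for this example: 7.5 (Turing) is the lower bound quoted above, 12.0 is the compute capability of RTX Blackwell GPUs, and 8.6 and 8.9 are the RTX Ampere and Ada Lovelace values from general CUDA knowledge. The helper names are hypothetical and are not part of TensorRT-RTX; the Support Matrix remains the authoritative list.

```python
from typing import Optional, Tuple

# Assumed lookup table for illustration only; not an official support list.
# (major, minor) compute capability -> RTX GPU generation.
RTX_GENERATIONS = {
    (7, 5): "Turing",
    (8, 6): "Ampere",        # assumption: not stated on this page
    (8, 9): "Ada Lovelace",  # assumption: not stated on this page
    (12, 0): "Blackwell",
}


def parse_compute_capability(text: str) -> Tuple[int, int]:
    """Parse strings such as '8.9' or 'sm_89' into a (major, minor) tuple."""
    cleaned = text.strip().lower()
    if cleaned.startswith("sm_"):
        digits = cleaned[3:]
        if len(digits) < 2 or not digits.isdigit():
            raise ValueError(f"unrecognized architecture string: {text!r}")
        # The last digit is the minor version; the rest is the major version.
        return int(digits[:-1]), int(digits[-1])
    major, _, minor = cleaned.partition(".")
    return int(major), int(minor or 0)


def rtx_generation(text: str) -> Optional[str]:
    """Return the generation name from the assumed table, or None if absent."""
    return RTX_GENERATIONS.get(parse_compute_capability(text))


if __name__ == "__main__":
    for cc in ("7.5", "sm_89", "12.0", "sm_121"):
        print(cc, "->", rtx_generation(cc))
```

Architectures outside the table, such as `sm_121` (NVIDIA GB10), return `None` here because this sketch only covers the generally supported RTX range.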
Quick Start
New to NVIDIA TensorRT-RTX? Deploy Your First Model walks you through building an engine from an ONNX model and running inference in minutes.
Upgrading from 1.4 or earlier? Refer to What's New in 1.5.
Need help with a specific task? Jump to the Inference Library for API walkthroughs, dynamic shapes, CUDA graphs, and more.
What's New in NVIDIA TensorRT-RTX 1.5
Release Highlights
DGX Spark / Linux SBSA (experimental) - New experimental build for NVIDIA DGX Spark (NVIDIA GB10, compute capability 12.1) and ARM64 Linux SBSA platforms (Ubuntu 22.04 / 24.04). Refer to Support Matrix and Prerequisites for the supported hardware and operating systems.
CUDA 13.3 Support - Compatible with NVIDIA CUDA 13.3, with continued support for CUDA 12.9 Update 1. Refer to Support Matrix for the full compiler and runtime version list.
Qwen3.5 Support - Qwen3.5 dense models are supported through Windows ML with the TensorRT-RTX execution provider. Refer to Architecture Overview for LLM and Windows ML integration details.
Operator Support - Added support for the RoiAlign ONNX operator. Refer to Operators for the full ONNX operator catalog.
GPU Latency Optimizations - Faster GEMV kernels for dynamic input shapes, reduced CPU overhead between kernel launches, expanded kernel fusion coverage for dynamic shapes, and improved just-in-time kernel generation for additional convolution variants and runtime fusion patterns. Convolution in FP16 precision with batch size 1 has been accelerated with a new backend on Ampere and later. Convolution performance was also improved on the NVIDIA GB10 (sm_121) architecture. Refer to GPU Latency Optimizations for the per-optimization list.
Stability and Accuracy Fixes - Resolved YOLO ONNX model builds on Turing (sm_75), fixed FP16 dynamic-shape execution-context errors affecting models such as Stable Diffusion XL UNet and DaVinci Resolve SpeedWarp, fixed a dynamic-shape accuracy regression, enabled BF16 depthwise convolutions and deconvolutions, and enabled 3D deconvolutions with groups and padding. Refer to Fixed Issues for per-issue details.
Previous Releases
Release 1.4 Highlights
CUDA 13.2 Support - Compatible with NVIDIA CUDA 13.2 Toolkit
PyPI Availability - Install TensorRT-RTX Python bindings directly from PyPI with pip install tensorrt-rtx
API Capture and Replay - New debugging feature that records TensorRT-RTX API calls during engine building and replays them for issue reproduction without requiring the original application or model source code. Refer to the API Capture and Replay documentation. (Linux only)
GPU Latency Optimizations - Improved performance with optimized 1D convolution kernels, optimized GEMV kernels, new Windows backend for batch size = 1 convolutions, improved JIT compilation heuristics, faster JIT compilation, and enhanced multi-head attention (MHA) performance
Parallel CUDA Graph Capture - Enabled running multiple inference contexts in parallel with CUDA graph capture using unique streams per context
Compute-in-Graphics (CiG) Improvements - Fixed performance issues and segmentation faults on Blackwell GPUs, improved MHA kernel shared memory handling
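After installing the Python bindings from PyPI as described in the 1.4 highlights, a quick way to confirm that the package is visible to the current interpreter is to look up its import spec without importing it. The sketch below uses only the Python standard library. The import name `tensorrt_rtx` is an assumption for illustration (the underscore form of the PyPI package name), and `is_module_installed` is a hypothetical helper, not part of TensorRT-RTX.

```python
import importlib.util


def is_module_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import name for the tensorrt-rtx PyPI package; adjust it if the
# installed package exposes a different module name.
MODULE_NAME = "tensorrt_rtx"

if is_module_installed(MODULE_NAME):
    print(f"{MODULE_NAME} is available in this environment")
else:
    print(f"{MODULE_NAME} not found; try: pip install tensorrt-rtx")
```

Looking up the spec instead of importing keeps the check cheap and avoids loading GPU libraries just to verify the installation.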
Archived Releases
Earlier TensorRT-RTX 1.x releases with key highlights:
1.3 Release Notes - Thread-Safe Multi-GPU Execution, LLM and Convolution Performance, Blackwell CUDA Graphics Mode, FP8 Performance, CUDA 12.9 / 13.1 Support
1.2 Release Notes - CUDA Graphs Support, User Memory Allocation, CUDA 13.0 Support, Library Reorganization
1.1 Release Notes - Engine Validity API, Faster Compilation
1.0 Release Notes - Reduced Binary Size, Two-Phase Compilation, System Resource Adaptivity, Windows ML Support
Note
For detailed changelogs, refer to the TensorRT-RTX GitHub Releases.