NVIDIA TensorRT for RTX Documentation
NVIDIA TensorRT for RTX builds on the proven performance of the NVIDIA TensorRT inference library and simplifies the deployment of AI models on NVIDIA RTX GPUs across desktops, laptops, and workstations. It introduces a Just-In-Time (JIT) optimizer in the runtime that compiles optimized inference engines directly on the end user's RTX-accelerated PC.
The two-phase compilation process (AOT + JIT) typically completes in under 30 seconds total, eliminating lengthy per-device pre-compilation steps and enabling rapid engine generation with improved application portability. TensorRT for RTX is a compact (under 200 MB) drop-in replacement for NVIDIA TensorRT that targets NVIDIA RTX GPUs from the NVIDIA Turing generation (compute capability 7.5) through the NVIDIA Blackwell generation (compute capability 12.0 on RTX GPUs).
In the following documentation, TensorRT for RTX is referred to as TensorRT-RTX.
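The supported range described above starts at NVIDIA Turing (compute capability 7.5). As a minimal sketch of how an application might gate on that floor before loading TensorRT-RTX, the snippet below parses a "major.minor" compute capability string and compares it against 7.5. The function names, and the assumption that the capability arrives as a string, are illustrative only and are not part of the TensorRT-RTX API.

```python
# Lower bound quoted in the overview: NVIDIA Turing, compute capability 7.5.
MIN_COMPUTE_CAPABILITY = (7, 5)


def parse_compute_capability(text):
    """Parse a 'major.minor' string such as '8.9' into a (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def meets_minimum(text, minimum=MIN_COMPUTE_CAPABILITY):
    """Return True if the compute capability is at or above the minimum."""
    # Tuples compare element by element, so (8, 9) >= (7, 5) and (6, 1) < (7, 5).
    return parse_compute_capability(text) >= minimum
```

This checks only the lower bound; whether a given GPU is actually supported also depends on it being an RTX device, which a compute capability comparison alone cannot establish.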
Quick Start
New to NVIDIA TensorRT-RTX? Deploy Your First Model walks you through building an engine from an ONNX model and running inference in minutes.
Upgrading from 1.3 or earlier? Refer to What's New in 1.4 below.
Need help with a specific task? Jump to the Inference Library for API walkthroughs, dynamic shapes, CUDA graphs, and more.
What's New in TensorRT-RTX 1.4
Release Highlights
CUDA 13.2 Support: Compatible with NVIDIA CUDA 13.2 Toolkit
PyPI Availability: Install TensorRT-RTX Python bindings directly from PyPI with
pip install tensorrt-rtx
API Capture and Replay: New debugging feature that records TensorRT-RTX API calls during engine building and replays them for issue reproduction without requiring the original application or model source code. See API Capture and Replay documentation. (Linux only)
GPU Latency Optimizations: Improved performance with optimized 1D convolution kernels, optimized GEMV kernels, new Windows backend for batch size = 1 convolutions, improved JIT compilation heuristics, faster JIT compilation, and enhanced multi-head attention (MHA) performance
Parallel CUDA Graph Capture: Enabled running multiple inference contexts in parallel with CUDA graph capture using unique streams per context
Compute-in-Graphics (CiG) Improvements: Fixed performance issues and segmentation faults on Blackwell GPUs, improved MHA kernel shared memory handling
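After the pip install step in the PyPI Availability item, a quick way to confirm that the bindings are visible to the current interpreter is to look the module up without importing it. The sketch below assumes the import name is tensorrt_rtx (inferred from the package name, not stated in this page); the helper name is hypothetical.

```python
import importlib.util

# Assumed import name for the PyPI package "tensorrt-rtx"; adjust if it differs.
ASSUMED_MODULE_NAME = "tensorrt_rtx"


def bindings_available(module_name=ASSUMED_MODULE_NAME):
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when no such top-level module is installed,
    # and it does not execute the module's code.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("TensorRT-RTX bindings found:", bindings_available())
```

A result of False usually means the package was installed into a different Python environment than the one running the check.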
Previous Releases
Release 1.3 Highlights
Thread-Safe Multi-GPU Execution: Enabled thread-safe execution for multiple GPUs with different compute capabilities, up to one network per thread
LLM and Convolution Performance: Improved throughput for LLMs with short prompt lengths, INT8 weight-only-quantization for IMatrixMultiplyLayers, and additional kernel fusion patterns for convolution-based models
Blackwell CUDA Graphics Mode: Supports CUDA contexts created in NVIDIA CUDA graphics mode on Blackwell devices
FP8 Performance: Improved performance for many FP8 models on Blackwell
CUDA 12.9 / 13.1 Support: Compatible with NVIDIA CUDA 12.9 and CUDA 13.1
Archived Releases
Earlier TensorRT-RTX 1.x releases with key highlights:
1.2 Release Notes - CUDA Graphs Support, User Memory Allocation, CUDA 13.0 Support, Library Reorganization
1.1 Release Notes - Engine Validity API, Faster Compilation
1.0 Release Notes - Reduced Binary Size, Two-Phase Compilation, System Resource Adaptivity, Windows ML Support
Note
For detailed changelogs, refer to the TensorRT-RTX GitHub Releases.