NVIDIA TensorRT for RTX Documentation
NVIDIA TensorRT for RTX builds on the proven performance of the NVIDIA TensorRT inference library and simplifies the deployment of AI models on NVIDIA RTX GPUs across desktops, laptops, and workstations. It introduces a Just-In-Time (JIT) optimizer in the runtime that compiles optimized inference engines directly on the end-user's RTX-accelerated PC.
The two-phase compilation process (AOT + JIT) typically completes in under 30 seconds total, eliminating lengthy per-device pre-compilation steps and enabling rapid engine generation with improved application portability. TensorRT for RTX is a compact (under 200 MB) drop-in replacement for NVIDIA TensorRT targeting NVIDIA RTX GPUs from the NVIDIA Turing (compute capability 7.5) through NVIDIA Blackwell (compute capability 10.0) generations.
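The supported hardware window lends itself to a simple programmatic check. The following is an illustrative Python sketch; the helper name and the tuple encoding of compute capability are assumptions for illustration, not part of the TensorRT-RTX API:

```python
# Illustrative helper (not a TensorRT-RTX API): TensorRT-RTX targets RTX
# GPUs from Turing (compute capability 7.5) through Blackwell (10.0).
SUPPORTED_CC_RANGE = ((7, 5), (10, 0))

def is_supported_compute_capability(major: int, minor: int) -> bool:
    """Return True if (major, minor) falls within the supported window."""
    low, high = SUPPORTED_CC_RANGE
    return low <= (major, minor) <= high

# Ada Lovelace RTX GPUs report compute capability 8.9 -> supported.
print(is_supported_compute_capability(8, 9))   # True
# Volta (7.0) predates Turing -> not supported.
print(is_supported_compute_capability(7, 0))   # False
```

Encoding compute capability as a `(major, minor)` tuple makes the range test a single chained comparison, since Python compares tuples lexicographically.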
In the following documentation, TensorRT for RTX is referred to as TensorRT-RTX.
What's New in TensorRT-RTX 1.4
Latest Release Highlights
CUDA 13.2 Support – Compatible with the NVIDIA CUDA 13.2 Toolkit
API Capture and Replay – New debugging feature that records TensorRT-RTX API calls during engine building and replays them for issue reproduction without requiring the original application or model source code. See the API Capture and Replay documentation. (Linux only)
GPU Latency Optimizations – Improved performance with optimized 1D convolution kernels, optimized GEMV kernels, a new Windows backend for batch size = 1 convolutions, improved JIT compilation heuristics, faster JIT compilation, and enhanced multi-head attention (MHA) performance
Parallel CUDA Graph Capture – Enabled running multiple inference contexts in parallel with CUDA graph capture, using a unique stream per context
Compute-in-Graphics (CiG) Improvements – Fixed performance issues and segmentation faults on Blackwell GPUs, improved MHA kernel shared memory handling
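The API capture idea above can be illustrated with a generic record-and-replay sketch. This is pure Python using only the standard library; the `Builder` stand-in, log format, and helper names are illustrative assumptions, not the actual TensorRT-RTX capture format:

```python
import json

class Builder:
    """Stand-in for an engine-building API; illustrative only."""
    def __init__(self):
        self.ops = []
    def add_layer(self, kind):
        self.ops.append(kind)

class Recorder:
    """Wraps a target object and logs every method call so the same
    call sequence can be replayed later without the original caller."""
    def __init__(self, target):
        self._target = target
        self.log = []
    def __getattr__(self, name):
        fn = getattr(self._target, name)
        def wrapper(*args):
            self.log.append({"call": name, "args": list(args)})
            return fn(*args)
        return wrapper

def replay(target, log):
    """Re-issue the recorded calls against a fresh target."""
    for entry in log:
        getattr(target, entry["call"])(*entry["args"])

original = Builder()
recorded = Recorder(original)
recorded.add_layer("conv")
recorded.add_layer("relu")

# The log is plain data, so it can be serialized and shared for
# issue reproduction without the application's source code.
captured = json.dumps(recorded.log)

fresh = Builder()
replay(fresh, json.loads(captured))
print(fresh.ops)  # ['conv', 'relu']
```

The key property, mirrored from the feature description, is that replay needs only the serialized call log, not the program that produced it.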
What You'll Find Here
Getting Started – Release notes and platform support matrix
Installing TensorRT-RTX – Prerequisites, step-by-step setup for Windows and Linux, first model deployment, and ONNX conversion guide
Architecture – Performance benchmarks, two-phase compilation pipeline, model specification paths, and relation to other TensorRT ecosystem libraries
Inference Library – Native runtime API tutorial, C++ and Python APIs, dynamic shapes, runtime cache, CUDA graphs, compute-in-graphics, CPU engines, and porting from TensorRT
Performance – Best practices for optimization and using tensorrt_rtx for benchmarking
API – Complete C++ and Python API references
Reference – Operator support, deprecation policy, cybersecurity disclosures, and SLA
To learn more about TensorRT for RTX's C++ and Python API usage, refer to our GitHub: NVIDIA TensorRT-RTX project.
Previous Releases
Release 1.3 Highlights
Thread-Safe Multi-GPU Execution – Enabled thread-safe execution across multiple GPUs with different compute capabilities, up to one network per thread
LLM and Convolution Performance – Improved throughput for LLMs with short prompt lengths, INT8 weight-only quantization for IMatrixMultiplyLayer layers, and additional kernel fusion patterns for convolution-based models
Blackwell CUDA Graphics Mode – Supports CUDA contexts created in NVIDIA CUDA graphics mode on Blackwell devices
FP8 Performance – Improved performance for many FP8 models on Blackwell
CUDA 12.9 / 13.1 Support – Compatible with NVIDIA CUDA 12.9 and CUDA 13.1
Release 1.2 Highlights
CUDA Graphs Support – Built-in CUDA Graphs with automatic dynamic-shape support, enabling one-line changes that accelerate inference workflows by reducing GPU kernel launch overhead
User Memory Allocation – New kREQUIRE_USER_ALLOCATION builder flag and IExecutionContext::isStreamCapturable() API for CUDA stream capture workflows
CUDA 13.0 Support – Compatible with the NVIDIA CUDA 13.0 Toolkit
Library Reorganization – DLL libraries moved from the lib to the bin subdirectory
Release 1.1 Highlights
Engine Validity API – Added the IRuntime::getEngineValidity() API to programmatically check engine file compatibility without loading the entire file into memory
Faster Compilation – Greatly improved compilation time, with an average 1.5x improvement across a variety of model architectures, particularly for models with many memory-bound kernels
Release 1.0 Highlights
Reduced Binary Size – Smaller download size and disk footprint for improved deployment in consumer applications
Two-Phase Compilation – Hardware-agnostic ahead-of-time (AOT) and hardware-specific just-in-time (JIT) optimization phases for improved user experience
System Resource Adaptivity – Improved adaptivity to real system resources for background AI features
Windows ML Support – Native acceleration support for Windows ML
Note
For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT-RTX GitHub Releases.