NVIDIA TensorRT for RTX Documentation#

NVIDIA TensorRT for RTX builds on the proven performance of the NVIDIA TensorRT inference library and simplifies the deployment of AI models on NVIDIA RTX GPUs across desktops, laptops, and workstations. It introduces a Just-In-Time (JIT) optimizer in the runtime that compiles optimized inference engines directly on the end-user's RTX-accelerated PC.

The two-phase compilation process, ahead-of-time (AOT) plus just-in-time (JIT), typically completes in under 30 seconds total, eliminating lengthy per-device pre-compilation steps and enabling rapid engine generation with improved application portability. TensorRT for RTX is a compact (under 200 MB) drop-in replacement for NVIDIA TensorRT that targets NVIDIA RTX GPUs from the NVIDIA Turing (compute capability 7.5) through NVIDIA Blackwell (compute capability 12.0) generations.
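To make the two phases concrete, here is a minimal C++ sketch, under the assumption that TensorRT-RTX mirrors the standard TensorRT C++ API (the `nvinfer1` namespace, `NvInfer.h`/`NvOnnxParser.h`, `buildSerializedNetwork`, `deserializeCudaEngine`); consult the Inference Library section for the exact TensorRT-RTX entry points.

```cpp
// Sketch only: assumes TensorRT-RTX mirrors the TensorRT C++ API; names may differ.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <fstream>
#include <memory>

class Logger : public nvinfer1::ILogger {
    void log(Severity sev, const char* msg) noexcept override {
        if (sev <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};

int main() {
    Logger logger;

    // Phase 1 (AOT, can run on any build machine): parse the ONNX model and
    // serialize a hardware-agnostic engine.
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
    auto parser  = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));
    parser->parseFromFile("model.onnx", 0);
    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    auto plan   = std::unique_ptr<nvinfer1::IHostMemory>(builder->buildSerializedNetwork(*network, *config));
    std::ofstream("model.engine", std::ios::binary)
        .write(static_cast<const char*>(plan->data()), plan->size());

    // Phase 2 (JIT, on the end user's RTX PC): deserializing the engine lets the
    // runtime specialize it just-in-time for the local GPU.
    auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
    auto engine  = std::unique_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(plan->data(), plan->size()));
    auto context = std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext());
    // ... bind input/output tensors and call context->enqueueV3(stream) ...
}
```

In practice, phase 1 runs once at build or install time and the serialized engine ships with the application; phase 2 runs on first load on the user's machine.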

In the following documentation, TensorRT for RTX is referred to as TensorRT-RTX.

🆕 What's New in TensorRT-RTX 1.4#

Latest Release Highlights

  • CUDA 13.2 Support — Compatible with the NVIDIA CUDA 13.2 Toolkit

  • API Capture and Replay — New debugging feature that records TensorRT-RTX API calls during engine building and replays them to reproduce issues without the original application or model source code. See the API Capture and Replay documentation. (Linux only)

  • GPU Latency Optimizations — Improved performance from optimized 1D convolution kernels, optimized GEMV kernels, a new Windows backend for batch-size-1 convolutions, improved JIT compilation heuristics, faster JIT compilation, and enhanced multi-head attention (MHA) performance

  • Parallel CUDA Graph Capture — Multiple inference contexts can now run in parallel with CUDA graph capture by using a unique stream per context

  • Compute-in-Graphics (CiG) Improvements — Fixed performance issues and segmentation faults on Blackwell GPUs, and improved MHA kernel shared-memory handling

View 1.4 Release Notes
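The parallel CUDA graph capture feature above pairs each execution context with its own stream. A sketch of that per-stream capture pattern follows, using only CUDA runtime calls; the TensorRT-RTX enqueue is left as a comment, since the context setup and `enqueueV3` name are assumptions about the API here.

```cuda
// Per-context stream capture: each inference context gets a unique stream,
// so multiple captures can proceed in parallel from different threads.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void captureAndLaunch(int /*contextIndex*/) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);  // one unique stream per context

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    // Thread-local capture mode keeps concurrent captures independent.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
    // contexts[contextIndex]->enqueueV3(stream);  // record this context's inference
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, 0);
    cudaGraphLaunch(graphExec, stream);  // replay the whole graph in one launch
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 2; ++i) workers.emplace_back(captureAndLaunch, i);
    for (auto& t : workers) t.join();
}
```

Replaying an instantiated graph replaces many individual kernel launches with a single `cudaGraphLaunch`, which is where the launch-overhead savings come from.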

What You'll Find Here#

  • 📋 Getting Started — Release notes and platform support matrix

  • 📦 Installing TensorRT-RTX — Prerequisites, step-by-step setup for Windows and Linux, first model deployment, and ONNX conversion guide

  • 🏗️ Architecture — Performance benchmarks, the two-phase compilation pipeline, model specification paths, and relation to other TensorRT ecosystem libraries

  • 🔧 Inference Library — Native runtime API tutorial, C++ and Python APIs, dynamic shapes, runtime cache, CUDA graphs, compute-in-graphics, CPU engines, and porting from TensorRT

  • ⚡ Performance — Best practices for optimization and using tensorrt_rtx for benchmarking

  • 📚 API — Complete C++ and Python API references

  • 📖 Reference — Operator support, deprecation policy, cybersecurity disclosures, and SLA
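For the benchmarking workflow in the Performance section, a hypothetical command-line sketch; it assumes the tensorrt_rtx tool accepts trtexec-style flags (`--onnx`, `--saveEngine`, `--loadEngine`), so verify the exact options against `tensorrt_rtx --help`.

```shell
# Assumed trtexec-style flags; confirm with: tensorrt_rtx --help
# Phase 1 (AOT): compile an ONNX model into a portable engine on any machine.
tensorrt_rtx --onnx=model.onnx --saveEngine=model.engine

# Phase 2 (JIT) + benchmark: specialize and time the engine on the local RTX GPU.
tensorrt_rtx --loadEngine=model.engine
```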

To learn more about TensorRT for RTX's C++ and Python API usage, refer to the NVIDIA TensorRT-RTX project on GitHub.

Previous Releases#

📋 Release 1.3 Highlights
  • Thread-Safe Multi-GPU Execution — Enabled thread-safe execution across multiple GPUs with different compute capabilities, up to one network per thread

  • LLM and Convolution Performance — Improved throughput for LLMs with short prompt lengths, INT8 weight-only quantization for IMatrixMultiplyLayer, and additional kernel-fusion patterns for convolution-based models

  • Blackwell CUDA Graphics Mode — Supports CUDA contexts created in NVIDIA CUDA graphics mode on Blackwell devices

  • FP8 Performance — Improved performance for many FP8 models on Blackwell

  • CUDA 12.9 / 13.1 Support — Compatible with NVIDIA CUDA 12.9 and CUDA 13.1

View 1.3 Release Notes

📋 Release 1.2 Highlights
  • CUDA Graphs Support — Built-in CUDA Graphs with automatic dynamic-shape support, enabling one-line changes that accelerate inference workflows by reducing GPU kernel launch overhead

  • User Memory Allocation — New kREQUIRE_USER_ALLOCATION builder flag and IExecutionContext::isStreamCapturable() API for CUDA stream capture workflows

  • CUDA 13.0 Support — Compatible with the NVIDIA CUDA 13.0 Toolkit

  • Library Reorganization — DLLs moved from the lib subdirectory to bin

View 1.2 Release Notes

📋 Release 1.1 Highlights
  • Engine Validity API — Added the IRuntime::getEngineValidity() API to programmatically check engine-file compatibility without loading the entire file into memory

  • Faster Compilation — Compilation time reduced by an average of 1.5x across a variety of model architectures, particularly for models with many memory-bound kernels

View 1.1 Release Notes

📋 Release 1.0 Highlights
  • Reduced Binary Size — Smaller download size and disk footprint for improved deployment in consumer applications

  • Two-Phase Compilation — Hardware-agnostic ahead-of-time (AOT) and hardware-specific just-in-time (JIT) optimization phases for an improved user experience

  • System Resource Adaptivity — Improved adaptivity to available system resources for background AI features

  • Windows ML Support — Native acceleration support for Windows ML

View 1.0 Release Notes

Note

For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT-RTX GitHub Releases.