NVIDIA TensorRT for RTX Documentation

NVIDIA TensorRT for RTX builds on the proven performance of the NVIDIA TensorRT inference library and simplifies the deployment of AI models on NVIDIA RTX GPUs across desktops, laptops, and workstations. It introduces a Just-In-Time (JIT) optimizer in the runtime that compiles optimized inference engines directly on the end-user’s RTX-accelerated PC.

The two-phase compilation process, ahead-of-time (AOT) followed by JIT, typically completes in under 30 seconds total, eliminating lengthy per-device pre-compilation steps and enabling rapid engine generation with improved application portability. TensorRT for RTX is a compact (under 200 MB) drop-in replacement for NVIDIA TensorRT that targets NVIDIA RTX GPUs from the NVIDIA Turing generation (compute capability 7.5) through the NVIDIA Blackwell generation (compute capability 12.0).
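As a rough illustration of the Turing lower bound mentioned above, an application might gate TensorRT-RTX usage on the device's compute capability. The sketch below is plain Python with no GPU library involved; the helper names `parse_compute_capability` and `meets_minimum` are hypothetical, and it assumes the capability is available as a "major.minor" string. Only the lower bound (7.5) is checked here; the newest supported generation should be taken from the support matrix for the installed release.

```python
from typing import Tuple

# Lower bound stated in the text: NVIDIA Turing, compute capability 7.5.
TURING_CC: Tuple[int, int] = (7, 5)


def parse_compute_capability(text: str) -> Tuple[int, int]:
    """Parse a 'major.minor' string such as '8.9' into (major, minor).

    Hypothetical helper for illustration only.
    """
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def meets_minimum(cc: Tuple[int, int],
                  minimum: Tuple[int, int] = TURING_CC) -> bool:
    """Return True if `cc` is at or above `minimum`.

    Tuples compare element by element, so (12, 0) > (7, 5) as intended.
    Comparing the raw strings would get this wrong: "12.0" < "7.5".
    """
    return tuple(cc) >= tuple(minimum)


if __name__ == "__main__":
    for reported in ("7.0", "7.5", "8.9", "12.0"):
        cc = parse_compute_capability(reported)
        print(reported, "meets Turing minimum:", meets_minimum(cc))
```

Comparing integer tuples rather than strings or floats avoids ordering mistakes once the major version reaches two digits.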

In the following documentation, TensorRT for RTX is referred to as TensorRT-RTX.

Quick Start

  • 🆕 New to NVIDIA TensorRT-RTX? → Deploy Your First Model walks you through building an engine from an ONNX model and running inference in minutes

  • ⬆️ Upgrading from 1.3 or earlier? → Refer to What’s New in 1.4 below

  • 🔧 Need help with a specific task? → Jump to the Inference Library for API walkthroughs, dynamic shapes, CUDA graphs, and more

🆕 What’s New in TensorRT-RTX 1.4

Release Highlights

  • CUDA 13.2 Support - Compatible with NVIDIA CUDA 13.2 Toolkit

  • PyPI Availability - Install TensorRT-RTX Python bindings directly from PyPI with pip install tensorrt-rtx

  • API Capture and Replay (Linux only) - New debugging feature that records TensorRT-RTX API calls during engine building and replays them for issue reproduction without requiring the original application or model source code. Refer to the API Capture and Replay documentation.

  • GPU Latency Optimizations - Improved performance with optimized 1D convolution kernels, optimized GEMV kernels, new Windows backend for batch size = 1 convolutions, improved JIT compilation heuristics, faster JIT compilation, and enhanced multi-head attention (MHA) performance

  • Parallel CUDA Graph Capture - Enabled running multiple inference contexts in parallel with CUDA graph capture using unique streams per context

  • Compute-in-Graphics (CiG) Improvements - Fixed performance issues and segmentation faults on Blackwell GPUs, improved MHA kernel shared memory handling

View 1.4 Release Notes
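After running the pip command from the PyPI Availability item, it can be useful to confirm that the package landed in the active environment. The following is a minimal sketch using only the Python standard library; the distribution name `tensorrt-rtx` comes from the install command above, while the helper name `installed_version` is hypothetical. It queries package metadata only and does not import or exercise the TensorRT-RTX bindings themselves.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "tensorrt-rtx") -> Optional[str]:
    """Return the installed version of `dist_name`, or None if it is absent.

    Hypothetical helper: it reads installed-package metadata and never
    imports the package, so it is safe to run on machines without a GPU.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("tensorrt-rtx is not installed in this environment")
    else:
        print("tensorrt-rtx version:", version)
```

Checking metadata rather than importing the module keeps the check independent of driver or GPU availability, which helps when validating an environment in CI.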

Previous Releases

📋 Release 1.3 Highlights

  • Thread-Safe Multi-GPU Execution - Enabled thread-safe execution for multiple GPUs with different compute capabilities, up to one network per thread

  • LLM and Convolution Performance - Improved throughput for LLMs with short prompt lengths, INT8 weight-only-quantization for IMatrixMultiplyLayers, and additional kernel fusion patterns for convolution-based models

  • Blackwell CUDA Graphics Mode - Supports CUDA contexts created in NVIDIA CUDA graphics mode on Blackwell devices

  • FP8 Performance - Improved performance for many FP8 models on Blackwell

  • CUDA 12.9 / 13.1 Support - Compatible with NVIDIA CUDA 12.9 and CUDA 13.1

View 1.3 Release Notes

📦 Archived Releases

Earlier TensorRT-RTX 1.x releases with key highlights:

  • 1.2 Release Notes - CUDA Graphs Support, User Memory Allocation, CUDA 13.0 Support, Library Reorganization

  • 1.1 Release Notes - Engine Validity API, Faster Compilation

  • 1.0 Release Notes - Reduced Binary Size, Two-Phase Compilation, System Resource Adaptivity, Windows ML Support

Note

For detailed changelogs, refer to the TensorRT-RTX GitHub Releases.