NVIDIA TensorRT Documentation#

NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It imports trained models from frameworks such as PyTorch and TensorFlow, typically via the ONNX interchange format, and optimizes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).

Quick Start#

🆕 What’s New in NVIDIA TensorRT 10.15.1#

Latest Release Highlights

Transformer and LLM Optimizations:

  • KV Cache Reuse API - New KVCacheUpdate API for efficient KV cache reuse in transformer models, significantly improving LLM performance

  • Built-in RoPE Support - Native support for Rotary Position Embedding with new RotaryEmbedding API layer for easier transformer deployment

  • Enhanced Dynamic Quantization - Support for 2D blocks in Dynamic Quantization and ND blocks in Quantize/Dequantize for Sage Attention and per-token quantization
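
The block-quantization items above map onto TensorRT's existing Quantize/Dequantize (Q/DQ) layers. Below is a minimal, hypothetical Python sketch of per-token block quantization; the scale-tensor layout (one FP16 scale per 128-element block per token) and the FP8 target type are illustrative assumptions, not a reference for the new block semantics.

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

# Activations [tokens, hidden]; one scale per 128-element block per token
# (assumed layout: 4096 hidden / 128 block = 32 scales per token).
x = network.add_input("x", trt.float16, (1024, 4096))
scales = network.add_constant(
    (1024, 32), np.ones((1024, 32), dtype=np.float16)).get_output(0)

q = network.add_quantize(x, scales, trt.DataType.FP8)       # high -> FP8
dq = network.add_dequantize(q.get_output(0), scales, trt.float16)
network.mark_output(dq.get_output(0))
```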

Hardware and Performance:

  • Blackwell GPU Support - B200 and B300 GPU support on Windows is now fully production-ready (no longer experimental)

  • Performance Fixes - Resolved multiple regressions on Blackwell GPUs: up to 9% for FLUX FP16, 24% for ResNeXt-50 FP8, 25% for ConvNets with GlobalAveragePool, and 10% for BERT FP16

  • Python API Performance - Fixed an up-to-40% performance regression when calling set_input_shape through the Python bindings (see the sketch after this list)

  • Memory Leak Fix - Resolved host memory leak when building engines on NVIDIA Blackwell GPUs
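
For context on the set_input_shape fix, here is a minimal dynamic-shape sketch of the affected call path. It assumes a prebuilt engine file "model.engine" with a dynamic-batch input named "input" and an output named "output"; the file and tensor names are assumptions.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# The call whose Python-binding overhead was fixed; in a serving loop it
# runs once per request whenever the input shape changes.
context.set_input_shape("input", (8, 3, 224, 224))
print(context.get_tensor_shape("output"))  # shape resolved for this batch
```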

DLA and Parser Enhancements:

  • DLA-Only Mode - New kREPORT_CAPABILITY_DLA ONNX Parser flag for generating engines that run exclusively on DLA without GPU fallback (see the parser sketch after this list)

  • Plugin Override Control - New kENABLE_PLUGIN_OVERRIDE flag for improved handling when TensorRT plugins share names with standard ONNX operators

  • Fused Multi-Head Attention - Multiple pointwise inputs are now supported, and a bug that prevented multiple IAttention layers in a single INetwork has been fixed
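
A hedged sketch of a DLA-only parse and build follows. The Python spellings of the two new parser flags are assumptions inferred from the C++ names kREPORT_CAPABILITY_DLA and kENABLE_PLUGIN_OVERRIDE; the DLA builder-config calls are the long-standing API.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)

# Assumed Python names for the new parser flags:
parser.set_flag(trt.OnnxParserFlag.REPORT_CAPABILITY_DLA)
parser.set_flag(trt.OnnxParserFlag.ENABLE_PLUGIN_OVERRIDE)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0).desc())

config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA  # run layers on DLA...
config.DLA_core = 0
# ...and do NOT set trt.BuilderFlag.GPU_FALLBACK, so no layer may
# silently fall back to the GPU.
engine_bytes = builder.build_serialized_network(network, config)
```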

Samples and Tools:

  • Strongly Typed Networks Sample - New strongly_type_autocast Python sample demonstrating ModelOpt’s AutoCast tool for FP32-to-FP16 mixed precision conversion (sketched below)
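
A rough sketch of the strongly typed path the new sample demonstrates: precision comes from the dtypes recorded in the ONNX model (for example, by ModelOpt's AutoCast) rather than from builder precision flags. The input file name is an assumption.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
parser = trt.OnnxParser(network, logger)
with open("autocast_fp16.onnx", "rb") as f:  # assumed: AutoCast output
    parser.parse(f.read())

config = builder.create_builder_config()
# No trt.BuilderFlag.FP16 here: in a strongly typed network, layer
# precisions follow the network's declared tensor types.
engine = builder.build_serialized_network(network, config)
```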

View 10.15.1 Release Notes

What You’ll Find Here#

  • 🚀 Getting Started - Quick start guide, release notes, and platform support matrix

  • 📦 Installing TensorRT - Installation requirements, prerequisites, and step-by-step setup instructions

  • 🏗️ Architecture - TensorRT design overview, optimization capabilities, and how the inference engine works

  • 🔧 Inference Library - C++ and Python APIs, code samples, and advanced features like quantization and dynamic shapes

  • ⚡ Performance - Best practices for optimization and using trtexec for benchmarking

  • 📚 API - Complete API references for C++, Python, ONNX GraphSurgeon, and Polygraphy tools

  • 📖 Reference - Troubleshooting guides, operator support, command-line tools, and glossary

Previous Releases#

📋 Release 10.14.1 Highlights
  • New GPU Support - Added functionally complete and performant support for NVIDIA GB300, DGX B300, and DGX Spark

  • IAttention API - New fused attention operator API for improved transformer performance, with automatic head padding for better alignment

  • Flexible Output Indices - New APIs for TopK, NMS, and NonZero operations to control the output-indices data type (INT32 or INT64)

  • Partitioned Builder Resources - Architecture-specific builder resources to reduce memory usage during engine build

  • Engine Statistics API - New getEngineStat() API for querying precise weight sizes and engine metrics (see the introspection sketch after this list)

  • Performance Improvements - Fixed an up-to-78% FP8 regression on Blackwell for densenet121, a 55% MHA regression for ViT models, and a 120 MB memory regression for FLUX
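
The new getEngineStat() API is a C++ addition, and its signature is not reproduced here. As a related, pre-existing introspection path, this sketch dumps per-layer engine information with IEngineInspector; the engine file name is an assumption, and detailed layer info requires the engine to have been built with ProfilingVerbosity.DETAILED.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:   # assumed engine file name
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
print("I/O tensors:", engine.num_io_tensors)
```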

View 10.14.1 Release Notes

📋 Release 10.13.3 Highlights
  • API Capture and Replay - New debugging tool that streamlines reproducing and debugging issues within TensorRT applications

  • Python 3.8 and 3.9 Deprecation Notice - Samples no longer fully support Python 3.8 and 3.9; Python 3.10-3.12 is recommended

  • FP4 Build Time Improvements - Fixed significant build-time regressions for FP4-quantized networks on Jetson Thor and Thor GPUs with CUDA 13.0

  • Samples Installation Fix - Resolved build failures in minimal container environments when including cuda_profiler_api.h

View 10.13.3 Release Notes

📋 Release 10.12.0 Highlights
  • MXFP8 Quantization Support - Block quantization across 32 high-precision elements with an E8M0 scaling factor for improved model compression

  • Enhanced Debug Tensor Feature - Mark all unfused tensors as debug tensors without preventing fusion, with support for NumPy, string, and raw data output formats (see the sketch after this list)

  • Distributive Independence Determinism - Guarantees identical outputs across the distributive axis when inputs are identical, improving reproducibility

  • Weak Typing APIs Deprecated - TensorRT is moving to strongly typed networks exclusively; refer to the Strong Typing vs Weak Typing guide for migration steps

  • Refactored Python Samples - New samples with cleaner structure: 1_run_onnx_with_tensorrt and 2_construct_network_with_layer_apis
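
For the debug-tensor item above, here is a minimal sketch of the per-tensor mark_debug API that the 10.12 enhancement builds on; the mark-all-unfused-tensors switch and the output-format options are not spelled out here.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

x = network.add_input("x", trt.float32, (1, 64))
h = network.add_activation(x, trt.ActivationType.RELU)
network.mark_debug(h.get_output(0))  # intermediate tensor, inspectable at
                                     # runtime via a debug listener on the
                                     # execution context
out = network.add_elementwise(h.get_output(0), h.get_output(0),
                              trt.ElementWiseOperation.SUM)
network.mark_output(out.get_output(0))
```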

View 10.12.0 Release Notes

📦 Archived Releases (10.0 - 10.11)

Earlier TensorRT 10.x releases and their key highlights are covered in the Release Notes section.

Note

For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.