NVIDIA TensorRT Documentation#

NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It imports trained models from frameworks such as PyTorch and TensorFlow, typically via the ONNX interchange format, and optimizes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).

Quick Start#

🆕 What’s New in NVIDIA TensorRT 10.15.1#

Latest Release Highlights

Transformer and LLM Optimizations:

  • KV Cache Reuse API - New KVCacheUpdate API for efficient KV cache reuse in transformer models, significantly improving LLM performance

  • Built-in RoPE Support - Native support for Rotary Position Embedding with new RotaryEmbedding API layer for easier transformer deployment

  • Enhanced Dynamic Quantization - Support for 2D blocks in Dynamic Quantization and ND blocks in Quantize/Dequantize for Sage Attention and per-token quantization
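
The block-quantization items above map onto TensorRT's existing Quantize/Dequantize (Q/DQ) layers. Below is a minimal, hypothetical Python sketch of per-token block quantization; the scale-tensor layout (one FP16 scale per 128-element block per token) and the FP8 target type are illustrative assumptions, not a reference for the new block semantics.

```python
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

# Activations [tokens, hidden]; one scale per 128-element block per token
# (assumed layout: 4096 hidden / 128 block = 32 scales per token).
x = network.add_input("x", trt.float16, (1024, 4096))
scales = network.add_constant(
    (1024, 32), np.ones((1024, 32), dtype=np.float16)).get_output(0)

q = network.add_quantize(x, scales, trt.DataType.FP8)       # high -> FP8
dq = network.add_dequantize(q.get_output(0), scales, trt.float16)
network.mark_output(dq.get_output(0))
```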

Hardware and Performance:

  • Blackwell GPU Support - B200 and B300 GPU support on Windows is now fully production-ready (no longer experimental)

  • Performance Fixes - Resolved multiple regressions on Blackwell GPUs: up to 9% for FLUX FP16, 24% for ResNeXt-50 FP8, 25% for ConvNets with GlobalAveragePool, and 10% for BERT FP16

  • Python API Performance - Fixed an up-to-40% performance regression when calling set_input_shape through the Python bindings (see the sketch after this list)

  • Memory Leak Fix - Resolved host memory leak when building engines on NVIDIA Blackwell GPUs
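
For context on the set_input_shape fix, here is a minimal dynamic-shape sketch of the affected call path. It assumes a prebuilt engine file "model.engine" with a dynamic-batch input named "input" and an output named "output"; the file and tensor names are assumptions.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# The call whose Python-binding overhead was fixed; in a serving loop it
# runs once per request whenever the input shape changes.
context.set_input_shape("input", (8, 3, 224, 224))
print(context.get_tensor_shape("output"))  # shape resolved for this batch
```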

DLA and Parser Enhancements:

  • DLA-Only Mode - New kREPORT_CAPABILITY_DLA ONNX Parser flag for generating engines that run exclusively on DLA without GPU fallback (see the parser sketch after this list)

  • Plugin Override Control - New kENABLE_PLUGIN_OVERRIDE flag for improved handling when TensorRT plugins share names with standard ONNX operators

  • Fused Multi-Head Attention - Multiple pointwise inputs are now supported, and a bug that prevented multiple IAttention layers in a single INetwork has been fixed
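
A hedged sketch of a DLA-only parse and build follows. The Python spellings of the two new parser flags are assumptions inferred from the C++ names kREPORT_CAPABILITY_DLA and kENABLE_PLUGIN_OVERRIDE; the DLA builder-config calls are the long-standing API.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)

# Assumed Python names for the new parser flags:
parser.set_flag(trt.OnnxParserFlag.REPORT_CAPABILITY_DLA)
parser.set_flag(trt.OnnxParserFlag.ENABLE_PLUGIN_OVERRIDE)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0).desc())

config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA  # run layers on DLA...
config.DLA_core = 0
# ...and do NOT set trt.BuilderFlag.GPU_FALLBACK, so no layer may
# silently fall back to the GPU.
engine_bytes = builder.build_serialized_network(network, config)
```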

Samples and Tools:

  • Strongly Typed Networks Sample - New strongly_type_autocast Python sample demonstrating ModelOpt’s AutoCast tool for FP32-to-FP16 mixed precision conversion (sketched below)
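
A rough sketch of the strongly typed path the new sample demonstrates: precision comes from the dtypes recorded in the ONNX model (for example, by ModelOpt's AutoCast) rather than from builder precision flags. The input file name is an assumption.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
parser = trt.OnnxParser(network, logger)
with open("autocast_fp16.onnx", "rb") as f:  # assumed: AutoCast output
    parser.parse(f.read())

config = builder.create_builder_config()
# No trt.BuilderFlag.FP16 here: in a strongly typed network, layer
# precisions follow the network's declared tensor types.
engine = builder.build_serialized_network(network, config)
```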

View 10.15.1 Release Notes

What You’ll Find Here#

  • 🚀 Getting Started - Quick start guide, release notes, and platform support matrix

  • 📦 Installing TensorRT - Installation requirements, prerequisites, and step-by-step setup instructions

  • 🏗️ Architecture - TensorRT design overview, optimization capabilities, and how the inference engine works

  • 🔧 Inference Library - C++ and Python APIs, code samples, and advanced features like quantization and dynamic shapes

  • ⚡ Performance - Best practices for optimization and using trtexec for benchmarking

  • 📚 API - Complete API references for C++, Python, ONNX GraphSurgeon, and Polygraphy tools

  • 📖 Reference - Troubleshooting guides, operator support, command-line tools, and glossary

Previous Releases#

📋 Release 10.14.1 Highlights
  • New GPU Support - Added functionally complete and performant support for NVIDIA GB300, DGX B300, and DGX Spark

  • IAttention API - New fused attention operator API for improved transformer performance, with automatic head padding for better alignment

  • Flexible Output Indices - New APIs for TopK, NMS, and NonZero operations to control the output-indices data type (INT32 or INT64)

  • Partitioned Builder Resources - Architecture-specific builder resources to reduce memory usage during engine build

  • Engine Statistics API - New getEngineStat() API for querying precise weight sizes and engine metrics (see the introspection sketch after this list)

  • Performance Improvements - Fixed an up-to-78% FP8 regression on Blackwell for densenet121, a 55% MHA regression for ViT models, and a 120 MB memory regression for FLUX
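
The new getEngineStat() API is a C++ addition, and its signature is not reproduced here. As a related, pre-existing introspection path, this sketch dumps per-layer engine information with IEngineInspector; the engine file name is an assumption, and detailed layer info requires the engine to have been built with ProfilingVerbosity.DETAILED.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:   # assumed engine file name
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
print("I/O tensors:", engine.num_io_tensors)
```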

View 10.14.1 Release Notes

📋 Release 10.13.3 Highlights
  • API Capture and Replay - New debugging tool that streamlines reproducing and debugging issues within TensorRT applications

  • Python 3.8 and 3.9 Deprecation Notice - Samples no longer fully support Python 3.8 and 3.9; Python 3.10-3.12 is recommended

  • FP4 Build Time Improvements - Fixed significant build-time regressions for FP4-quantized networks on Jetson Thor and Thor GPUs with CUDA 13.0

  • Samples Installation Fix - Resolved build failures in minimal container environments when including cuda_profiler_api.h

View 10.13.3 Release Notes

📋 Release 10.12.0 Highlights
  • MXFP8 Quantization Support - Block quantization across 32 high-precision elements with an E8M0 scaling factor for improved model compression

  • Enhanced Debug Tensor Feature - Mark all unfused tensors as debug tensors without preventing fusion, with support for NumPy, string, and raw data output formats (see the sketch after this list)

  • Distributive Independence Determinism - Guarantees identical outputs across the distributive axis when inputs are identical, improving reproducibility

  • Weak Typing APIs Deprecated - TensorRT is moving to strongly typed networks exclusively; refer to the Strong Typing vs Weak Typing guide for migration steps

  • Refactored Python Samples - New samples with cleaner structure: 1_run_onnx_with_tensorrt and 2_construct_network_with_layer_apis
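
For the debug-tensor item above, here is a minimal sketch of the per-tensor mark_debug API that the 10.12 enhancement builds on; the mark-all-unfused-tensors switch and the output-format options are not spelled out here.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

x = network.add_input("x", trt.float32, (1, 64))
h = network.add_activation(x, trt.ActivationType.RELU)
network.mark_debug(h.get_output(0))  # intermediate tensor, inspectable at
                                     # runtime via a debug listener on the
                                     # execution context
out = network.add_elementwise(h.get_output(0), h.get_output(0),
                              trt.ElementWiseOperation.SUM)
network.mark_output(out.get_output(0))
```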

View 10.12.0 Release Notes

📦 Archived Releases (10.0 - 10.11)

Earlier TensorRT 10.x releases and their key highlights are covered in the Release Notes section.

Note

For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.