NVIDIA TensorRT Documentation#
NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It takes trained models from frameworks such as PyTorch and TensorFlow, typically imported via ONNX, and optimizes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).
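The typical workflow is to parse a trained ONNX model, choose build options such as precision, and serialize an optimized engine for deployment. Below is a minimal sketch of that flow using TensorRT's Python API; the model path and the FP16 flag are illustrative assumptions, not requirements.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # TensorRT 10 networks are explicit-batch
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a placeholder path for an exported model
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # permit FP16 kernels where beneficial

# Build and serialize the optimized engine for later deployment
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```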
Quick Start#
🆕 New to NVIDIA TensorRT? → Start with the Quick Start Guide to build and deploy your first optimized inference engine in 30–60 minutes
⬆️ Upgrading from 10.14 or earlier? → See What’s New in 10.15.1 below
🔧 Need help with a specific task? → Jump to the Installing TensorRT or Troubleshooting section
🆕 What’s New in NVIDIA TensorRT 10.15.1#
Latest Release Highlights
Transformer and LLM Optimizations:
KV Cache Reuse API - New KVCacheUpdate API for efficient KV cache reuse in transformer models, significantly improving LLM performance
Built-in RoPE Support - Native support for Rotary Position Embedding with a new RotaryEmbedding API layer for easier transformer deployment
Enhanced Dynamic Quantization - Support for 2D blocks in Dynamic Quantization and ND blocks in Quantize/Dequantize for Sage Attention and per-token quantization
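To make the block-quantization terminology concrete, here is a small NumPy sketch of symmetric quantization with one scale per 2D block; this illustrates the concept only and is not the TensorRT API. Per-token quantization is the special case where each block is a single row (one token's activations).

```python
import numpy as np

def quantize_blocks(x, block_shape, qmax=127):
    """Symmetric INT8-style quantization with one scale per 2D block."""
    rows, cols = x.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0
    # View the matrix as a grid of (br, bc) blocks
    blocks = x.reshape(rows // br, br, cols // bc, bc)
    # One scale per block, derived from that block's max magnitude
    scales = np.abs(blocks).max(axis=(1, 3), keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax, qmax)
    return q.astype(np.int8), scales

x = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_blocks(x, block_shape=(1, 64))  # per-token: one scale per row
dequant = (q.astype(np.float32) * s).reshape(x.shape)
```

Smaller blocks track local dynamic range more closely at the cost of storing more scales, which is the trade-off that finer-grained block shapes expose.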
Hardware and Performance:
Blackwell GPU Support - B200 and B300 GPU support on Windows is now fully production-ready (no longer experimental)
Performance Fixes - Resolved multiple regressions on Blackwell GPUs: up to 9% for FLUX FP16, 24% for ResNext-50 FP8, 25% for ConvNets with GlobalAveragePool, and 10% for BERT FP16
Python API Performance - Fixed up to 40% performance regression with set_input_shape from the Python binding (see the dynamic-shapes sketch after this list)
Memory Leak Fix - Resolved host memory leak when building engines on NVIDIA Blackwell GPUs
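For context on the set_input_shape fix above, this is a hedged sketch of the usual dynamic-shapes flow: an optimization profile declares the supported shape range at build time, and set_input_shape selects the concrete shape at run time. The tensor name "input" and the shapes are illustrative assumptions.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
# ... populate the network (e.g., via the ONNX parser) ...

config = builder.create_builder_config()
profile = builder.create_optimization_profile()
profile.set_shape("input",            # hypothetical input tensor name
                  (1, 3, 224, 224),   # min shape
                  (8, 3, 224, 224),   # opt shape, tuned for best performance
                  (32, 3, 224, 224))  # max shape
config.add_optimization_profile(profile)
# engine_bytes = builder.build_serialized_network(network, config)

# At run time, pick the concrete shape before running inference:
# context = engine.create_execution_context()
# context.set_input_shape("input", (8, 3, 224, 224))
```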
DLA and Parser Enhancements:
DLA-Only Mode - New kREPORT_CAPABILITY_DLA ONNX Parser flag for generating engines that run exclusively on DLA without GPU fallback (a DLA build-configuration sketch follows this list)
Plugin Override Control - New kENABLE_PLUGIN_OVERRIDE flag for improved handling when TensorRT plugins share names with standard ONNX operators
Fused Multi-Head Attention - Multiple pointwise inputs are now supported, and a bug preventing multiple IAttention layers in an INetwork has been fixed
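As referenced in the DLA-Only Mode item, targeting DLA at build time has long gone through the builder config; a hedged sketch using those existing APIs follows. The new kREPORT_CAPABILITY_DLA parser flag described above is separate and not shown here.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

config.default_device_type = trt.DeviceType.DLA  # prefer DLA for every layer
config.DLA_core = 0                              # choose a specific DLA core

# With GPU fallback enabled, layers DLA cannot run fall back to the GPU;
# omit this flag when the goal is an engine that runs exclusively on DLA.
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
```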
Samples and Tools:
Strongly Typed Networks Sample - New strongly_type_autocast Python sample demonstrating ModelOpt's AutoCast tool for FP32-to-FP16 mixed precision conversion
What You’ll Find Here#
🚀 Getting Started - Quick start guide, release notes, and platform support matrix
📦 Installing TensorRT - Installation requirements, prerequisites, and step-by-step setup instructions
🏗️ Architecture - TensorRT design overview, optimization capabilities, and how the inference engine works
🔧 Inference Library - C++ and Python APIs, code samples, and advanced features like quantization and dynamic shapes
⚡ Performance - Best practices for optimization and using trtexec for benchmarking
📚 API - Complete API references for C++, Python, ONNX GraphSurgeon, and Polygraphy tools
📖 Reference - Troubleshooting guides, operator support, command-line tools, and glossary
Previous Releases#
📋 Release 10.14.1 Highlights
New GPU Support - Added support for NVIDIA GB300, DGX B300, and DGX Spark with functionally complete and performant drivers
IAttention API - New fused attention operator API for improved transformer model performance with automatic head padding for better alignment
Flexible Output Indices - New APIs for TopK, NMS, and NonZero operations to control the output indices data type (INT32 or INT64)
Partitioned Builder Resources - Architecture-specific builder resources to reduce memory usage during engine build
Engine Statistics API - New getEngineStat() API for querying precise weight sizes and engine metrics
Performance Improvements - Fixed up to 78% FP8 regression on Blackwell for densenet121, 55% MHA regression for ViT models, and 120 MB memory regression for FLUX
📋 Release 10.13.3 Highlights
API Capture and Replay - New debugging tool that streamlines reproducing and debugging issues within TensorRT applications
Python 3.8 and 3.9 Deprecation Notice - Samples no longer fully support Python 3.8 and 3.9; Python 3.10–3.12 is recommended
FP4 Build Time Improvements - Fixed significant build time regressions for FP4 quantized networks on Thor Jetson and Thor GPUs with CUDA 13.0
Samples Installation Fix - Resolved build failures in minimal container environments when including cuda_profiler_api.h
📋 Release 10.12.0 Highlights
MXFP8 Quantization Support - Block quantization across 32 high-precision elements with E8M0 scaling factor for improved model compression
Enhanced Debug Tensor Feature - Mark all unfused tensors as debug tensors without preventing fusion, with support for NumPy, string, and raw data formats
Distributive Independence Determinism - Guarantees identical outputs across the distributive axis when inputs are identical, improving reproducibility
Weak Typing APIs Deprecated - Migration to strong typing exclusively; refer to the Strong Typing vs Weak Typing guide for migration (a minimal strongly typed network setup is sketched after this list)
Refactored Python Samples - New samples with a cleaner structure: 1_run_onnx_with_tensorrt and 2_construct_network_with_layer_apis
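For the strong-typing migration mentioned above, here is a minimal sketch of creating a strongly typed network, where tensor data types come from the network definition itself rather than from weak-typing builder flags:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# A STRONGLY_TYPED network takes precision from the declared tensor types,
# so weak-typing flags such as BuilderFlag.FP16 no longer apply to it.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
)
```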
📦 Archived Releases (10.0 - 10.11)
Earlier TensorRT 10.x releases with key highlights:
10.13.2 Release Notes - CUDA 13.0 support, FP8 convolution improvements, JetPack/SBSA consolidation
10.13.0 Release Notes - Custom weights loading APIs, enhanced MHA on Blackwell, NVFP4 fusions
10.11.0 Release Notes - Condition-dependent shapes, large tensor support, static libraries deprecation
10.10.0 Release Notes - Enhanced large tensor handling, Blackwell GPU performance improvements
10.9.0 Release Notes - Same compute capability compatibility, AOT compilable Python plugins
10.8.0 Release Notes - Blackwell GPU support, E2M1 FP4 data type, tiling optimization
10.7.0 Release Notes - Nsight Deep Learning Designer support, engine deserialization API
10.6.0 Release Notes - Quickly Deployable Plugins (QDPs), FP8 MHA on Ada GPUs
10.5.0 Release Notes - Linux SBSA Python wheels, Volta support removed
10.4.0 Release Notes - Ubuntu 24.04 support, LLM build time improvements
10.3.0 Release Notes - Cross-platform engine support (experimental), FP8 convolution on Ada
10.2.0 Release Notes - FP8 convolution support, fine-grained refit control
10.1.0 Release Notes - Advanced weight streaming APIs, enhanced device memory management
10.0.1 Release Notes - Weight streaming, INT4 weight-only quantization, IPluginV3 framework
10.0.0 Early Access Release Notes - Initial TensorRT 10.x preview release
Legacy Versions:
Note: For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.