NVIDIA TensorRT Documentation
NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It takes trained models from frameworks such as PyTorch and TensorFlow, or models exported to ONNX, and tunes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).
Quick Start
New to NVIDIA TensorRT? Start with the Quick Start Guide to build and deploy your first optimized inference engine in 30–60 minutes
Upgrading from 10.16 or earlier? Refer to What's New in 11.0.0 below
Need help with a specific task? Jump to the Inference Library for API walkthroughs, dynamic shapes, quantization, and more, or the Troubleshooting section
What's New in NVIDIA TensorRT 11.0.0
Latest Release Highlights
Strongly typed networks are now the default: Weak-typing APIs (setPrecision, setDynamicRange, the per-precision BuilderFlag family) and implicit quantization (IInt8Calibrator) have been removed. Use the NVIDIA TensorRT Migration Guide to plan your upgrade.
IPluginV2 has been removed: The entire IPluginV2 family is gone; migrate custom plugins to IPluginV3 with addPluginV3(). See the V2 → V3 walkthrough for a side-by-side API mapping.
Multi-Device Inference is generally available: Preview flag retired, plus new AllToAll, Gather, and Scatter collective ops, automatic NCCL library fallback, and a new context-parallel attention sample. Refer to Multi-Device Inference.
Ragged batching for attention: IAttention and IKVCacheUpdateLayer now support packed (kPACKED_NHD) layouts so variable-length sequences can be concatenated end-to-end without padding. Refer to Fused Attention.
MoE inference performance: Significant Blackwell (SM10x/SM110) backend improvements close the gap to specialized external MoE kernels; the previous "keep seqLen ≤ 16" guidance no longer applies. Refer to MoE (Mixture of Experts).
Rewritten Best Practices and Benchmarking guide: Reframed as a measure-then-optimize loop with side-by-side ONNX-TRT (trtexec) and Torch-TRT workflows in synchronized tabs covering quantization, dynamic shapes, CUDA graphs, profiling, and Nsight Systems timeline reading. Refer to Performance Benchmarking.
Platform updates: RHEL 10 / Rocky Linux 10 RPM and tar packages, and a new TensorRT 10.x to 11.x migration path with dedicated DriveOS and Jetson/JetPack chapters.
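The ragged batching item above describes concatenating variable-length sequences end-to-end instead of padding them to a common length. The following is a conceptual sketch of that idea in plain Python, not TensorRT code: the function names (`pack_sequences`, `pad_sequences`, `unpack_sequence`) and the cumulative-offset list `cu_seqlens` are illustrative assumptions, and the exact tensor layout and metadata that `kPACKED_NHD` expects are not specified here.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate variable-length sequences end-to-end and record offsets."""
    lengths = [len(seq) for seq in sequences]
    # cu_seqlens[i] is where sequence i starts in the packed buffer;
    # the final entry is the total number of tokens.
    cu_seqlens = [0, *accumulate(lengths)]
    packed = [token for seq in sequences for token in seq]
    return packed, cu_seqlens


def pad_sequences(sequences, pad_value=0):
    """Padded alternative: every row is extended to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


def unpack_sequence(packed, cu_seqlens, index):
    """Recover one sequence from the packed buffer using its offsets."""
    return packed[cu_seqlens[index]:cu_seqlens[index + 1]]


# Hypothetical batch of three token sequences with lengths 3, 1, and 5.
batch = [[11, 12, 13], [21], [31, 32, 33, 34, 35]]

packed, cu_seqlens = pack_sequences(batch)
padded = pad_sequences(batch)
padded_slots = sum(len(row) for row in padded)

print("packed tokens:", len(packed))
print("padded slots: ", padded_slots)
print("offsets:      ", cu_seqlens)
```

In this toy batch the packed buffer holds only the real tokens, while the padded form also allocates slots for filler values that attention would need to mask out. The offsets list is what lets a consumer find each sequence's boundaries inside the packed buffer.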
Previous Releases
Release 10.16.1 Highlights
TensorRT 11.0.0 Coming Soon: New capabilities for PyTorch/Hugging Face integration, modernized APIs, and removal of legacy weakly-typed APIs. Migrate early to Strongly Typed Networks, Explicit Quantization, and IPluginV3
JetPack Support for Orin iGPUs: Orin iGPU support via the ARM SBSA build, available as an early-access download ahead of JetPack 7.x
Safety Headers Included: Functional safety headers for ISO 26262-compliant applications are now included in all standard TensorRT packages
Interactive Sample Explorer: Browse all TensorRT samples by difficulty, language, or use case
Interactive Support Matrix: Filterable support matrix with three explorers for system requirements, hardware capabilities, and feature support
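The migration advice above names Explicit Quantization, in which quantize and dequantize steps with known scales appear in the model itself rather than being derived by a calibrator at build time. As a rough illustration of the arithmetic such a quantize/dequantize pair performs, here is a minimal sketch in plain Python. It assumes symmetric per-tensor INT8 quantization; the helper names and the scale value are hypothetical, and this is not the TensorRT API.

```python
def quantize_int8(values, scale):
    """Symmetric INT8 quantization: round(x / scale), clamped to [-128, 127].

    Python's round() rounds halves to the nearest even integer.
    """
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize_int8(quantized, scale):
    """Map INT8 codes back to real values by multiplying with the scale."""
    return [q * scale for q in quantized]


# Hypothetical activations and scale, chosen so the arithmetic is exact.
activations = [0.5, -1.0, 0.25, 3.0]
scale = 1.0 / 64

codes = quantize_int8(activations, scale)
restored = dequantize_int8(codes, scale)

print("INT8 codes:", codes)
print("restored:  ", restored)
```

The last activation falls outside the representable range for this scale, so it saturates at the largest INT8 code and comes back smaller than the input. Choosing the scale explicitly is what makes that trade-off visible in the model rather than hidden in a calibration step.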
Archived Releases
Earlier TensorRT 10.x releases with key highlights:
10.16.0 Release Notes - Multi-Device Inference (Preview), MoE (Mixture of Experts), Interactive Sample Explorer, Interactive Support Matrix, API Capture and Replay Multi-Network Support, Internal Library Path API, Breaking ABI Changes
10.15.1 Release Notes - KV Cache Reuse API, built-in RoPE support, Blackwell Windows production-ready, DLA-only mode
10.14.1 Release Notes - GB300/DGX B300/DGX Spark support, IAttention API, flexible output indices, partitioned builder resources, engine statistics API
10.13.3 Release Notes - API Capture and Replay, Python 3.8 and 3.9 deprecation notice, FP4 build time improvements, samples installation fix
10.13.2 Release Notes - CUDA 13.0 support, FP8 convolution improvements, JetPack/SBSA consolidation
10.13.0 Release Notes - Custom weights loading APIs, enhanced MHA on Blackwell, NVFP4 fusions
10.12.0 Release Notes - MXFP8 quantization support, enhanced debug tensor feature, distributive independence determinism
10.11.0 Release Notes - Condition-dependent shapes, large tensor support, static libraries deprecation
10.10.0 Release Notes - Enhanced large tensor handling, Blackwell GPU performance improvements
10.9.0 Release Notes - Same compute capability compatibility, AOT compilable Python plugins
10.8.0 Release Notes - Blackwell GPU support, E2M1 FP4 data type, tiling optimization
10.7.0 Release Notes - Nsight Deep Learning Designer support, engine deserialization API
10.6.0 Release Notes - Quickly Deployable Plugins (QDPs), FP8 MHA on Ada GPUs
10.5.0 Release Notes - Linux SBSA Python wheels, Volta support removed
10.4.0 Release Notes - Ubuntu 24.04 support, LLM build time improvements
10.3.0 Release Notes - Cross-platform engine support (experimental), FP8 convolution on Ada
10.2.0 Release Notes - FP8 convolution support, fine-grained refit control
10.1.0 Release Notes - Advanced weight streaming APIs, enhanced device memory management
10.0.1 Release Notes - Weight streaming, INT4 weight-only quantization, IPluginV3 framework
10.0.0 Early Access Release Notes - Initial TensorRT 10.x preview release
Legacy Versions
Note
For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.