NVIDIA TensorRT Documentation
NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It takes models trained in frameworks such as PyTorch and TensorFlow, typically exported through ONNX, and optimizes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).
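As a rough illustration of this workflow, a minimal sketch of building an optimized engine from an ONNX model with the TensorRT Python API might look like the following. This assumes the tensorrt package is installed, a CUDA-capable GPU is available, and a file named model.onnx exists (a hypothetical placeholder); exact API details vary between TensorRT versions.

```python
import tensorrt as trt

# Create a logger and builder; the builder drives engine construction.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Create an empty network and populate it by parsing an ONNX model.
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # hypothetical model file
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

# Configure the build: allow FP16 kernels where they are faster.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Build and serialize the optimized engine, then save it for deployment.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```

At inference time, the saved engine is deserialized with a trt.Runtime and executed through an execution context; the Quick Start Guide walks through that full workflow.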
Quick Start
New to NVIDIA TensorRT? → Start with the Quick Start Guide to build and deploy your first optimized inference engine in 30–60 minutes
Upgrading from 10.16.0 or earlier? → Refer to What's New in 10.16.1 below
Need help with a specific task? → Jump to the Installing TensorRT or Troubleshooting section
What's New in NVIDIA TensorRT 10.16.1
Latest Release Highlights
TensorRT 11.0 Coming Soon – New capabilities for PyTorch/Hugging Face integration, modernized APIs, removal of legacy weakly-typed APIs. Migrate early to Strongly Typed Networks, Explicit Quantization, and IPluginV3
JetPack Support for Orin iGPUs – Orin iGPU support via the ARM SBSA build, available as an early-access download ahead of JetPack 7.x
Safety Headers Included – Functional safety headers for ISO 26262-compliant applications are now included in all standard TensorRT packages
Interactive Sample Explorer – Browse all TensorRT samples by difficulty, language, or use case
Interactive Support Matrix – Filterable support matrix with three explorers for system requirements, hardware capabilities, and feature support
Previous Releases
Release 10.16.0 Highlights
Multi-Device Inference (Preview) – Scale inference across multiple GPUs with IDistCollectiveLayer and multi-device attention via NCCL
MoE (Mixture of Experts) – Built-in IMoELayer for transformer MoE blocks on SM110 with NVFP4/FP8 quantization
Interactive Sample Explorer – Browse all TensorRT samples by difficulty, language, or use case
Interactive Support Matrix – Filterable support matrix with three explorers for system requirements, hardware capabilities, and feature support; contains all 10.x releases
API Capture and Replay Multi-Network Support – Capture and replay multiple networks within a single process for ensemble models and multi-stage inference pipelines
Internal Library Path API – New setInternalLibraryPath API for custom builder resource locations
Breaking ABI Changes – Windows DLL files moved from the lib/ to the bin/ subdirectory; libonnx_proto.a merged into libnvonnxparser_static.a
Release 10.15.1 Highlights
KV Cache Reuse API - New KVCacheUpdate API for efficient KV cache reuse in transformer models, significantly improving LLM performance
Built-in RoPE Support - Native support for Rotary Position Embedding with a new RotaryEmbedding API layer for easier transformer deployment
Blackwell GPU Support - B200 and B300 GPU support on Windows is now fully production-ready (no longer experimental)
DLA-Only Mode - New kREPORT_CAPABILITY_DLA ONNX Parser flag for generating engines that run exclusively on DLA without GPU fallback
Performance Fixes - Resolved multiple regressions on Blackwell GPUs: up to 9% for FLUX FP16, 24% for ResNeXt-50 FP8, 25% for ConvNets with GlobalAveragePool, and 10% for BERT FP16
Release 10.14.1 Highlights
New GPU Support - Added support for NVIDIA GB300, DGX B300, and DGX Spark with functionally complete and performant drivers
IAttention API - New fused attention operator API for improved transformer model performance with automatic head padding for better alignment
Flexible Output Indices - New APIs for TopK, NMS, and NonZero operations to control the output indices data type (INT32 or INT64)
Partitioned Builder Resources - Architecture-specific builder resources to reduce memory usage during engine build
Engine Statistics API - New getEngineStat() API for querying precise weight sizes and engine metrics
Performance Improvements - Fixed up to 78% FP8 regression on Blackwell for densenet121, a 55% MHA regression for ViT models, and a 120 MB memory regression for FLUX
Archived Releases (10.0–10.13)
Earlier TensorRT 10.x releases with key highlights:
10.13.3 Release Notes - API Capture and Replay, Python 3.8 and 3.9 deprecation notice, FP4 build time improvements, samples installation fix
10.13.2 Release Notes - CUDA 13.0 support, FP8 convolution improvements, JetPack/SBSA consolidation
10.13.0 Release Notes - Custom weights loading APIs, enhanced MHA on Blackwell, NVFP4 fusions
10.12.0 Release Notes - MXFP8 quantization support, enhanced debug tensor feature, distributive independence determinism
10.11.0 Release Notes - Condition-dependent shapes, large tensor support, static libraries deprecation
10.10.0 Release Notes - Enhanced large tensor handling, Blackwell GPU performance improvements
10.9.0 Release Notes - Same compute capability compatibility, AOT compilable Python plugins
10.8.0 Release Notes - Blackwell GPU support, E2M1 FP4 data type, tiling optimization
10.7.0 Release Notes - Nsight Deep Learning Designer support, engine deserialization API
10.6.0 Release Notes - Quickly Deployable Plugins (QDPs), FP8 MHA on Ada GPUs
10.5.0 Release Notes - Linux SBSA Python wheels, Volta support removed
10.4.0 Release Notes - Ubuntu 24.04 support, LLM build time improvements
10.3.0 Release Notes - Cross-platform engine support (experimental), FP8 convolution on Ada
10.2.0 Release Notes - FP8 convolution support, fine-grained refit control
10.1.0 Release Notes - Advanced weight streaming APIs, enhanced device memory management
10.0.1 Release Notes - Weight streaming, INT4 weight-only quantization, IPluginV3 framework
10.0.0 Early Access Release Notes - Initial TensorRT 10.x preview release
Legacy Versions
Note
For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.