NVIDIA TensorRT Documentation

NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It takes trained models from frameworks such as PyTorch and intermediate formats such as ONNX, and optimizes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).

Quick Start
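The sketch below is a minimal, hedged example of the workflow described above, using the TensorRT Python API: parse an ONNX model, build a serialized engine with FP16 enabled, and deserialize it for inference. The file name model.onnx is a placeholder, and the model is assumed to have static input shapes (dynamic shapes are sketched later on this page).

```python
# A minimal sketch of the ONNX-to-engine workflow with the TensorRT 10.x
# Python API. "model.onnx" is a placeholder; error handling is trimmed.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# 1. Parse the ONNX model into a TensorRT network definition.
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the only mode in 10.x
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

# 2. Build a serialized engine, allowing FP16 kernels where they help.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
serialized_engine = builder.build_serialized_network(network, config)

# 3. Deserialize the engine and create an execution context for inference.
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(serialized_engine)
context = engine.create_execution_context()
```

Running the engine additionally requires device buffers (for example via the cuda-python or pycuda packages), binding them with context.set_tensor_address, and launching with context.execute_async_v3(stream_handle).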

🆕 What's New in NVIDIA TensorRT 10.16.0

Latest Release Highlights

  • TensorRT 11.0 Coming Soon - New capabilities for PyTorch/Hugging Face integration, modernized APIs, and removal of legacy weakly-typed APIs

  • Multi-Device Inference (Preview) - Scale inference across multiple GPUs with IDistCollectiveLayer and multi-device attention via NCCL

  • MoE (Mixture of Experts) - Built-in IMoELayer for transformer MoE blocks on SM110 with NVFP4/FP8 quantization

  • Interactive Sample Explorer - Browse all TensorRT samples by difficulty, language, or use case

  • Interactive Support Matrix - Filterable support matrix with three explorers for system requirements, hardware capabilities, and feature support; covers all 10.x releases

  • API Capture and Replay Multi-Network Support - Capture and replay multiple networks within a single process for ensemble models and multi-stage inference pipelines

  • Internal Library Path API - New setInternalLibraryPath API for custom builder resource locations

  • Breaking ABI Changes - Windows DLL files moved from lib/ to bin/ subdirectory; libonnx_proto.a merged into libnvonnxparser_static.a

View 10.16.0 Release Notes

What You'll Find Here

  • 🚀 Getting Started - Quick start guide, release notes, and platform support matrix

  • 📦 Installing TensorRT - Installation requirements, prerequisites, and step-by-step setup instructions

  • 🏗️ Architecture - TensorRT design overview, optimization capabilities, and how the inference engine works

  • 🔧 Inference Library - C++ and Python APIs, interactive sample explorer, and advanced features like quantization and dynamic shapes (a dynamic-shapes sketch follows this list)

  • ⚡ Performance - Best practices for optimization and using trtexec for benchmarking

  • 📚 API - Complete API references for C++, Python, ONNX GraphSurgeon, and Polygraphy tools

  • 📖 Reference - Troubleshooting guides, operator support, command-line tools, and glossary
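As a companion to the dynamic-shapes feature mentioned above, here is a hedged sketch of the optimization-profile API. The input name "input" and the (batch, 3, 224, 224) shapes are illustrative; the network is assumed to declare its batch dimension as -1.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# The network (parsed or built elsewhere) must declare the dynamic dimension
# as -1, e.g. an input named "input" with shape (-1, 3, 224, 224).
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  (1, 3, 224, 224),   # min: smallest shape the engine accepts
                  (8, 3, 224, 224),   # opt: shape the builder tunes for
                  (32, 3, 224, 224))  # max: largest shape the engine accepts
config.add_optimization_profile(profile)

# At runtime, fix a concrete shape within [min, max] before execution:
#   context.set_input_shape("input", (4, 3, 224, 224))
```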

Previous Releases

📋 Release 10.15.1 Highlights
  • KV Cache Reuse API - New KVCacheUpdate API for efficient KV cache reuse in transformer models, significantly improving LLM performance

  • Built-in RoPE Support - Native support for Rotary Position Embedding with new RotaryEmbedding API layer for easier transformer deployment

  • Blackwell GPU Support - B200 and B300 GPU support on Windows is now fully production-ready (no longer experimental)

  • DLA-Only Mode - New kREPORT_CAPABILITY_DLA ONNX Parser flag for generating engines that run exclusively on DLA without GPU fallback (see the hedged sketch after this list)

  • Performance Fixes - Resolved multiple regressions on Blackwell GPUs: up to 9% for FLUX FP16, 24% for ResNeXt-50 FP8, 25% for ConvNets with GlobalAveragePool, and 10% for BERT FP16
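The kREPORT_CAPABILITY_DLA flag above is new in this release, so its exact usage is best taken from the release notes; the long-standing builder-config controls for steering an engine onto DLA are sketched below (the core index and fallback choice are illustrative).

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Prefer DLA for every layer and pin the engine to DLA core 0.
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0

# Leaving GPU fallback disabled keeps the engine DLA-only; uncommenting the
# flag below would instead let unsupported layers fall back to the GPU.
# config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
```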

View 10.15.1 Release Notes

📋 Release 10.14.1 Highlights
  • New GPU Support - Added support for NVIDIA GB300, DGX B300, and DGX Spark with functionally complete and performant drivers

  • IAttention API - New fused attention operator API for improved transformer model performance with automatic head padding for better alignment

  • Flexible Output Indices - New APIs for Topk, NMS, and NonZero operations to control output indices data type (INT32 or INT64)

  • Partitioned Builder Resources - Architecture-specific builder resources to reduce memory usage during engine build

  • Engine Statistics API - New getEngineStat() API for querying precise weight sizes and engine metrics

  • Performance Improvements - Fixed an FP8 regression of up to 78% on Blackwell for densenet121, a 55% MHA regression for ViT models, and a 120 MB memory regression for FLUX

View 10.14.1 Release Notes

📋 Release 10.13.3 Highlights
  • API Capture and Replay - New debugging tool that streamlines reproducing and debugging issues within TensorRT applications

  • Python 3.8 and 3.9 Deprecation Notice - Some samples no longer support Python 3.8 and 3.9; Python 3.10-3.12 is recommended

  • FP4 Build Time Improvements - Fixed significant build time regressions for FP4 quantized networks on Thor Jetson and Thor GPUs with CUDA 13.0

  • Samples Installation Fix - Resolved build failures in minimal container environments when including cuda_profiler_api.h

View 10.13.3 Release Notes

📋 Release 10.12.0 Highlights
  • MXFP8 Quantization Support - Block quantization across 32 high-precision elements with an E8M0 scaling factor for improved model compression

  • Enhanced Debug Tensor Feature - Mark all unfused tensors as debug tensors without preventing fusion, with support for NumPy, string, and raw data formats

  • Distributive Independence Determinism - Guarantee identical outputs across the distributive axis when inputs are identical, improving reproducibility

  • Weak Typing APIs Deprecated - TensorRT is migrating to strong typing exclusively; refer to the Strong Typing vs Weak Typing guide (a migration sketch follows this list)

  • Refactored Python Samples - New samples with cleaner structure: 1_run_onnx_with_tensorrt and 2_construct_network_with_layer_apis
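To accompany the weak-typing deprecation above, a minimal migration sketch: strongly typed networks are requested with a network-creation flag, after which precision follows the data types in the network itself rather than per-precision builder flags.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Weakly typed (deprecated): the builder chooses precisions, steered by
# config flags such as trt.BuilderFlag.FP16.
#   network = builder.create_network(0)

# Strongly typed: precision follows the data types expressed in the network
# (e.g. by the ONNX model), so per-precision builder flags no longer apply.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
)
```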

View 10.12.0 Release Notes

📦 Archived Releases (10.0 - 10.11)

Earlier TensorRT 10.x releases with key highlights:

📖 Legacy Versions

Note

For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.