NVIDIA TensorRT Documentation#

NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It takes trained models from frameworks such as PyTorch, TensorFlow, and ONNX, and optimizes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).
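The typical workflow described above can be sketched with the TensorRT Python API: parse an ONNX model, configure the build, and serialize an engine for deployment. This is a minimal sketch, assuming TensorRT 10.x is installed and a model has already been exported to ONNX; the file names `model.onnx` and `model.engine` are illustrative placeholders.

```python
import tensorrt as trt

# Create a logger and builder; WARNING verbosity keeps build output quiet.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# TensorRT 10 networks are always explicit-batch, so no creation flags are needed here.
network = builder.create_network(0)

# Parse the ONNX model into the network definition.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # "model.onnx" is a placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

# Configure the build: allow FP16 kernels where they are profitable.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Build and serialize the engine for later deployment.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```

At deployment time, the serialized engine is loaded with `trt.Runtime` and executed through an execution context; building is the expensive step and is normally done once per target GPU.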

Quick Start#

🆕 What's New in NVIDIA TensorRT 10.16.1#

Latest Release Highlights

  • TensorRT 11.0 Coming Soon – New capabilities for PyTorch/Hugging Face integration, modernized APIs, removal of legacy weakly-typed APIs. Migrate early to Strongly Typed Networks, Explicit Quantization, and IPluginV3

  • JetPack Support for Orin iGPUs – Orin iGPU support via the ARM SBSA build, available as an early-access download ahead of JetPack 7.x

  • Safety Headers Included – Functional safety headers for ISO 26262-compliant applications are now included in all standard TensorRT packages

  • Interactive Sample Explorer – Browse all TensorRT samples by difficulty, language, or use case

  • Interactive Support Matrix – Filterable support matrix with three explorers for system requirements, hardware capabilities, and feature support

View 10.16.1 Release Notes
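The migration advice above centers on strongly typed networks, where tensor data types come from the model itself rather than from builder precision flags. A minimal sketch of opting in, assuming the TensorRT 10.x Python bindings:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Opt in to strongly typed mode: every tensor's dtype is taken from the
# network definition (e.g. as exported in the ONNX model), rather than
# being chosen by the builder from precision flags.
flags = 1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
network = builder.create_network(flags)

# In this mode, precision flags such as trt.BuilderFlag.FP16 must not be
# set; reduced precision is instead expressed explicitly in the model,
# for example via ONNX Q/DQ nodes (explicit quantization) or casts.
```

Migrating to this mode now avoids depending on the weakly-typed APIs slated for removal in TensorRT 11.0.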

Previous Releases#

📋 Release 10.16.0 Highlights
  • Multi-Device Inference (Preview) – Scale inference across multiple GPUs with IDistCollectiveLayer and multi-device attention via NCCL

  • MoE (Mixture of Experts) – Built-in IMoELayer for transformer MoE blocks on SM110 with NVFP4/FP8 quantization

  • Interactive Sample Explorer – Browse all TensorRT samples by difficulty, language, or use case

  • Interactive Support Matrix – Filterable support matrix with three explorers for system requirements, hardware capabilities, and feature support; contains all 10.x releases

  • API Capture and Replay Multi-Network Support – Capture and replay multiple networks within a single process for ensemble models and multi-stage inference pipelines

  • Internal Library Path API – New setInternalLibraryPath API for custom builder resource locations

  • Breaking ABI Changes – Windows DLL files moved from lib/ to bin/ subdirectory; libonnx_proto.a merged into libnvonnxparser_static.a

View 10.16.0 Release Notes

📋 Release 10.15.1 Highlights
  • KV Cache Reuse API – New KVCacheUpdate API for efficient KV cache reuse in transformer models, significantly improving LLM performance

  • Built-in RoPE Support – Native support for Rotary Position Embedding with new RotaryEmbedding API layer for easier transformer deployment

  • Blackwell GPU Support – B200 and B300 GPU support on Windows is now fully production-ready (no longer experimental)

  • DLA-Only Mode – New kREPORT_CAPABILITY_DLA ONNX Parser flag for generating engines that run exclusively on DLA without GPU fallback

  • Performance Fixes – Resolved multiple regressions on Blackwell GPUs: up to 9% for FLUX FP16, 24% for ResNext-50 FP8, 25% for ConvNets with GlobalAveragePool, and 10% for BERT FP16

View 10.15.1 Release Notes

📋 Release 10.14.1 Highlights
  • New GPU Support – Added support for NVIDIA GB300, DGX B300, and DGX Spark with functionally complete and performant drivers

  • IAttention API – New fused attention operator API for improved transformer model performance with automatic head padding for better alignment

  • Flexible Output Indices – New APIs for TopK, NMS, and NonZero operations to control output indices data type (INT32 or INT64)

  • Partitioned Builder Resources – Architecture-specific builder resources to reduce memory usage during engine build

  • Engine Statistics API – New getEngineStat() API for querying precise weight sizes and engine metrics

  • Performance Improvements – Fixed up to 78% FP8 regression on Blackwell for densenet121, 55% MHA regression for ViT models, and 120 MB memory regression for FLUX

View 10.14.1 Release Notes

📦 Archived Releases (10.0–10.11)

Earlier TensorRT 10.x releases with key highlights:

📖 Legacy Versions

Note

For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.