NVIDIA TensorRT Documentation

NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It takes trained models from frameworks such as PyTorch and TensorFlow, or in the ONNX interchange format, and optimizes them for high-performance deployment. It supports mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).

Quick Start

  • 🆕 New to NVIDIA TensorRT? → Start with the Quick Start Guide to build and deploy your first optimized inference engine in 30–60 minutes

  • ⬆️ Upgrading from 10.16 or earlier? → Refer to What’s New in 11.0.0 below

  • 🔧 Need help with a specific task? → Jump to the Inference Library for API walkthroughs, dynamic shapes, quantization, and more, or go to the Troubleshooting section

🆕 What’s New in NVIDIA TensorRT 11.0.0

Latest Release Highlights

  • Strongly typed networks are now the default: Weak-typing APIs (setPrecision, setDynamicRange, the per-precision BuilderFlag family) and implicit quantization (IInt8Calibrator) have been removed. Use the NVIDIA TensorRT Migration Guide to plan your upgrade

  • IPluginV2 has been removed: The entire IPluginV2 family is gone; migrate custom plugins to IPluginV3 with addPluginV3(). See the V2 → V3 walkthrough for a side-by-side API mapping

  • Multi-Device Inference is generally available: The preview flag has been retired, and this release adds AllToAll, Gather, and Scatter collective ops, automatic NCCL library fallback, and a new context-parallel attention sample. Refer to Multi-Device Inference

  • Ragged batching for attention: IAttention and IKVCacheUpdateLayer now support packed (kPACKED_NHD) layouts so variable-length sequences can be concatenated end-to-end without padding. Refer to Fused Attention

  • MoE inference performance: Significant Blackwell (SM10x/SM110) backend improvements close the gap to specialized external MoE kernels; the previous “keep seqLen ≤ 16” guidance no longer applies. Refer to MoE (Mixture of Experts)

  • Rewritten Best Practices and Benchmarking guide: Reframed as a measure-then-optimize loop, with side-by-side ONNX-TRT (trtexec) and Torch-TRT workflows in synchronized tabs. The guide covers quantization, dynamic shapes, CUDA graphs, profiling, and Nsight Systems timeline reading. Refer to Performance Benchmarking

  • Platform updates: RHEL 10 / Rocky Linux 10 RPM and tar packages, and a new TensorRT 10.x to 11.x migration path with dedicated DriveOS and Jetson/JetPack chapters
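
The packed layout mentioned in the ragged-batching highlight above can be illustrated outside TensorRT. The following is a minimal conceptual sketch in plain Python, not the TensorRT API: the function names (pack_sequences, unpack_sequences, padded_size) are hypothetical, and the cumulative-offset bookkeeping is an assumption about how packed batches are commonly described, not a statement of how kPACKED_NHD is implemented.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate variable-length sequences end to end (hypothetical helper).

    Returns the packed buffer and a list of offsets, where offsets[i] is the
    start of sequence i and offsets[-1] is the total packed length.
    """
    lengths = [len(seq) for seq in sequences]
    offsets = [0, *accumulate(lengths)]
    packed = [token for seq in sequences for token in seq]
    return packed, offsets


def unpack_sequences(packed, offsets):
    """Recover the original sequences from the packed buffer and offsets."""
    return [packed[start:end] for start, end in zip(offsets, offsets[1:])]


def padded_size(sequences):
    """Element count a padded (batch x max_len) layout would need."""
    return len(sequences) * max((len(seq) for seq in sequences), default=0)


# Three sequences of lengths 3, 1, and 2.
batch = [[11, 12, 13], [21], [31, 32]]
packed, offsets = pack_sequences(batch)

print(packed)               # six elements, no padding slots
print(offsets)              # start position of each sequence, plus the total
print(padded_size(batch))   # slots a padded layout would occupy
```

In this sketch the packed buffer holds 6 elements, while a padded layout for the same batch would hold 9, which is the saving the highlight refers to. Refer to Fused Attention for the actual tensor layout and the inputs that TensorRT expects.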

View 11.0.0 Release Notes

Previous Releases

📋 Release 10.16.1 Highlights

  • TensorRT 11.0.0 Coming Soon: New capabilities for PyTorch/Hugging Face integration, modernized APIs, and removal of legacy weakly-typed APIs. Migrate early to Strongly Typed Networks, Explicit Quantization, and IPluginV3

  • JetPack Support for Orin iGPUs: Orin iGPU support via the ARM SBSA build, available as an early-access download ahead of JetPack 7.x

  • Safety Headers Included: Functional safety headers for ISO 26262-compliant applications are now included in all standard TensorRT packages

  • Interactive Sample Explorer: Browse all TensorRT samples by difficulty, language, or use case

  • Interactive Support Matrix: Filterable support matrix with three explorers for system requirements, hardware capabilities, and feature support

View 10.16.1 Release Notes

📦 Archived Releases

Earlier TensorRT 10.x releases with key highlights:

📖 Legacy Versions

Note

For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.