NVIDIA TensorRT Documentation#

NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It takes trained models from frameworks such as PyTorch, TensorFlow, and ONNX, and optimizes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8/FP4/INT4), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).

Quick Start#

πŸ†• What’s New in NVIDIA TensorRT 11.1.0#

Release Highlights

  • CUDA 13.3 dependency upgrade: Updated CUDA Toolkit baseline across Linux x86-64, Windows x64, and SBSA platforms

  • Ubuntu 26.04 support: Adds Ubuntu 26.04 LTS to the supported Linux x86-64 and SBSA platform lists alongside the existing Ubuntu 22.04/24.04 packages

  • Python 3.14 bindings: Extends the Python wheel matrix to Python 3.14 on supported platforms

  • NVFP4 dual-GEMM fusion for SM121: Fuses the gate and up projection GEMMs in NVFP4 MoE/MLP blocks on NVIDIA DGX Spark (compute capability 12.1)

  • Global Performance Tuner: Automates trtexec build-route search to explore internal builder knobs, benchmark candidate engines, and optionally validate accuracy before selecting the fastest valid route. Refer to Global Performance Tuning.

View 11.1.0 Release Notes

Previous Releases#

πŸ“‹ Release 11.0.0 Highlights
  • Strongly typed networks are now the default: Weak-typing APIs (setPrecision, setDynamicRange, the per-precision BuilderFlag family) and implicit quantization (IInt8Calibrator) have been removed. Use the NVIDIA TensorRT Migration Guide to plan your upgrade

  • IPluginV2 has been removed: The entire IPluginV2 family is gone; migrate custom plugins to IPluginV3 with addPluginV3(). Refer to the V2 β†’ V3 walkthrough for a side-by-side API mapping

  • Multi-Device Inference is generally available: Preview flag retired, plus new AllToAll, Gather, and Scatter collective ops, automatic NCCL library fallback, and a new context-parallel attention sample. Refer to Multi-Device Inference

  • Ragged batching for attention: IAttention and IKVCacheUpdateLayer now support packed (kPACKED_NHD) layouts so variable-length sequences can be concatenated end-to-end without padding. Refer to Fused Attention

  • MoE inference performance: Significant Blackwell (SM10x/SM110) backend improvements close the gap to specialized external MoE kernels; the previous β€œkeep seqLen ≀ 16” guidance no longer applies. Refer to MoE (Mixture of Experts)

  • Rewritten Best Practices and Benchmarking guide: Reframed as a measure-then-optimize loop with side-by-side ONNX-TRT (trtexec) and Torch-TRT workflows in synchronized tabs covering quantization, dynamic shapes, CUDA graphs, profiling, and Nsight Systems timeline reading. Refer to Performance Benchmarking

  • Platform updates: RHEL 10 / Rocky Linux 10 RPM and tar packages, and a new TensorRT 10.x to 11.x migration path with dedicated DriveOS and Jetson/JetPack chapters

View 11.0.0 Release Notes

πŸ“¦ Archived Releases

Earlier TensorRT releases with key highlights:

πŸ“– Legacy Versions

Note

For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.