NVIDIA TensorRT Documentation

NVIDIA TensorRT is an SDK for optimizing and accelerating deep learning inference on NVIDIA GPUs. It takes trained models from frameworks such as PyTorch and intermediate formats such as ONNX, and optimizes them for high-performance deployment with support for mixed precision (FP32/FP16/BF16/FP8/INT8), dynamic shapes, and specialized optimizations for transformers and large language models (LLMs).

Quick Start
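The sketch below is a minimal, hedged example of the workflow described above, using the TensorRT Python API: parse an ONNX model, build a serialized engine with FP16 enabled, and deserialize it for inference. The file name model.onnx is a placeholder, and the model is assumed to have static input shapes (dynamic shapes are sketched later on this page).

```python
# A minimal sketch of the ONNX-to-engine workflow with the TensorRT 10.x
# Python API. "model.onnx" is a placeholder; error handling is trimmed.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# 1. Parse the ONNX model into a TensorRT network definition.
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit batch is the only mode in 10.x
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

# 2. Build a serialized engine, allowing FP16 kernels where they help.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
serialized_engine = builder.build_serialized_network(network, config)

# 3. Deserialize the engine and create an execution context for inference.
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(serialized_engine)
context = engine.create_execution_context()
```

Running the engine additionally requires device buffers (for example via the cuda-python or pycuda packages), binding them with context.set_tensor_address, and launching with context.execute_async_v3(stream_handle).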

🆕 What's New in NVIDIA TensorRT 10.16.0

Latest Release Highlights

  • TensorRT 11.0 Coming Soon - New capabilities for PyTorch/Hugging Face integration, modernized APIs, and removal of legacy weakly-typed APIs

  • Multi-Device Inference (Preview) - Scale inference across multiple GPUs with IDistCollectiveLayer and multi-device attention via NCCL

  • MoE (Mixture of Experts) - Built-in IMoELayer for transformer MoE blocks on SM110 with NVFP4/FP8 quantization

  • Interactive Sample Explorer - Browse all TensorRT samples by difficulty, language, or use case

  • Interactive Support Matrix - Filterable support matrix with three explorers for system requirements, hardware capabilities, and feature support; covers all 10.x releases

  • API Capture and Replay Multi-Network Support - Capture and replay multiple networks within a single process for ensemble models and multi-stage inference pipelines

  • Internal Library Path API - New setInternalLibraryPath API for custom builder resource locations

  • Breaking ABI Changes - Windows DLL files moved from lib/ to bin/ subdirectory; libonnx_proto.a merged into libnvonnxparser_static.a

View 10.16.0 Release Notes

What You'll Find Here

  • 🚀 Getting Started - Quick start guide, release notes, and platform support matrix

  • 📦 Installing TensorRT - Installation requirements, prerequisites, and step-by-step setup instructions

  • 🏗️ Architecture - TensorRT design overview, optimization capabilities, and how the inference engine works

  • 🔧 Inference Library - C++ and Python APIs, interactive sample explorer, and advanced features like quantization and dynamic shapes (a dynamic-shapes sketch follows this list)

  • ⚡ Performance - Best practices for optimization and using trtexec for benchmarking

  • 📚 API - Complete API references for C++, Python, ONNX GraphSurgeon, and Polygraphy tools

  • 📖 Reference - Troubleshooting guides, operator support, command-line tools, and glossary
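As a companion to the dynamic-shapes feature mentioned above, here is a hedged sketch of the optimization-profile API. The input name "input" and the (batch, 3, 224, 224) shapes are illustrative; the network is assumed to declare its batch dimension as -1.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# The network (parsed or built elsewhere) must declare the dynamic dimension
# as -1, e.g. an input named "input" with shape (-1, 3, 224, 224).
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  (1, 3, 224, 224),   # min: smallest shape the engine accepts
                  (8, 3, 224, 224),   # opt: shape the builder tunes for
                  (32, 3, 224, 224))  # max: largest shape the engine accepts
config.add_optimization_profile(profile)

# At runtime, fix a concrete shape within [min, max] before execution:
#   context.set_input_shape("input", (4, 3, 224, 224))
```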

Previous Releases

📋 Release 10.15.1 Highlights
  • KV Cache Reuse API - New KVCacheUpdate API for efficient KV cache reuse in transformer models, significantly improving LLM performance

  • Built-in RoPE Support - Native support for Rotary Position Embedding with new RotaryEmbedding API layer for easier transformer deployment

  • Blackwell GPU Support - B200 and B300 GPU support on Windows is now fully production-ready (no longer experimental)

  • DLA-Only Mode - New kREPORT_CAPABILITY_DLA ONNX Parser flag for generating engines that run exclusively on DLA without GPU fallback (see the hedged sketch after this list)

  • Performance Fixes - Resolved multiple regressions on Blackwell GPUs: up to 9% for FLUX FP16, 24% for ResNeXt-50 FP8, 25% for ConvNets with GlobalAveragePool, and 10% for BERT FP16
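The kREPORT_CAPABILITY_DLA flag above is new in this release, so its exact usage is best taken from the release notes; the long-standing builder-config controls for steering an engine onto DLA are sketched below (the core index and fallback choice are illustrative).

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Prefer DLA for every layer and pin the engine to DLA core 0.
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0

# Leaving GPU fallback disabled keeps the engine DLA-only; uncommenting the
# flag below would instead let unsupported layers fall back to the GPU.
# config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
```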

View 10.15.1 Release Notes

📋 Release 10.14.1 Highlights
  • New GPU Support - Added support for NVIDIA GB300, DGX B300, and DGX Spark with functionally complete and performant drivers

  • IAttention API - New fused attention operator API for improved transformer model performance with automatic head padding for better alignment

  • Flexible Output Indices - New APIs for Topk, NMS, and NonZero operations to control output indices data type (INT32 or INT64)

  • Partitioned Builder Resources - Architecture-specific builder resources to reduce memory usage during engine build

  • Engine Statistics API - New getEngineStat() API for querying precise weight sizes and engine metrics

  • Performance Improvements - Fixed an FP8 regression of up to 78% on Blackwell for densenet121, a 55% MHA regression for ViT models, and a 120 MB memory regression for FLUX

View 10.14.1 Release Notes

📋 Release 10.13.3 Highlights
  • API Capture and Replay - New debugging tool that streamlines reproducing and debugging issues within TensorRT applications

  • Python 3.8 and 3.9 Deprecation Notice - Some samples no longer support Python 3.8 and 3.9; Python 3.10-3.12 is recommended

  • FP4 Build Time Improvements - Fixed significant build time regressions for FP4 quantized networks on Thor Jetson and Thor GPUs with CUDA 13.0

  • Samples Installation Fix - Resolved build failures in minimal container environments when including cuda_profiler_api.h

View 10.13.3 Release Notes

📋 Release 10.12.0 Highlights
  • MXFP8 Quantization Support - Block quantization across 32 high-precision elements with an E8M0 scaling factor for improved model compression

  • Enhanced Debug Tensor Feature - Mark all unfused tensors as debug tensors without preventing fusion, with support for NumPy, string, and raw data formats

  • Distributive Independence Determinism - Guarantee identical outputs across the distributive axis when inputs are identical, improving reproducibility

  • Weak Typing APIs Deprecated - TensorRT is migrating to strong typing exclusively; refer to the Strong Typing vs Weak Typing guide (a migration sketch follows this list)

  • Refactored Python Samples - New samples with cleaner structure: 1_run_onnx_with_tensorrt and 2_construct_network_with_layer_apis
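To accompany the weak-typing deprecation above, a minimal migration sketch: strongly typed networks are requested with a network-creation flag, after which precision follows the data types in the network itself rather than per-precision builder flags.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Weakly typed (deprecated): the builder chooses precisions, steered by
# config flags such as trt.BuilderFlag.FP16.
#   network = builder.create_network(0)

# Strongly typed: precision follows the data types expressed in the network
# (e.g. by the ONNX model), so per-precision builder flags no longer apply.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
)
```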

View 10.12.0 Release Notes

📦 Archived Releases (10.0 - 10.11)

Earlier TensorRT 10.x releases with key highlights:

📖 Legacy Versions

Note

For complete version history and detailed changelogs, visit the Release Notes section or the TensorRT GitHub Releases.