Inference Library Overview
This section documents how to build TensorRT engines, run inference with the C++ and Python APIs, and apply advanced runtime features for production deployment on NVIDIA GPUs. Use this overview to choose the right guide for your integration stage.
About the Inference Library
The Inference Library covers the builder and runtime workflow (network definition, engine serialization, execution contexts, and inference) and includes specialized guides for quantization, dynamic shapes, custom plugins, control flow, DLA, transformers, and debugging tools.
For your first engine, start with Build Your First Engine and the Quick Start Guide. For installation and platform support, see Installation Guide Overview and Support Matrix. For runtime object lifetimes and threading, refer to How TensorRT Works.
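The builder and runtime workflow summarized above can be sketched in Python as follows. This is a minimal illustration, not the documented reference flow: it assumes the TensorRT 10.x-style Python API (`trt.Builder`, `trt.OnnxParser`, `build_serialized_network`, `deserialize_cuda_engine`), where networks are explicit-batch by default, and names may differ in your release. The function name `build_and_load_engine` and both file paths are hypothetical. Refer to the Python API Documentation for the authoritative steps.

```python
from pathlib import Path


def build_and_load_engine(onnx_path: str, plan_path: str):
    """Build an engine from an ONNX file, save it, and create an execution context.

    Hypothetical helper for illustration; assumes a TensorRT 10.x-style Python API.
    """
    onnx_file = Path(onnx_path)
    if not onnx_file.is_file():
        raise FileNotFoundError(f"ONNX model not found: {onnx_file}")

    # Imported lazily so the path check above works without TensorRT installed.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)

    # Build phase: network definition, ONNX import, engine serialization.
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # assumption: explicit batch is the default
    parser = trt.OnnxParser(network, logger)
    if not parser.parse_from_file(str(onnx_file)):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("Engine build failed")

    # Persist the serialized engine so later runs can skip the build phase.
    with open(plan_path, "wb") as f:
        f.write(serialized)

    # Runtime phase: deserialize the engine and create an execution context.
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(serialized)
    if engine is None:
        raise RuntimeError("Engine deserialization failed")
    context = engine.create_execution_context()

    # Return the runtime as well so it stays alive as long as the engine does.
    return runtime, engine, context
```

Running inference additionally requires allocating device buffers and binding them to the engine's input and output tensors, which this sketch leaves out.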
Integration Path Overview
1. Understand capabilities: build/runtime model, precision support, and links to deeper topics. See TensorRT’s Capabilities.
2. Follow language-specific walkthroughs: step-by-step ONNX import, engine build, and inference. See C++ API Documentation or Python API Documentation.
3. Explore samples: build and run shipped samples with Sample Explorer. See Sample Support Guide.
4. Apply advanced configuration: engine compatibility, refitting, precision, formats, and multi-device. See Advanced Topics.
5. Optimize accuracy and performance: quantization, dynamic shapes, transformers, and tuning. See the topics in What’s in This Section below.
What’s in This Section
The Inference Library section is organized into the following guides:
TensorRT’s Capabilities
Overview of the build and runtime model, supported precisions, refitting, and links to specialized topics.
C++ API Documentation
Detailed C++ workflow: network creation, ONNX import, engine build, deserialization, and inference.
Python API Documentation
Python equivalents for parsing ONNX, building engines, and executing inference.
Sample Support Guide
Sample Explorer, build and run instructions, and cross-compiling guidance.
Advanced Topics
Engine compatibility, refitting, precision control, tensor formats, engine tools, weight streaming, and multi-device inference.
Work With Quantized Types
Explicit quantization, Q/DQ networks, PTQ, and QAT workflows.
Accuracy Considerations
Reduced-precision trade-offs, determinism, and numerical debugging.
Work With Dynamic Shapes
Optimization profiles, shape tensors, and data-dependent outputs.
Extending TensorRT with Custom Layers
Plugin V3 authoring, registration, and advanced plugin patterns.
Work With Loops
ILoop networks for recurrent and iterative subgraphs.
Work With Conditionals
IIfConditional networks for data-dependent branching.
Work With DLA
Deep Learning Accelerator deployment, formats, and standalone mode. DLA is not supported in TensorRT 11.0 or 11.1.
TensorRT API Capture and Replay
Record and replay engine-building API sequences for debugging.
Work With Transformers
Fused attention, KV cache, MoE, and transformer-specific optimizations.
For ONNX export paths and deployment workflows, refer to the Quick Start Guide. For measure-then-optimize performance guidance, see Performance Best Practices.