Inference Library Overview

This section documents how to build TensorRT engines, run inference with the C++ and Python APIs, and apply advanced runtime features for production deployment on NVIDIA GPUs. Use this overview to choose the right guide for your integration stage.

About the Inference Library

The Inference Library covers the builder and runtime workflow — network definition, engine serialization, execution contexts, and inference — plus specialized guides for quantization, dynamic shapes, custom plugins, control flow, DLA, transformers, and debugging tools.

For your first engine, start with Build Your First Engine and the Quick Start Guide. For installation and platform support, see the Installation Guide Overview and the Support Matrix. For runtime object lifetimes and threading, refer to How TensorRT Works.

Integration Path Overview

  1. Understand capabilities — Build/runtime model, precision support, and links to deeper topics → TensorRT’s Capabilities

  2. Follow language-specific walkthroughs — Step-by-step ONNX import, engine build, and inference → C++ API Documentation or Python API Documentation

  3. Explore samples — Build and run shipped samples with Sample Explorer → Sample Support Guide

  4. Apply advanced configuration — Engine compatibility, refitting, precision, formats, and multi-device → Advanced Topics

  5. Optimize accuracy and performance — Quantization, dynamic shapes, transformers, and tuning → topics in What’s in This Section below

What’s in This Section

The Inference Library is organized into the following guides:

TensorRT’s Capabilities

Overview of the build and runtime model, supported precisions, refitting, and links to specialized topics.

C++ API Documentation

Detailed C++ workflow: network creation, ONNX import, engine build, deserialization, and inference.

Python API Documentation

Python equivalents for parsing ONNX, building engines, and executing inference.

Sample Support Guide

Sample Explorer, build and run instructions, and cross-compiling guidance.

Advanced Topics

Engine compatibility, refitting, precision control, tensor formats, engine tools, weight streaming, and multi-device inference.

Work With Quantized Types

Explicit quantization, Q/DQ networks, PTQ, and QAT workflows.

Accuracy Considerations

Reduced-precision trade-offs, determinism, and numerical debugging.

Work With Dynamic Shapes

Optimization profiles, shape tensors, and data-dependent outputs.

Extending TensorRT with Custom Layers

Plugin V3 authoring, registration, and advanced plugin patterns.

Work With Loops

ILoop networks for recurrent and iterative subgraphs.

Work With Conditionals

IIfConditional networks for data-dependent branching.

Work With DLA

Deep Learning Accelerator deployment, formats, and standalone mode. DLA is not supported in TensorRT 11.0 or 11.1.

TensorRT API Capture and Replay

Record and replay engine-building API sequences for debugging.

Work With Transformers

Fused attention, KV cache, MoE, and transformer-specific optimizations.

For ONNX export paths and deployment workflows, refer to the Quick Start Guide. For measure-then-optimize performance guidance, see Performance Best Practices.