Inference Library Overview

This section documents how to build TensorRT engines, run inference with the C++ and Python APIs, and apply advanced runtime features for production deployment on NVIDIA GPUs. Use this overview to choose the right guide for your integration stage.

About the Inference Library

The Inference Library covers the builder and runtime workflow: network definition, engine serialization, execution contexts, and inference, plus specialized guides for quantization, dynamic shapes, custom plugins, control flow, DLA, transformers, and debugging tools.

For your first engine, start with Build Your First Engine and the Quick Start Guide. For installation and platform support, refer to Installation Guide Overview and Support Matrix. For runtime object lifetimes and threading, refer to How TensorRT Works.
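The builder and runtime workflow described above can be sketched in Python as follows. This is a minimal, non-authoritative outline that assumes the `tensorrt` Python package and an NVIDIA GPU are available. The function name `build_and_load_engine` and the default file name `model.plan` are hypothetical, and exact API details (for example, the flags passed to `create_network`) can differ between TensorRT releases, so treat the language-specific guides as the reference.

```python
from pathlib import Path


def build_and_load_engine(onnx_path, plan_path="model.plan"):
    """Build an engine from an ONNX file, serialize it, then reload it.

    Hypothetical helper for illustration; not part of the TensorRT API.
    """
    if not Path(onnx_path).is_file():
        raise FileNotFoundError(onnx_path)

    # Imported here so this module can be loaded on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)

    # Build phase: create a network definition and populate it from ONNX.
    builder = trt.Builder(logger)
    network = builder.create_network(0)  # assumption: default flags are acceptable
    parser = trt.OnnxParser(network, logger)
    if not parser.parse_from_file(str(onnx_path)):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    # Engine serialization: build the plan and write it to disk.
    config = builder.create_builder_config()
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("Engine build failed; check the logger output.")
    with open(plan_path, "wb") as f:
        f.write(serialized)

    # Runtime phase: deserialize the plan and create an execution context.
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(Path(plan_path).read_bytes())
    if engine is None:
        raise RuntimeError("Engine deserialization failed.")
    context = engine.create_execution_context()

    # Return the runtime as well so that it stays alive as long as the engine.
    return runtime, engine, context
```

Running inference additionally requires allocating device buffers and binding them to the execution context, which is omitted here; the C++ and Python API guides walk through that step.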

Integration Path Overview

Understand capabilities: Build/runtime model, precision support, and links to deeper topics → TensorRT’s Capabilities
Plan engine compatibility: Version and hardware compatibility, compatibility checks → Engine Compatibility
Follow language-specific walkthroughs: Step-by-step ONNX import, engine build, and inference → C++ API Documentation or Python API Documentation
Explore samples: Build and run shipped samples with Sample Explorer → Sample Support Guide
Apply advanced configuration: Refitting, precision, formats, engine tools, weight streaming, and multi-device → Advanced Topics
Optimize accuracy and performance: Quantization, dynamic shapes, transformers, and tuning → topics in What’s in This Section below

What’s in This Section

The Inference Library is organized into the following guides:

TensorRT’s Capabilities

Overview of the build and runtime model, supported precisions, refitting, and links to specialized topics.

→ TensorRT’s Capabilities

Engine Compatibility

Version and hardware compatibility, compatibility checks, and cross-platform engine deployment.

→ Engine Compatibility

C++ API Documentation

Detailed C++ workflow: network creation, ONNX import, engine build, deserialization, and inference.

→ C++ API Documentation

Python API Documentation

Python equivalents for parsing ONNX, building engines, and executing inference.

→ Python API Documentation

Sample Support Guide

Sample Explorer, build and run instructions, and cross-compiling guidance.

→ Sample Support Guide

Advanced Topics

Refitting, precision control, tensor formats, engine tools, weight streaming, and multi-device inference.

→ Advanced Topics

Work With Quantized Types

Explicit quantization, Q/DQ networks, PTQ, and QAT workflows.

→ Work With Quantized Types

Accuracy Considerations

Reduced-precision trade-offs, determinism, and numerical debugging.

→ Accuracy Considerations

Work With Dynamic Shapes

Optimization profiles, shape tensors, and data-dependent outputs.

→ Work With Dynamic Shapes

Extending TensorRT with Custom Layers

Plugin V3 authoring, registration, and advanced plugin patterns.

→ Extending TensorRT with Custom Layers

Work With Loops

ILoop networks for recurrent and iterative subgraphs.

→ Work With Loops

Work With Conditionals

IIfConditional networks for data-dependent branching.

→ Work With Conditionals

Work With DLA

Deep Learning Accelerator deployment, formats, and standalone mode. DLA is not supported in TensorRT 11.0, 11.1, or 11.2.

→ Work With DLA

TensorRT API Capture and Replay

Record and replay engine-building API sequences for debugging.

→ TensorRT API Capture and Replay

Work With Transformers

Fused attention, KV cache, MoE, and transformer-specific optimizations.

→ Work With Transformers

For ONNX export paths and deployment workflows, refer to the Quick Start Guide. For measure-then-optimize performance guidance, refer to Performance Best Practices.