TensorRT’s Capabilities#
This section provides an overview of what you can do with TensorRT. It is intended to be useful to all TensorRT users.
C++ and Python APIs#
TensorRT’s API has language bindings for both C++ and Python, with nearly identical capabilities. The API facilitates interoperability with Python data processing toolkits and libraries like NumPy and SciPy. The C++ API can be more efficient and may better meet some compliance requirements, for example, in automotive applications.
Note
The Python API is not available for all platforms. For more information, refer to the Support Matrix.
The Programming Model#
TensorRT operates in two phases. In the first phase, usually performed offline, you provide TensorRT with a model definition, and TensorRT optimizes it for a target GPU. In the second phase, you use the optimized model to run inference.
The Build Phase#
The highest-level interface for the build phase of TensorRT is the Builder (C++, Python). The builder is responsible for optimizing a model and producing an Engine.
To build an engine, you must:
Create a network definition.
Specify a configuration for the builder.
Call the builder to create the engine.
The NetworkDefinition interface (C++, Python) defines the model. The most common path to transfer a model to TensorRT is to export it from a framework in ONNX format and use TensorRT’s ONNX parser to populate the network definition. However, you can also construct the definition step by step using TensorRT’s Layer (C++, Python) and Tensor (C++, Python) interfaces.
Whichever way you choose, you must also define which tensors are the inputs and outputs of the network. Tensors that are not marked as outputs are considered to be transient values that the builder can optimize away. Input and output tensors must be named so that TensorRT knows how to bind the input and output buffers to the model at runtime.
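As a rough illustration of the ONNX path, the following Python sketch populates a network definition from an ONNX file and prints the resulting input and output tensor names. The file name model.onnx is a placeholder, and error handling is kept minimal:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Explicit-batch networks are the standard (and, in recent versions, the only) mode.
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # Parse the ONNX file; "model.onnx" is a placeholder path.
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse the ONNX model")

    # The parser marks the ONNX graph outputs as network outputs. When building a
    # network layer by layer instead, call network.mark_output(tensor) yourself.
    print("Inputs:", [network.get_input(i).name for i in range(network.num_inputs)])
    print("Outputs:", [network.get_output(i).name for i in range(network.num_outputs)])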
The BuilderConfig interface (C++, Python) is used to specify how TensorRT should optimize the model. Among the configuration options available, you can control TensorRT’s ability to reduce the precision of calculations, control the tradeoff between memory and runtime execution speed, and constrain the choice of CUDA kernels. Since the builder can take minutes or more to run, you can also control how the builder searches for kernels and caches search results for reuse in subsequent runs.
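For example, a builder configuration might be set up along the following lines in Python. This is a minimal sketch: the FP16 flag, the one-gibibyte workspace limit, and the empty timing cache are illustrative choices, not required settings:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Allow TensorRT to choose lower-precision (FP16) kernels where they are faster.
    config.set_flag(trt.BuilderFlag.FP16)

    # Cap the scratch memory available to a single layer (example value: 1 GiB).
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

    # Cache kernel timing results so subsequent builds can reuse them.
    timing_cache = config.create_timing_cache(b"")   # pass serialized bytes to reload a saved cache
    config.set_timing_cache(timing_cache, False)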
After you have a network definition and a builder configuration, you can call the builder to create the engine. The builder eliminates dead computations, folds constants, and reorders and combines operations so that they run more efficiently on the GPU. It can reduce the precision of floating-point computations, either by simply running them in 16-bit floating point or by quantizing floating-point values so that calculations can be performed using 8-bit integers. It also times multiple implementations of each layer with varying data formats and then computes an optimal schedule to execute the model, minimizing the combined cost of kernel executions and format transforms.
The builder creates the engine in a serialized form called a plan, which can be deserialized immediately or saved to disk for later use.
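Putting these steps together, a minimal end-to-end build in Python might look like the sketch below; the file names are placeholders, and error checking is omitted for brevity:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open("model.onnx", "rb") as f:            # placeholder model path
        parser.parse(f.read())
    config = builder.create_builder_config()

    # The builder returns the engine already in serialized (plan) form.
    plan = builder.build_serialized_network(network, config)

    # Save the plan so it can be deserialized later without rebuilding.
    with open("model.plan", "wb") as f:            # placeholder plan path
        f.write(plan)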
Note
By default, TensorRT engines are specific to both the TensorRT version and the GPU on which they were created. Refer to the Version Compatibility and Hardware Compatibility sections to configure an engine for forward compatibility.
TensorRT’s network definition does not deep-copy parameter arrays (such as the weights for a convolution). Therefore, you must not release the memory for those arrays until the build phase is complete. When importing a network using the ONNX parser, the parser owns the weights, so the parser must not be destroyed until the build phase is complete.
The builder times algorithms to determine the fastest. Running the builder in parallel with other GPU work may perturb the timings, resulting in poor optimization.
The Runtime Phase#
The highest-level interface for the execution phase of TensorRT is the Runtime (C++, Python).
When using the runtime, you will typically carry out the following steps:
Deserialize a plan to create an engine.
Create an execution context from the engine.
Then, repeatedly:
Populate input buffers for inference.
Call enqueueV3() on the execution context to run inference.
The Engine interface (C++, Python) represents an optimized model. You can query an engine for information about the input and output tensors of the network - the expected dimensions, data type, data format, and so on.
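For instance, with the Python API you can iterate over an engine's I/O tensors roughly as follows (a sketch that assumes a previously saved plan file named model.plan):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open("model.plan", "rb") as f:            # placeholder plan path
        engine = runtime.deserialize_cuda_engine(f.read())

    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        print(name,
              engine.get_tensor_mode(name),        # INPUT or OUTPUT
              engine.get_tensor_dtype(name),       # e.g. DataType.FLOAT
              engine.get_tensor_shape(name))       # -1 marks a dynamic dimension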
The ExecutionContext interface (C++, Python), created from the engine, is the main interface for invoking inference. The execution context contains all of the states associated with a particular invocation - thus, you can have multiple contexts associated with a single engine and run them in parallel.
You must set up the input and output buffers in the appropriate locations when invoking inference. Depending on the nature of the data, this may be in either CPU or GPU memory. If this is not obvious from your model, you can query the engine to determine which memory space to provide the buffers in.
After the buffers are set up, inference can be enqueued (enqueueV3). The required kernels are enqueued on a CUDA stream, and control is returned to the application as soon as possible. Some networks require multiple control transfers between the CPU and GPU, so control may take longer to return. To wait for completion of asynchronous execution, synchronize on the stream using cudaStreamSynchronize.
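A minimal sketch of this flow with the Python API follows. It assumes the cuda-python package for CUDA memory and stream management, a plan file named model.plan, and a model with static shapes, a single FP32 input named "input", and a single FP32 output named "output"; a real application should instead use the names, shapes, and data types queried from the engine. Note that execute_async_v3 is the Python counterpart of the C++ enqueueV3 call:

    import numpy as np
    import tensorrt as trt
    from cuda import cudart                        # provided by the cuda-python package

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open("model.plan", "rb") as f:            # placeholder plan path
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Host buffers; the tensor names and static shapes are assumptions of this sketch.
    h_input = np.random.rand(*tuple(engine.get_tensor_shape("input"))).astype(np.float32)
    h_output = np.empty(tuple(engine.get_tensor_shape("output")), dtype=np.float32)

    # Device buffers and a CUDA stream.
    _, d_input = cudart.cudaMalloc(h_input.nbytes)
    _, d_output = cudart.cudaMalloc(h_output.nbytes)
    _, stream = cudart.cudaStreamCreate()

    # Bind each named I/O tensor to its device address.
    context.set_tensor_address("input", d_input)
    context.set_tensor_address("output", d_output)

    # Copy the input to the GPU, enqueue inference, copy the result back, then wait.
    cudart.cudaMemcpyAsync(d_input, h_input.ctypes.data, h_input.nbytes,
                           cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
    context.execute_async_v3(stream)
    cudart.cudaMemcpyAsync(h_output.ctypes.data, d_output, h_output.nbytes,
                           cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
    cudart.cudaStreamSynchronize(stream)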
Plugins#
TensorRT has a Plugin interface that allows applications to implement operations that TensorRT does not support natively. While translating the network, the ONNX parser can find plugins that are created and registered with TensorRT’s PluginRegistry.
TensorRT ships with a library of plugins; the source for many of these and some additional plugins can be found on GitHub: TensorRT plugin.
You can also write your own plugin library and serialize it with the engine.
If cuDNN or cuBLAS is needed, install the library yourself, as TensorRT no longer ships with them or depends on them. To obtain a cudnnContext* or cublasContext*, the corresponding TacticSource flag must be set using nvinfer1::IBuilderConfig::setTacticSources().
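In the Python API, the corresponding call is IBuilderConfig.set_tactic_sources. The sketch below is a hedged illustration: the cuDNN and cuBLAS tactic sources are deprecated in recent TensorRT releases, so whether these enum members are available depends on your version:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Request the cuBLAS and cuDNN tactic sources so that plugins can obtain the
    # corresponding library handles (deprecated in newer TensorRT versions).
    sources = (1 << int(trt.TacticSource.CUBLAS)) | (1 << int(trt.TacticSource.CUDNN))
    config.set_tactic_sources(sources)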
Refer to the Extending TensorRT With Custom Layers section for more details.
Types and Precision#
Supported Types#
TensorRT supports FP32, FP16, BF16, FP8, FP4, INT4, INT8, INT32, INT64, UINT8, and BOOL data types. Refer to the TensorRT Operator documentation for the layer I/O data type specification.
FP32, FP16, BF16: unquantized floating point types
INT8: low-precision integer type
Implicit quantization
Interpreted as a quantized integer. A tensor with INT8 type must have an associated scale factor, provided either through calibration or the setDynamicRange API.
Explicit quantization
Interpreted as a signed integer. Conversion to/from INT8 type requires an explicit Q/DQ layer.
INT4: low-precision integer type for weight compression
INT4 is used for weight-only quantization and requires dequantization before compute is performed.
Conversion to and from INT4 type requires an explicit Q/DQ layer.
INT4 weights are expected to be serialized by packing two elements per byte. For additional information, refer to the Quantized Weights section.
FP8: low-precision floating-point type
An 8-bit floating-point type with 1 bit for sign, 4 bits for exponent, and 3 bits for mantissa.
Conversion to/from the FP8 type requires an explicit Q/DQ layer.
FP4: narrow-precision floating-point type
A 4-bit floating-point type with 1 bit for sign, 2 bits for exponent, and 1 bit for mantissa.
Conversion to/from the FP4 type requires an explicit Q/DQ layer.
FP4 weights are expected to be serialized by packing two elements per byte. For additional information, refer to the Quantized Weights section.
UINT8: unsigned integer I/O type
The data type is only usable as a network I/O type.
Network-level inputs in UINT8 must be converted from UINT8 to FP32 or FP16 using a CastLayer before the data is used in other operations.
Network-level outputs in UINT8 must be produced by a CastLayer explicitly inserted into the network (only conversions from FP32/FP16 to UINT8 are supported).
UINT8 quantization is not supported.
The ConstantLayer does not support UINT8 as an output type.
BOOL
A boolean type is used with supported layers.
Strong Typing vs Weak Typing#
When providing a network to TensorRT, you specify whether it is strongly or weakly typed, with weakly typed as the default.
For strongly typed networks, TensorRT’s optimizer will statically infer intermediate tensor types based on the network input types and the operator specifications, which match type inference semantics in frameworks. The optimizer will then adhere strictly to those types. For additional information, refer to the Strongly Typed Networks section.
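In the Python API, a strongly typed network is requested with a network-creation flag, as in this minimal sketch:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # In a strongly typed network, tensor types are inferred from the input types and
    # operator specifications, and precision-related builder flags are not used.
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))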
TensorRT’s optimizer may substitute different precisions for tensors for weakly typed networks if it increases performance. In this mode, TensorRT defaults to FP32 for all floating-point operations, but there are two ways to configure different levels of precision:
To control precision at the model level, BuilderFlag options (C++, Python) can indicate to TensorRT that it may select lower-precision implementations when searching for the fastest (and because lower precision is generally faster, it typically will). For example, by setting a single flag, you can easily instruct TensorRT to use FP16 calculations for your entire model. For regularized models whose input dynamic range is approximately one, this typically produces significant speedups with negligible change in accuracy. Both levels of control are illustrated in the sketch after this list.
For finer-grained control, where a layer must run at higher precision because part of the network is numerically sensitive or requires high dynamic range, arithmetic precision can be specified for that layer.
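The following Python sketch shows both levels of control. The tiny one-layer network, the FP16 flag, and the choice to pin the activation layer to FP32 are all illustrative assumptions:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()

    # Model-level control: allow FP16 implementations wherever TensorRT finds them faster.
    config.set_flag(trt.BuilderFlag.FP16)

    # A minimal example layer to constrain (placeholder shape and op).
    inp = network.add_input("x", trt.DataType.FLOAT, (1, 8))
    layer = network.add_activation(inp, trt.ActivationType.RELU)
    network.mark_output(layer.get_output(0))

    # Layer-level control: pin this layer to FP32 and make the constraint binding.
    layer.precision = trt.DataType.FLOAT
    layer.set_output_type(0, trt.DataType.FLOAT)
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)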
For additional information, refer to the Reduced Precision in Weakly-Typed Networks section.
Quantization#
TensorRT supports quantized floating point, where floating-point values are linearly compressed and rounded to low-precision quantized types (INT8, FP8, INT4, FP4). This significantly increases arithmetic throughput while reducing storage requirements and memory bandwidth. When quantizing a floating-point tensor, TensorRT must know its dynamic range - that is, the range of values that is important to represent; values outside this range are clamped when quantizing.
The builder can calculate dynamic range information (calibration) based on representative input data (currently supported only for INT8). Alternatively, you can perform quantization-aware training in a framework and import the model to TensorRT with the necessary dynamic range information.
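In explicit quantization, the network itself carries the Quantize/Dequantize (Q/DQ) layers. The Python sketch below illustrates a single Q/DQ pair; the input shape and the scale of 0.05 are arbitrary example values:

    import numpy as np
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

    inp = network.add_input("x", trt.DataType.FLOAT, (1, 16))

    # Per-tensor scale for the quantized representation (scalar constant, example value).
    scale = network.add_constant((), np.array([0.05], dtype=np.float32)).get_output(0)

    # Quantize to INT8, then dequantize back to floating point (a Q/DQ pair).
    q = network.add_quantize(inp, scale)
    dq = network.add_dequantize(q.get_output(0), scale)
    network.mark_output(dq.get_output(0))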
For additional information, refer to the Working with Quantized Types section.
Tensors and Data Formats#
When defining a network, TensorRT assumes that multidimensional C-style arrays represent tensors. Each layer has a specific interpretation of its inputs: for example, a 2D convolution will assume that the last three dimensions of its input are in CHW format - there is no option to use, for example, a WHC format. Refer to the TensorRT Operator documentation for how each layer interprets inputs.
Note that tensors are limited to at most 2^31-1 elements.
While optimizing the network, TensorRT performs transformations internally (including to HWC, but also more complex formats) to use the fastest possible CUDA kernels. Formats are generally chosen to optimize performance, and applications cannot control the choices. However, the underlying data formats are exposed at I/O boundaries (network input and output and passing data to and from plugins) to allow applications to minimize unnecessary format transformations.
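For example, with the Python API an application can constrain a network input to a specific type and layout so that it can supply data in that form directly. The shape, the HALF type, and the HWC8 format below are illustrative choices (building such a network generally also requires enabling the corresponding reduced-precision builder flag):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

    inp = network.add_input("x", trt.DataType.FLOAT, (1, 8, 32, 32))

    # Constrain this network input to FP16 data in HWC8 layout so the application can
    # feed that format directly and avoid a reformat at the I/O boundary.
    inp.dtype = trt.DataType.HALF
    inp.allowed_formats = 1 << int(trt.TensorFormat.HWC8)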
For additional information, refer to the I/O Formats section.
Dynamic Shapes#
By default, TensorRT optimizes the model based on the input shapes (batch size, image size, and so on) at which it was defined. However, the builder can be configured to allow the input dimensions to be adjusted at runtime. To enable this, specify one or more instances of OptimizationProfile (C++, Python) in the builder configuration, containing a minimum and maximum shape for each input and an optimization point within that range.
TensorRT creates an optimized engine for each profile, choosing CUDA kernels that work for all shapes within the [minimum, maximum] range and are fastest for the optimization point - typically different kernels for each profile. You can then select among profiles at runtime.
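For a network input named "input" with a dynamic batch dimension, a profile might be defined as in the sketch below; the name and dimensions are placeholders. At runtime, you select a profile and set the actual input shape before enqueueing inference:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Build time: declare the allowed shape range and the optimization point for "input".
    profile = builder.create_optimization_profile()
    profile.set_shape("input", min=(1, 3, 224, 224), opt=(8, 3, 224, 224), max=(32, 3, 224, 224))
    config.add_optimization_profile(profile)

    # Runtime (given a deserialized engine and an execution context):
    #   context.set_optimization_profile_async(0, stream_handle)
    #   context.set_input_shape("input", (4, 3, 224, 224))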
For additional information, refer to the Working With Dynamic Shapes section.
DLA#
TensorRT supports NVIDIA’s Deep Learning Accelerator (DLA), a dedicated inference processor on many NVIDIA SoCs that supports a subset of TensorRT’s layers. TensorRT allows you to execute part of the network on the DLA and the rest on GPU; for layers that can be executed on either device, you can select the target device in the builder configuration on a per-layer basis.
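In the builder configuration, this looks roughly like the following Python sketch; whether a particular layer can actually run on DLA depends on the layer type and the target SoC:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Run supported layers on DLA core 0 and let unsupported layers fall back to the GPU.
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = 0
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
    config.set_flag(trt.BuilderFlag.FP16)          # DLA runs in reduced precision (FP16/INT8)

    # Per-layer override for a populated network (illustrative; requires at least one layer):
    #   config.set_device_type(network.get_layer(0), trt.DeviceType.GPU)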
For additional information, refer to the Working With DLA section.
Updating Weights#
When building an engine, you can specify that its weights may later be updated. This can be useful if you frequently update the model’s weights without changing the structure, such as in reinforcement learning or when retraining a model while retaining the same structure. Weight updates are performed using the Refitter (C++, Python) interface.
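A rough Python sketch of a refit follows. The plan path, the weight name "conv1.weight", and the replacement array are placeholders, and the engine must have been built with the REFIT builder flag:

    import numpy as np
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open("refittable.plan", "rb") as f:       # placeholder path; built with BuilderFlag.REFIT
        engine = runtime.deserialize_cuda_engine(f.read())

    refitter = trt.Refitter(engine, logger)

    # Replace a named weight set; the name and shape must match the original weights.
    new_weights = np.zeros((64, 3, 7, 7), dtype=np.float32)    # placeholder values and shape
    refitter.set_named_weights("conv1.weight", trt.Weights(new_weights))

    # After supplying all required weights, apply the update to the engine in place.
    assert refitter.refit_cuda_engine()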
For additional information, refer to the Refitting an Engine section.
Streaming Weights#
TensorRT can be configured to stream the network’s weights from host memory to device memory during network execution instead of placing them in device memory at engine load time. This enables models with weights larger than free GPU memory to run, but potentially with significantly increased latency. Weight streaming is an opt-in feature at both build time (BuilderFlag::kWEIGHT_STREAMING) and runtime (ICudaEngine::setWeightStreamingBudgetV2).
Note
Weight streaming is only supported with strongly typed networks. For additional information, refer to the Weight Streaming section.
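A hedged Python sketch of the build-time side follows; the runtime budget shown in the comment is an arbitrary example, and the exact name of the Python binding for the budget setter may differ between TensorRT versions, so check the API reference for your release:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # Build time: weight streaming requires a strongly typed network plus the opt-in flag.
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.WEIGHT_STREAMING)

    # Runtime (given a deserialized engine): set the device-memory budget for weights.
    # The property name below mirrors ICudaEngine::setWeightStreamingBudgetV2 and is an
    # assumption about the Python binding; example budget of 512 MiB.
    #   engine.weight_streaming_budget_v2 = 512 * 1024 * 1024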
trtexec#
Included in the samples directory is a command-line wrapper tool called trtexec. trtexec allows you to use TensorRT without having to develop your own application. The trtexec tool has three main purposes:
Benchmarking networks on random or user-provided input data.
Generating serialized engines from models.
Generating a serialized timing cache from the builder.
For additional information, refer to the trtexec section.
Polygraphy#
Polygraphy is a toolkit designed to assist in running and debugging deep learning models in TensorRT and other frameworks. It includes a Python API and a command-line interface (CLI) built using this API.
Among other things, with Polygraphy, you can:
Run inference among multiple backends, like TensorRT and ONNX-Runtime, and compare results (API, CLI).
Convert models to formats like TensorRT engines with post-training quantization (API, CLI).
View information about various types of models (CLI).
Modify ONNX models on the command line (CLI).
For additional information, refer to the Polygraphy repository.