FAQs

This section provides answers to the most frequently asked questions about TensorRT. Questions are organized by topic for easy navigation.

How do I create an optimized engine for several batch sizes?

While TensorRT allows an engine optimized for a given batch size to run at any smaller size, performance at those smaller sizes cannot be as well optimized. To optimize for multiple batch sizes, create an optimization profile for each batch size you care about, with that batch size used in the dimensions assigned to OptProfileSelector::kOPT. Refer to Optimization Profiles.
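The following is a minimal sketch of the one-profile-per-batch-size approach using the TensorRT Python API names create_optimization_profile(), set_shape(), and add_optimization_profile(). The helper name add_batch_profiles, the input name, and the choice of a minimum batch of 1 with the maximum equal to the kOPT batch are assumptions for illustration, not a recommended configuration.

```python
def add_batch_profiles(builder, config, input_name, sample_shape, batch_sizes):
    """Add one optimization profile per distinct batch size.

    `builder` and `config` are expected to behave like tensorrt.Builder and
    tensorrt.IBuilderConfig. `sample_shape` is the per-sample shape without
    the batch dimension, for example (3, 224, 224).
    """
    profiles = []
    for batch in sorted(set(batch_sizes)):
        min_shape = (1, *sample_shape)      # assumption: smallest batch is 1
        opt_shape = (batch, *sample_shape)  # the kOPT dimensions
        max_shape = (batch, *sample_shape)  # assumption: max equals kOPT
        profile = builder.create_optimization_profile()
        profile.set_shape(input_name, min_shape, opt_shape, max_shape)
        config.add_optimization_profile(profile)
        profiles.append(profile)
    return profiles
```

With a real builder and config, a call such as add_batch_profiles(builder, config, "input", (3, 224, 224), [1, 8, 32]) would register three profiles. At runtime the application still has to select the profile that matches the batch size it is about to run.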

Are engines portable across TensorRT versions?

By default, no. Refer to the Version Compatibility section for instructions on configuring engines for forward compatibility.

How do I choose the optimal workspace size?

Some TensorRT algorithms require additional workspace on the GPU. The method IBuilderConfig::setMemoryPoolLimit() controls the maximum amount of workspace that can be allocated and prevents the builder from considering algorithms that require more. At runtime, the space is allocated automatically when an IExecutionContext is created. The amount allocated is no more than is required, even if the limit set in IBuilderConfig::setMemoryPoolLimit() is much higher. Applications should therefore allow the TensorRT builder as much workspace as they can afford; at runtime, TensorRT allocates no more than this and typically less. The workspace size may need to be limited to less than the full device memory size if device memory is needed for other purposes during the engine build.
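As a rough sketch of "as much as you can afford", the helper below subtracts a reserve for other uses from the free device memory and passes the result to set_memory_pool_limit(), the Python counterpart of IBuilderConfig::setMemoryPoolLimit(). The helper name and the reserve policy are assumptions. With the real library, workspace_pool would be tensorrt.MemoryPoolType.WORKSPACE, and the free-memory figure would come from the CUDA runtime.

```python
def set_workspace_limit(config, workspace_pool, free_device_bytes, reserved_bytes=0):
    """Cap the builder workspace at free memory minus a reserve.

    `config` is expected to behave like tensorrt.IBuilderConfig.
    `reserved_bytes` is device memory kept back for other purposes
    during the engine build (a hypothetical policy knob).
    """
    if free_device_bytes < 0 or reserved_bytes < 0:
        raise ValueError("byte counts must be non-negative")
    if reserved_bytes > free_device_bytes:
        raise ValueError("reserve exceeds free device memory")
    limit = free_device_bytes - reserved_bytes
    config.set_memory_pool_limit(workspace_pool, limit)
    return limit
```

Because the limit is only an upper bound, setting it generously at build time does not by itself increase the memory allocated when the execution context is created.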

How do I use TensorRT on multiple GPUs?

TensorRT supports parallelizing workloads across multiple GPUs. For more information, refer to the Inference Library Overview.

You may also use TensorRT on a single, specific GPU when multiple are available. Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling enqueueV3(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.

How do I get the version of TensorRT from the library file?

The symbol table contains a symbol named tensorrt_version_#_#_#_# that encodes the TensorRT version number. One way to read this symbol on Linux is to use the nm command, as in the following example:

$ nm -D libnvinfer.so.* | grep tensorrt_version
00000000abcd1234 B tensorrt_version_#_#_#_#
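The same lookup can be scripted. This is a minimal sketch that assumes the four # placeholders stand for decimal fields. The function names are hypothetical, and read_tensorrt_version() simply runs the nm command shown above on a path you supply.

```python
import re
import subprocess

# Matches the version symbol and captures its four numeric fields.
_VERSION_SYMBOL = re.compile(r"\btensorrt_version_(\d+)_(\d+)_(\d+)_(\d+)\b")


def parse_tensorrt_version(nm_output):
    """Return the version as a tuple of four ints, or None if not found."""
    match = _VERSION_SYMBOL.search(nm_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def read_tensorrt_version(library_path):
    """Run `nm -D` on the given shared library and parse its output."""
    result = subprocess.run(
        ["nm", "-D", library_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tensorrt_version(result.stdout)
```

The parsing step is kept separate from the subprocess call so it can be applied to nm output captured elsewhere.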

What can I do if my network produces the wrong answer?

There are several reasons why your network might generate incorrect answers. The following troubleshooting approaches can help diagnose the problem:

Turn on VERBOSE-level messages from the log stream and check what TensorRT is reporting.
Check that your input preprocessing generates exactly the input format the network requires.
If you are using reduced precision, run the network in FP32. If FP32 produces the correct result, the lower precision may have insufficient dynamic range for the network.
Try marking intermediate tensors in the network as outputs and verify if they match your expectations.
Use NVIDIA Nsight Deep Learning Designer to inspect your compiled engine.

Note

Marking tensors as outputs can inhibit optimizations and, therefore, can change the results.

You can use NVIDIA Polygraphy to assist you with debugging and diagnosis.
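When comparing an FP32 run against a reduced-precision run, or an intermediate tensor against a reference, a tolerance-based comparison is usually more informative than exact equality. The sketch below works on flat sequences of Python floats. The function name and the default tolerances are assumptions and should be tuned for the network in question.

```python
import math


def compare_outputs(reference, candidate, atol=1e-3, rtol=1e-3):
    """Compare two flat sequences of floats element by element.

    An element mismatches when |ref - got| > atol + rtol * |ref|,
    or when the difference is NaN.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs have different lengths")
    max_abs_error = 0.0
    mismatches = 0
    for ref, got in zip(reference, candidate):
        err = abs(ref - got)
        if math.isnan(err):
            mismatches += 1  # NaN never compares as "close"
            continue
        max_abs_error = max(max_abs_error, err)
        if err > atol + rtol * abs(ref):
            mismatches += 1
    return {"max_abs_error": max_abs_error, "mismatches": mismatches}
```

A large mismatch count that disappears in FP32 points toward a precision issue, whereas mismatches that persist in FP32 suggest checking preprocessing or the network definition first.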

How do I implement batch normalization in TensorRT?

Batch normalization can be implemented using a sequence of IElementWiseLayer in TensorRT. More specifically:

adjustedScale = scale / sqrt(variance + epsilon)
batchNorm = (input * adjustedScale) + (bias - (adjustedScale * mean))
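A minimal sketch, in plain Python on scalars, checking that folding the mean and variance into an adjusted scale and an adjusted bias reproduces the textbook definition scale * (input - mean) / sqrt(variance + epsilon) + bias. The function names and the default epsilon are assumptions for illustration.

```python
import math


def batch_norm_reference(x, mean, variance, scale, bias, epsilon=1e-5):
    """Textbook batch normalization for a single value."""
    return scale * (x - mean) / math.sqrt(variance + epsilon) + bias


def batch_norm_folded(x, mean, variance, scale, bias, epsilon=1e-5):
    """Same result expressed as one multiply followed by one add."""
    adjusted_scale = scale / math.sqrt(variance + epsilon)
    adjusted_bias = bias - adjusted_scale * mean
    return x * adjusted_scale + adjusted_bias
```

Since adjusted_scale and adjusted_bias depend only on trained parameters, they can be precomputed as constants, leaving one elementwise product (ElementWiseOperation::kPROD) and one elementwise sum (ElementWiseOperation::kSUM) per input.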

Why does my network run slower when using DLA than without DLA?

DLA is not supported in TensorRT 11.0, 11.1, or 11.2. The guidance below applies to supported earlier releases (TensorRT 10.7 was the last release that supported DLA). Refer to Working with DLA.

DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation can be more performant. Which implementation you choose depends on your latency or throughput requirements and your power budget. Since all DLA engines are independent of the GPU and of each other, you could also use both implementations to further increase the throughput of your network.

Does TensorRT support INT4 quantization or INT16 quantization?

TensorRT supports INT4 quantization for GEMM weight-only quantization. TensorRT does not support INT16 quantization.

What should I do if the ONNX parser does not support a layer my network requires?

The TensorRT ONNX parser is an open-source project. You can add support for custom operators through TensorRT plugins.

Can I use multiple TensorRT builders to compile on different targets?

Warning

TensorRT assumes that all resources for the device it is building on are available for optimization purposes. Concurrent use of multiple TensorRT builders (such as multiple trtexec instances) to compile on different targets (DLA0, DLA1, and GPU) can oversubscribe system resources, causing undefined behavior such as inefficient plans, builder failure, or system instability.

It is recommended to compile for the different targets (DLA and GPU) separately, using trtexec with the --saveEngine argument, and to save their plan files. These plan files can then be loaded (using trtexec with the --loadEngine argument) to submit multiple inference jobs on the respective targets (DLA0, DLA1, and GPU). This two-step process avoids oversubscribing system resources during the build phase and lets execution of the plan files proceed without interference from the builder.
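The two-step process can be sketched as a small script that assembles the trtexec command lines and runs the builds one at a time. The helper names and file names are hypothetical. The sketch assumes an ONNX model as input and uses the --onnx, --useDLACore, and --allowGPUFallback arguments in addition to the two named above; any precision flags a DLA build needs are omitted.

```python
import subprocess


def build_command(onnx_path, plan_path, dla_core=None):
    """Command line that builds one engine and saves its plan file."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={plan_path}"]
    if dla_core is not None:
        cmd += [f"--useDLACore={dla_core}", "--allowGPUFallback"]
    return cmd


def load_command(plan_path, dla_core=None):
    """Command line that loads a saved plan file and runs inference."""
    cmd = ["trtexec", f"--loadEngine={plan_path}"]
    if dla_core is not None:
        cmd.append(f"--useDLACore={dla_core}")
    return cmd


def run_one_at_a_time(commands, run=subprocess.run):
    """Run each command to completion before starting the next."""
    for cmd in commands:
        run(cmd, check=True)
```

For example, run_one_at_a_time([build_command("model.onnx", "gpu.plan"), build_command("model.onnx", "dla0.plan", dla_core=0)]) builds the GPU and DLA0 engines sequentially, so only one builder is active at any time. The saved plans can then be launched with load_command() on their respective targets.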

Which layers are accelerated by Tensor Cores?

Most math-bound operations are accelerated with Tensor Cores: convolution, deconvolution, fully connected, and matrix multiply. In some cases, particularly for small channel counts or small group sizes, another implementation can be faster and is selected instead of a Tensor Core implementation.

Why are reformatting layers observed, although there is no warning message that no implementation obeys reformatting-free rules?

Reformat-free network I/O does not mean reformatting layers are not inserted into the entire network. Only the input and output network tensors can be configured not to allow reformatting layers; in other words, TensorRT can insert reformatting layers for internal tensors to improve performance.