Troubleshooting#
The following sections help answer the most commonly asked questions regarding typical use cases.
For more assistance, contact your support engineer or post your questions on the NVIDIA Developer Forum for troubleshooting support.
FAQs#
This section helps you troubleshoot common problems and answers the most frequently asked questions.
- Q: How do I create an optimized engine for several batch sizes?
A: While TensorRT allows an engine optimized for a given batch size to run at any smaller size, the performance for those smaller sizes cannot be as well optimized. To optimize for multiple batch sizes, create optimization profiles at the dimensions assigned to OptProfileSelector::kOPT; a minimal sketch is shown below.
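For illustration, here is a minimal sketch of creating one optimization profile per batch size of interest; the input tensor name "input", the [N, 3, 224, 224] dimensions, and the batch sizes are illustrative assumptions, and the builder and config objects are assumed to exist already:

#include <NvInfer.h>

// Sketch: add one optimization profile per target batch size so each size gets its
// own kOPT dimensions. The tensor name and shape below are example values only.
void addBatchProfiles(nvinfer1::IBuilder& builder, nvinfer1::IBuilderConfig& config)
{
    for (int batch : {1, 8, 32})
    {
        nvinfer1::IOptimizationProfile* profile = builder.createOptimizationProfile();
        profile->setDimensions("input", nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4{1, 3, 224, 224});
        profile->setDimensions("input", nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4{batch, 3, 224, 224});
        profile->setDimensions("input", nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4{batch, 3, 224, 224});
        config.addOptimizationProfile(profile);
    }
}

At runtime, select the profile that matches the batch size (for example, with IExecutionContext::setOptimizationProfileAsync()) before setting the input shape.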
- Q: Are calibration tables portable across TensorRT versions?
A: No. Internal implementations are continually optimized and can change between versions. For this reason, calibration tables are not guaranteed to be binary compatible with different versions of TensorRT. Applications must build new INT8 calibration tables when using a new version of TensorRT.
- Q: Are engines portable across TensorRT versions?
A: By default, no. Refer to the Version Compatibility section for instructions on configuring engines for forward compatibility.
- Q: How do I choose the optimal workspace size?
A: Some TensorRT algorithms require additional workspace on the GPU. The method IBuilderConfig::setMemoryPoolLimit() controls the maximum amount of workspace that can be allocated and prevents algorithms that require more workspace from being considered by the builder. At runtime, the space is allocated automatically when creating an IExecutionContext. The amount allocated is no more than is required, even if the amount set in IBuilderConfig::setMemoryPoolLimit() is much higher. Applications should, therefore, allow the TensorRT builder as much workspace as they can afford; at runtime, TensorRT allocates no more than this and typically less. The workspace size may need to be limited to less than the full device memory size if device memory is needed for other purposes during the engine build. A minimal sketch of setting the limit is shown below.
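As an illustration, here is a minimal sketch of setting the workspace pool limit; the 1 GiB value is an arbitrary example, and the config object is assumed to have been created by the builder:

#include <NvInfer.h>

// Sketch: allow the builder up to 1 GiB of workspace (example value only).
void limitWorkspace(nvinfer1::IBuilderConfig& config)
{
    config.setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30); // 1 GiB
}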
- Q: How do I use TensorRT on multiple GPUs?
A: Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary. A sketch is shown below.
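For example, here is a hedged sketch of deserializing an engine on GPU 1; error handling is omitted, and the runtime object and serialized engine buffer are assumed to exist:

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <cstddef>

// Sketch: bind a deserialized engine to GPU 1 by selecting the device first.
nvinfer1::ICudaEngine* loadOnDevice(nvinfer1::IRuntime& runtime, void const* blob, std::size_t size)
{
    cudaSetDevice(1);                                 // select the target GPU
    return runtime.deserializeCudaEngine(blob, size); // engine is now bound to GPU 1
}

Any thread that later calls execute() or enqueue() on a context created from this engine should also call cudaSetDevice(1) first.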
- Q: How do I get the version of TensorRT from the library file?
A: There is a symbol in the symbol table named tensorrt_version_#_#_#_# which contains the TensorRT version number. One possible way to read this symbol on Linux is to use the nm command as in the following example:
$ nm -D libnvinfer.so.* | grep tensorrt_version
00000000abcd1234 B tensorrt_version_#_#_#_#
- Q: What can I do if my network produces the wrong answer?
A: There are several reasons why your network can be generating incorrect answers. Here are some troubleshooting approaches that can help diagnose the problem:
- Turn on VERBOSE-level messages from the log stream and check what TensorRT is reporting.
- Check that your input preprocessing generates exactly the input format the network requires.
- If you are using reduced precision, run the network in FP32. If it produces the correct result, lower precision may have an insufficient dynamic range for the network.
- Try marking intermediate tensors in the network as outputs and verify if they match your expectations (see the sketch after this list).
Note
Marking tensors as outputs can inhibit optimizations and, therefore, can change the results.
- You can use NVIDIA Polygraphy to assist you with debugging and diagnosis.
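As referenced in the list above, here is a minimal sketch of marking an intermediate tensor as an additional output; the layer index 5 is an illustrative assumption, and the network object is assumed to exist:

#include <NvInfer.h>

// Sketch: expose one intermediate tensor as an extra engine output for debugging.
// Layer index 5 is an arbitrary example; pick the layer you want to inspect.
void debugIntermediate(nvinfer1::INetworkDefinition& network)
{
    nvinfer1::ITensor* t = network.getLayer(5)->getOutput(0);
    network.markOutput(*t); // this tensor will now appear among the engine outputs
}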
- Q: How do I implement batch normalization in TensorRT?
A: Batch normalization can be implemented using a sequence of IElementWiseLayer in TensorRT. More specifically:
adjustedScale = scale / sqrt(variance + epsilon)
batchNorm = (input + bias - (adjustedScale * mean)) * adjustedScale
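For illustration, here is a hedged sketch of one way to express this with IElementWiseLayer; the helper name, the precomputed constants (adjustedScale and shiftedBias = bias - adjustedScale * mean), and the broadcastable constant dimensions are assumptions made for the example:

#include <NvInfer.h>

// Sketch: per-channel batch normalization as a sum followed by a product.
// adjustedScale and shiftedBias are assumed to be precomputed on the host from the
// trained scale, bias, mean, variance, and epsilon; constDims is typically {1, C, 1, 1}
// so the constants broadcast across N, H, and W. The Weights memory must stay valid
// until the engine is built.
nvinfer1::ITensor* addBatchNorm(nvinfer1::INetworkDefinition& network, nvinfer1::ITensor& input,
    nvinfer1::Weights adjustedScale, nvinfer1::Weights shiftedBias, nvinfer1::Dims constDims)
{
    nvinfer1::ITensor* scaleT = network.addConstant(constDims, adjustedScale)->getOutput(0);
    nvinfer1::ITensor* biasT = network.addConstant(constDims, shiftedBias)->getOutput(0);

    // batchNorm = (input + (bias - adjustedScale * mean)) * adjustedScale
    nvinfer1::ITensor* sum = network.addElementWise(input, *biasT, nvinfer1::ElementWiseOperation::kSUM)->getOutput(0);
    return network.addElementWise(*sum, *scaleT, nvinfer1::ElementWiseOperation::kPROD)->getOutput(0);
}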
- Q: Why does my network run slower when using DLA than without DLA?
A: DLA was designed to maximize energy efficiency. Depending on the features supported by DLA and the features supported by the GPU, either implementation can be more performant. Your chosen implementation depends on your latency or throughput requirements and power budget. Since all DLA engines are independent of the GPU and each other, you could also use both implementations to increase the throughput of your network further.
- Q: Does TensorRT support INT4 quantization or INT16 quantization?
A: TensorRT supports INT4 quantization for GEMM weight-only quantization. TensorRT does not support INT16 quantization.
- Q: My network requires a layer that the UFF parser does not support. When will it be supported?
A: UFF is deprecated. We recommend users switch their workflows to ONNX. The TensorRT ONNX parser is an open-source project.
- Q: Can I use multiple TensorRT builders to compile on different targets?
A: TensorRT assumes that all resources for the device it is building on are available for optimization purposes. Concurrent use of multiple TensorRT builders (for example, multiple trtexec instances) to compile on different targets (DLA0, DLA1, and GPU) can oversubscribe system resources causing undefined behavior (meaning, inefficient plans, builder failure, or system instability).
It is recommended to use trtexec with the --saveEngine argument to compile for different targets (DLA and GPU) separately and save their plan files. Such plan files can then be reused for loading (using trtexec with the --loadEngine argument) and submitting multiple inference jobs on the respective targets (DLA0, DLA1, and GPU). This two-step process alleviates over-subscription of system resources during the build phase while also allowing execution of the plan file to proceed without interference by the builder.
- Q: Which layers are accelerated by Tensor Cores?
A: Most math-bound operations will be accelerated with tensor cores - convolution, deconvolution, fully connected, and matrix multiply. In some cases, particularly for small channel counts or small group sizes, another implementation may be faster and be selected instead of a tensor core implementation.
- Q: Why are reformatting layers observed, although there is no warning message that “no implementation obeys reformatting-free rules”?
A: Reformat-free network I/O does not mean reformatting layers are not inserted into the entire network. Only the input and output network tensors can be configured not to require reformatting layers; in other words, TensorRT can insert reformatting layers for internal tensors to improve performance.
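As an illustration of configuring a network boundary tensor, here is a minimal sketch that restricts the first network input to the linear format; the choice of input index and format is an example only, and internal tensors may still be reformatted:

#include <NvInfer.h>
#include <cstdint>

// Sketch: constrain a network input to TensorFormat::kLINEAR so no reformatting layer
// is inserted at that network boundary. Internal tensors are unaffected by this setting.
void setLinearInput(nvinfer1::INetworkDefinition& network)
{
    nvinfer1::ITensor* input = network.getInput(0);
    input->setAllowedFormats(1U << static_cast<uint32_t>(nvinfer1::TensorFormat::kLINEAR));
}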
Understanding Error Messages#
If an error occurs during execution, TensorRT reports an error message intended to help debug the problem. The following sections discuss some common error messages that developers can encounter.
ONNX Parser Error Messages
The following table captures the common ONNX parser error messages. For specific ONNX node support information, refer to the Operators’ support document.
| Error Message | Description |
|---|---|
| | These error messages signify that an ONNX node input tensor is expected to be an initializer in TensorRT. A possible fix is to run constant folding on the model using TensorRT’s Polygraphy tool: polygraphy surgeon sanitize model.onnx --fold-constants --output model_folded.onnx |
| | This is an error stating that the ONNX parser does not have an import function defined for a particular operator and did not find a corresponding plugin in the loaded registry for the operator. |
TensorRT Core Library Error Messages
The following table captures the common TensorRT core library error messages.
| Error Message | Description |
|---|---|
| Installation Errors | |
| | This error can occur if the CUDA or NVIDIA driver installation is corrupt. Refer to the URL for instructions on installing CUDA and the NVIDIA driver on your OS. |
| Builder Errors | |
| | This error occurs because there is no layer implementation for the given node in the network that can operate with the given workspace size. This usually occurs because the workspace size is insufficient but could also indicate a bug. If increasing the workspace size as suggested does not help, report a bug (refer to Reporting TensorRT Issues). |
| | This error occurs when a mismatch between the values and count fields in a Weights data structure is passed to the builder. If the count is 0, the values field must contain a null pointer; otherwise, the count must be non-zero, and values must contain a non-null pointer. |
| | You can encounter error messages indicating that the tensor dimensions do not match the semantics of the given layer. Carefully read the documentation on NvInfer.h on the usage of each layer and the expected dimensions of the tensor inputs and outputs to the layer. |
| INT8 Calibration Errors | |
| | This warning occurs and should be treated as an error when data distribution for a tensor is uniformly zero. In a network, the output tensor distribution can be uniformly zero under the following scenarios: constant tensor with all zero values (not an error); activation (ReLU) output with all negative inputs (not an error); data distribution forced to all zero due to a computation error in the previous layer (emit a warning here [1]); the user does not provide any calibration images (emit a warning here [1]). |
| | This error indicates that a calibration failure occurred with no scaling factors detected. This could be due to a lack of an INT8 calibrator or insufficient custom scales for network layers. |
| Engine Compatibility Errors | |
| | This error can occur if you are running TensorRT using an engine PLAN file that is incompatible with the current version of TensorRT. Ensure you use the same TensorRT version when generating and running the engine. |
| | This error can occur if you build an engine on a device with a compute capability different from that of the device used to run the engine. |
| | This warning can occur if you build an engine on a device with the same compute capability as, but not identical to, the device running the engine. As the warning indicates, it is highly recommended to use a device of the same model when generating the engine and deploying it to avoid compatibility issues. |
| Out Of Memory Errors | |
| | These error messages can occur if insufficient GPU memory is available to instantiate a TensorRT engine. Verify that the GPU has sufficient memory to contain the required layer weights and activation tensors. |
| FP16 Errors | |
| | This error message can occur if you attempt to deserialize an engine that uses FP16 arithmetic on a GPU that does not support FP16 arithmetic. You either must rebuild the engine without FP16 precision inference or upgrade your GPU to a model that supports FP16 precision inference. |
| Plugin Errors | |
| | This error can occur if a plugin layer’s initialize() method returns a non-zero value. Refer to the implementation of that layer to debug this error further. For more information, refer to the NVIDIA TensorRT Operator’s Reference. |
Code Analysis Tools#
Compiler Sanitizers#
Google sanitizers are a set of code analysis tools.
Issues with dlopen and Address Sanitizer#
There is a known issue with sanitizers, which is documented here. When using dlopen on TensorRT under a sanitizer, there will be reports of memory leaks unless one of two solutions is adopted:
- Do not call dlclose when running under the sanitizers.
- Pass the flag RTLD_NODELETE to dlopen when running under sanitizers (see the sketch below).
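For example, here is a minimal sketch of the second option; the library name is illustrative:

#include <dlfcn.h>

// Sketch: load TensorRT with RTLD_NODELETE so the address sanitizer does not report
// spurious leaks when the library handle is later closed.
void* loadTensorRT()
{
    return dlopen("libnvinfer.so", RTLD_LAZY | RTLD_NODELETE);
}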
Issues with dlopen and Thread Sanitizer#
The thread sanitizer can list errors when using dlopen from multiple threads. To suppress this warning, create a file called tsan.supp and add the following to the file:
race::dlopen
When running applications under thread sanitizer, set the environment variable using:
export TSAN_OPTIONS="suppressions=tsan.supp"
Issues with CUDA and Address Sanitizer#
The address sanitizer has a known issue with CUDA applications, which is documented here. To successfully run CUDA libraries such as TensorRT under the address sanitizer, add the option protect_shadow_gap=0 to the ASAN_OPTIONS environment variable.
A known bug in CUDA 11.4 can trigger mismatched allocation and free errors in the address sanitizer. To disable these errors, add alloc_dealloc_mismatch=0 to ASAN_OPTIONS.
Issues with Undefined Behavior Sanitizer#
UndefinedBehaviorSanitizer (UBSan) reports false positives with the -fvisibility=hidden option, as documented here. Add the -fno-sanitize=vptr option to avoid UBSan reporting such false positives.
Valgrind#
Valgrind is a framework for dynamic analysis tools that can automatically detect memory management and threading bugs in applications.
Some versions of Valgrind and glibc are affected by a bug that causes false memory leaks to be reported when dlopen is used; this can generate spurious errors when running a TensorRT application under Valgrind’s memcheck tool. To work around this, add the following to a Valgrind suppressions file as documented here:
{
Memory leak errors with dlopen
Memcheck:Leak
match-leak-kinds: definite
...
fun:*dlopen*
...
}
A known bug in CUDA 11.4 can trigger mismatched allocation and free errors in Valgrind. To disable these errors, add the option --show-mismatched-frees=no to the Valgrind command line.
Compute Sanitizer#
When running a TensorRT application under compute-sanitizer, cuGetProcAddress can fail with error code 500 due to missing functions. This error can be ignored or suppressed with the --report-api-errors no option. This is due to CUDA backward compatibility checking whether a function is usable on the CUDA toolkit/driver combination. The functions are introduced in a later version of CUDA but are unavailable on the current platform.
Understanding Formats Printed in Logs#
In logs from TensorRT, formats are printed as a type followed by stride and vectorization information. For example:
Half(60,1:8,12,3)
Where:
- Half indicates that the element type is DataType::kHALF, a 16-bit floating point.
- :8 indicates the format packs eight elements per vector and that vectorization is along the second axis.
The rest of the numbers are strides in units of vectors. For this tensor, the mapping of a coordinate (n,c,h,w) to an address is:
((half*)base_address) + (60*n + 1*floor(c/8) + 12*h + 3*w) * 8 + (c mod 8)
The 1: is common to NHWC formats. Here is another example, this time of an NCHW format:
Int8(105,15:4,3,1)
The Int8 indicates that the element type is DataType::kINT8, and the :4 indicates a vector size of 4. For this tensor, the mapping of a coordinate (n,c,h,w) to an address is:
(int8_t*)base_address + (105*n + 15*floor(c/4) + 3*h + w) * 4 + (c mod 4)
Scalar formats have a vector size of 1. For brevity, printing omits the :1.
In general, the coordinates to address mappings have the following form:
(type*)base_address + (vec_coordinate · strides) * vec_size + vec_mod
Where:
- the dot denotes an inner product
- strides are the printed strides, that is, strides in units of vectors
- vec_size is the number of elements per vector
- vec_coordinate is the original coordinate with the coordinate along the vectorized axis divided by vec_size
- vec_mod is the original coordinate along the vectorized axis modulo vec_size
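As a concrete illustration, here is a small sketch that applies this mapping to the Half(60,1:8,12,3) example above; the helper name is an assumption made for the example:

#include <cstdint>

// Sketch: element offset for Half(60,1:8,12,3), following
// (60*n + 1*floor(c/8) + 12*h + 3*w) * 8 + (c mod 8).
std::int64_t halfTensorOffset(std::int64_t n, std::int64_t c, std::int64_t h, std::int64_t w)
{
    std::int64_t const strides[] = {60, 1, 12, 3}; // printed strides, in units of vectors
    std::int64_t const vecSize = 8;                // eight elements per vector along the C axis
    std::int64_t const vecCoord[] = {n, c / vecSize, h, w};
    std::int64_t offset = 0;
    for (int i = 0; i < 4; ++i)
    {
        offset += strides[i] * vecCoord[i];        // inner product of vec_coordinate and strides
    }
    return offset * vecSize + (c % vecSize);       // scale by vec_size and add vec_mod
}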
Reporting TensorRT Issues#
If you encounter issues when using TensorRT, check the FAQs and the Understanding Error Messages sections to look for similar failing patterns. For example, many engine building failures can be solved by sanitizing and constant-folding the ONNX model using Polygraphy with the following command:
polygraphy surgeon sanitize model.onnx --fold-constants --output model_folded.onnx
In addition, if you have not already done so, it is highly recommended that you try the latest TensorRT release before filing an issue, since the problem may already have been fixed.
Channels for TensorRT Issue Reporting#
If neither the FAQs nor the Understanding Error Messages sections help, you can report the issue through the NVIDIA Developer Forum or the TensorRT GitHub Issue page. These channels are constantly monitored to provide feedback on the issues you encounter.
Here are the steps to report an issue on the NVIDIA Developer Forum:
1. Register for the NVIDIA Developer website.
2. Log in to the developer site.
3. Click on your name in the upper right corner.
4. Click My Account > My Bugs and select Submit a New Bug.
5. Fill out the bug reporting page. Be descriptive and provide the steps to reproduce the problem.
6. Click Submit a bug.
When reporting an issue, provide setup details and include the following information:
Environment information:
- OS or Linux distro and version
- GPU type
- NVIDIA driver version
- CUDA version
- cuDNN version
- Python version (if Python is used)
- TensorFlow, PyTorch, and ONNX versions (if any of them are used)
- TensorRT version
- NGC TensorRT container version (if the TensorRT container is used)
- Jetson OS and hardware versions (if Jetson is used)
A thorough description of the issue.
Steps to reproduce the issue:
- ONNX file (if ONNX is used)
- Minimal commands or scripts to trigger the issue
- Verbose logs obtained by enabling kVERBOSE in ILogger (a minimal logger sketch is shown below)
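As referenced above, here is a minimal sketch of a logger that does not filter out verbose messages; the class name is an assumption for the example, and an instance of it would be passed to createInferBuilder or createInferRuntime:

#include <NvInfer.h>
#include <iostream>

// Sketch: print every message, including Severity::kVERBOSE, so the full log can be
// attached to a bug report.
class VerboseLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, char const* msg) noexcept override
    {
        static_cast<void>(severity); // no filtering: forward everything
        std::cout << msg << std::endl;
    }
};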
Depending on the type of issue, providing the additional information listed below can expedite the response and debugging process.
Reporting a Functional Issue#
When reporting functional issues, such as linker errors, segmentation faults, engine building failures, inference failures, and so on, provide the scripts and commands to reproduce the issue and a detailed description of the environment. Having more details helps us debug the functional issue faster.
Since a TensorRT engine is specific to a particular TensorRT version and a particular GPU type, do not build the engine in one environment and run it in another environment with different GPUs or a different dependency software stack (TensorRT version, CUDA version, cuDNN version, and so on). Also, ensure the application is linked to the correct TensorRT and cuDNN shared object files by checking the environment variable LD_LIBRARY_PATH (or %PATH% on Windows).
Reporting an Accuracy Issue#
When reporting an accuracy issue, provide the scripts and the commands used to calculate the accuracy metrics. Describe the expected accuracy level and share the steps to get the expected results using other frameworks like ONNX-Runtime.
The Polygraphy tool can debug the accuracy issue and produce a minimal failing case. For instructions, refer to the documentation on Debugging TensorRT Accuracy Issues. Having a Polygraphy command that shows the accuracy issue or having a minimal failing case expedites the time it takes for us to debug your accuracy issue.
Note that it is not practical to expect bitwise identical results between TensorRT and other frameworks like PyTorch, TensorFlow, or ONNX-Runtime even in FP32 precision since the order of the computations on the floating-point numbers can result in slight differences in output values. In practice, small numeric differences should not significantly affect the accuracy metric of the application, such as the mAP score for object-detection networks or the BLEU score for translation networks. If you see a significant drop in the accuracy metric between TensorRT and other frameworks such as PyTorch, TensorFlow, or ONNX-Runtime, it may be a genuine TensorRT bug.
If you are seeing NaNs or infinite values in TensorRT engine output when FP16/BF16 precision is enabled, it is possible that intermediate layer outputs in the network overflow in FP16/BF16. Some approaches to help mitigate this include:
- Ensuring that network weights and inputs are restricted to a reasonably narrow range (such as [-1, 1] instead of [-100, 100]). This may require making changes to the network and retraining.
- Consider pre-processing input by scaling or clipping it to the restricted range before passing it to the network for inference.
- Overriding precision for individual layers vulnerable to overflows (for example, Reduce and Element-Wise Power ops) to FP32, as sketched below.
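For illustration, here is a minimal sketch of the last approach; the layer index is an illustrative assumption:

#include <NvInfer.h>

// Sketch: force one overflow-prone layer to FP32 while the rest of the network may use
// FP16/BF16. Layer index 5 is an example only; select the layer that overflows.
void pinLayerToFP32(nvinfer1::INetworkDefinition& network, nvinfer1::IBuilderConfig& config)
{
    nvinfer1::ILayer* layer = network.getLayer(5);
    layer->setPrecision(nvinfer1::DataType::kFLOAT);
    layer->setOutputType(0, nvinfer1::DataType::kFLOAT);
    config.setFlag(nvinfer1::BuilderFlag::kOBEY_PRECISION_CONSTRAINTS); // honor per-layer precisions
}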
Polygraphy can help you diagnose common problems by using reduced precision. Refer to Polygraphy’s Working with Reduced Precision how-to guide for more information.
For possible solutions to accuracy issues, refer to the Improving Model Accuracy section and the Working with Quantized Types section for instructions about using INT8/FP8 precision.
Reporting a Performance Issue#
If you are reporting a performance issue, share the full trtexec
logs using this command:
trtexec --onnx=<onnx_file> <precision_and_shape_flags> --verbose --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun --useCudaGraph --noDataTransfers --useSpinWait --duration=60
The verbose logs help us to identify the performance issue. If possible, also share the Nsight Systems profiling files using these commands:
trtexec --onnx=<onnx_file> <precision_and_shape_flags> --verbose --profilingVerbosity=detailed --dumpLayerInfo --saveEngine=<engine_path>
nsys profile --cuda-graph-trace=node -o <output_profile> trtexec --loadEngine=<engine_path> <precision_and_shape_flags> --useCudaGraph --noDataTransfers --useSpinWait --warmUp=0 --duration=0 --iterations=20
Refer to the trtexec section for more instructions on using the trtexec tool and the meaning of these flags.
If you do not use trtexec to measure performance, provide the scripts and commands you use to measure it. Compare the performance measurement from your script with that from the trtexec tool. If the two numbers differ, your scripts may have some issues with the performance measurement methodology.
Refer to the Hardware/Software Environment for Performance Measurements section for some environmental factors affecting performance.
Footnotes