How TensorRT Works#

This section provides more detail on how TensorRT works.

Object Lifetimes#

TensorRT’s API is class-based, with some classes acting as factories for other classes. For objects owned by the user, the lifetime of a factory object must span the lifetime of objects it creates. For example, the NetworkDefinition and BuilderConfig classes are created from the Builder class, and objects of those classes should be destroyed before the Builder factory object.

An important exception to this rule is creating an engine from a builder. After creating an engine, you may destroy the builder, network, parser, and build config and continue using the engine.
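A minimal C++ sketch of this ordering, assuming TensorRT 8 or later (where API objects may be destroyed with delete); the helper name buildPlan and the use of std::unique_ptr are illustrative, not required:

#include <memory>

#include "NvInfer.h"

// Hypothetical helper illustrating object lifetimes: the builder is the factory
// for the network definition and the builder config, so it must outlive both.
// Network population and build configuration are elided.
std::unique_ptr<nvinfer1::IHostMemory> buildPlan(nvinfer1::ILogger& logger)
{
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(0));
    auto config  = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

    // ... populate the network (for example, with a parser) and configure the build ...

    auto plan = std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));

    // Returning destroys config and network before the builder, which is the
    // required order. The serialized plan, and any engine deserialized from it,
    // remains usable after the builder is gone.
    return plan;
}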

Error Handling and Logging#

When creating TensorRT top-level interfaces (builder, runtime, or refitter), you must provide an implementation of the Logger (C++, Python) interface. The logger is used for diagnostics and informational messages; its verbosity level is configurable. Since the logger may pass back information at any point in TensorRT’s lifetime, its lifetime must span any use of that interface in your application. The implementation must also be thread-safe, since TensorRT may use worker threads internally.
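A minimal sketch of a conforming logger, assuming output to stderr; the class name StderrLogger and the kINFO threshold are illustrative, and the mutex is one simple way to meet the thread-safety requirement:

#include <cstdio>
#include <mutex>

#include "NvInfer.h"

// Minimal ILogger implementation. TensorRT may call log() from worker threads,
// so output is serialized with a mutex.
class StderrLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, char const* msg) noexcept override
    {
        if (severity > Severity::kINFO) // drop kVERBOSE messages; adjust as needed
            return;
        std::lock_guard<std::mutex> lock(mMutex);
        std::fprintf(stderr, "[TRT] %s\n", msg);
    }

private:
    std::mutex mMutex;
};

// The logger must outlive every interface created with it, for example:
//   StderrLogger gLogger;
//   auto* runtime = nvinfer1::createInferRuntime(gLogger);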

An API call to an object will use the logger associated with the corresponding top-level interface. For example, in a call to ExecutionContext::enqueueV3(), the execution context was created from an engine created from a runtime, so TensorRT will use the logger associated with that runtime.

The primary method of error handling is the ErrorRecorder (C++, Python) interface. You can implement this interface and attach it to an API object to receive errors associated with it. The recorder for an object will also be passed to any others it creates - for example, if you attach an error recorder to an engine and create an execution context from that engine, it will use the same recorder. If you then attach a new error recorder to the execution context, it will receive only errors from that context. If an error is generated but no error recorder is found, it will be emitted through the associated logger.

Note that CUDA errors are generally asynchronous - so when performing multiple inferences or other streams of CUDA work asynchronously in a single CUDA context, an asynchronous GPU error may be observed in a different execution context than the one that generated it.

Memory#

TensorRT uses considerable amounts of device memory (that is, memory directly accessible by the GPU, as opposed to the host memory attached to the CPU). Since device memory is often a constrained resource, it is important to understand how TensorRT uses it.

The Build Phase#

During the build, TensorRT allocates device memory for timing layer implementations. Some implementations can consume a lot of temporary memory, especially with large tensors. You can control the maximum amount of temporary memory through the memory pool limits of the builder config. The workspace size defaults to the full size of the device’s global memory but can be restricted when necessary. If the builder finds applicable kernels that could not be run because of insufficient workspace, it will emit a logging message indicating this.
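For example, the workspace pool can be capped through the builder config; the 1 GiB value below is arbitrary and purely illustrative:

#include <memory>

#include "NvInfer.h"

// Restrict the temporary workspace used for timing to 1 GiB.
// 'builder' is assumed to be an already-created IBuilder.
void limitWorkspace(nvinfer1::IBuilder& builder)
{
    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder.createBuilderConfig());
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30);
    // ... pass the config to buildSerializedNetwork() as usual ...
}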

Even with relatively little workspace, however, timing requires creating buffers for input, output, and weights. TensorRT is robust against the operating system (OS) returning out-of-memory for such allocations. On some platforms, the OS may successfully provide memory, and then the out-of-memory killer process observes that the system is low on memory and kills TensorRT. If this happens, free up as much system memory as possible before retrying.

During the build phase, at least two copies of the weights will typically be in host memory: those from the original network and those included as part of the engine as it is built. In addition, when TensorRT combines weights (for example, convolution with batch normalization), additional temporary weight tensors will be created.

The Runtime Phase#

TensorRT uses relatively little host memory at runtime but can use considerable device memory.

An engine allocates device memory to store the model weights upon deserialization. Since the serialized engine consists almost entirely of weights, its size is a good approximation of the device memory the weights require.

An ExecutionContext uses two kinds of device memory:

  • Some layer implementations require persistent memory - for example, some convolution implementations use edge masks. Unlike weights, this state cannot be shared between contexts because its size depends on the layer input shape, which may vary across contexts. This memory is allocated upon creation of the execution context and lasts for its lifetime.

  • Enqueue memory is used to hold intermediate results while processing the network. It holds intermediate activation tensors (called activation memory) and temporary storage required by layer implementations (called scratch memory); the bound on scratch memory is controlled by IBuilderConfig::setMemoryPoolLimit(). TensorRT optimizes memory usage in a couple of ways:

    • By sharing a block of device memory across activation tensors with disjoint lifetimes.

    • By allowing transient (scratch) tensors, where feasible, to occupy unused activation memory. Therefore, the enqueue memory required by TensorRT lies in the range [total activation memory, total activation memory + max scratch memory].

You may optionally create an execution context without enqueue memory using ICudaEngine::createExecutionContextWithoutDeviceMemory() and provide that memory yourself for the duration of network execution. This allows you to share the memory between multiple contexts that do not run concurrently, or to use it for other purposes while inference is not running. The amount of enqueue memory required is returned by ICudaEngine::getDeviceMemorySizeV2().
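A sketch of this pattern, assuming a TensorRT 10 runtime (getDeviceMemorySizeV2() and setDeviceMemoryV2(); earlier releases use getDeviceMemorySize() and setDeviceMemory()); error checking is omitted:

#include <memory>

#include <cuda_runtime_api.h>

#include "NvInfer.h"

// Create an execution context whose enqueue memory is supplied by the caller.
// 'engine' is assumed to be an already-deserialized ICudaEngine.
void runWithExternalMemory(nvinfer1::ICudaEngine& engine)
{
    auto const size = engine.getDeviceMemorySizeV2();

    void* deviceMemory{nullptr};
    cudaMalloc(&deviceMemory, static_cast<size_t>(size));

    auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
        engine.createExecutionContextWithoutDeviceMemory());
    context->setDeviceMemoryV2(deviceMemory, size);

    // ... set tensor addresses and call enqueueV3(); the memory must stay valid,
    // and must not be used by another concurrently running context, until
    // execution completes ...

    cudaFree(deviceMemory);
}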

Information about the amount of persistent memory and scratch memory used by the execution context is emitted by the builder when building the network at severity kINFO. Examining the log, the messages look similar to the following:

[08/12/2021-17:39:11] [I] [TRT] Total Host Persistent Memory: 106528
[08/12/2021-17:39:11] [I] [TRT] Total Device Persistent Memory: 29785600
[08/12/2021-17:39:11] [I] [TRT] Max Scratch Memory: 9970688

By default, TensorRT allocates device memory directly from CUDA. However, you can attach an implementation of TensorRT’s IGpuAllocator (C++, Python) interface to the builder or runtime and manage device memory yourself. This is useful if your application wants to control all GPU memory and sub-allocate to TensorRT instead of having TensorRT allocate directly from CUDA.

NVIDIA cuDNN and NVIDIA cuBLAS can occupy large amounts of device memory. TensorRT allows you to control whether these libraries are used for inference using the builder configuration’s TacticSources (C++, Python) attribute. Some plugin implementations require these libraries, so the network may not be compiled successfully when they are excluded. If the appropriate tactic sources are set, the cudnnContext and cublasContext handles are passed to the plugins using IPluginV2Ext::attachToContext().
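For example, cuDNN and cuBLAS can be removed from the tactic sources through the builder config; the enum values below match recent TensorRT releases, so check the API reference for your version:

#include <cstdint>

#include "NvInfer.h"

// Exclude cuDNN and cuBLAS tactics. Plugins that rely on these libraries may
// then fail to build. 'config' is assumed to be an existing IBuilderConfig.
void disableExternalTactics(nvinfer1::IBuilderConfig& config)
{
    auto sources = config.getTacticSources();
    sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUDNN));
    sources &= ~(1U << static_cast<uint32_t>(nvinfer1::TacticSource::kCUBLAS));
    config.setTacticSources(sources);
}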

The CUDA infrastructure and TensorRT’s device code also consume device memory. The amount of memory varies by platform, device, and TensorRT version. You can use cudaGetMemInfo to determine the total amount of device memory.

TensorRT measures the memory used before and after critical operations in the builder and runtime. These memory usage statistics are printed to TensorRT’s information logger. For example:

[MemUsageChange] Init CUDA: CPU +535, GPU +0, now: CPU 547, GPU 1293 (MiB)

This indicates that memory use changed during CUDA initialization. CPU +535, GPU +0 is the amount by which memory increased while running CUDA initialization. The content after now: is the CPU/GPU memory usage snapshot after CUDA initialization.

Note

In a multi-tenant situation, the reported memory use by cudaGetMemInfo and TensorRT is prone to race conditions, where a new allocation/free is done by a different process or thread. Since CUDA does not control memory on unified-memory devices, the results returned by cudaGetMemInfo may not be accurate on these platforms.

CUDA Lazy Loading#

CUDA lazy loading is a CUDA feature that can significantly reduce the peak GPU and host memory usage of TensorRT and speed up TensorRT initialization with negligible (< 1%) performance impact. The memory usage and time-saving for initialization depend on the model, software stack, GPU platform, etc. It is enabled by setting the environment variable CUDA_MODULE_LOADING=LAZY. For more information, refer to the NVIDIA CUDA documentation.
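The variable is read when CUDA initializes, so it must be set before the first CUDA call in the process; exporting it in the shell before launching the application is the usual approach. A sketch of setting it programmatically instead (POSIX setenv; _putenv_s is the Windows counterpart):

#include <cstdlib>

int main()
{
    // Must run before CUDA (and therefore TensorRT) is first used in this process.
    setenv("CUDA_MODULE_LOADING", "LAZY", /*overwrite=*/1);

    // ... create the TensorRT builder/runtime and proceed as usual ...
    return 0;
}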

L2 Persistent Cache Management#

NVIDIA Ampere and later architectures support L2 cache persistence, a feature that allows prioritization of L2 cache lines for retention when a line is chosen for eviction. TensorRT can use this feature to retain activations in the L2 cache, reducing DRAM traffic and power consumption.

Cache allocation is per-execution context, enabled using the context’s setPersistentCacheLimit method. The total persistent cache among all contexts (and other components using this feature) should not exceed cudaDeviceProp::persistingL2CacheMaxSize. For more information, refer to the NVIDIA CUDA Best Practices Guide.
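A sketch of setting the limit while respecting the device cap; the helper name and the choice of the current CUDA device are illustrative:

#include <algorithm>

#include <cuda_runtime_api.h>

#include "NvInfer.h"

// Give this execution context a persistent L2 cache budget, capped at what the
// device supports. 'context' is assumed to be an existing IExecutionContext.
void enablePersistentCache(nvinfer1::IExecutionContext& context, size_t requestedBytes)
{
    int device{0};
    cudaGetDevice(&device);

    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);

    // persistingL2CacheMaxSize is 0 on devices without the feature.
    size_t const limit = std::min(requestedBytes, static_cast<size_t>(prop.persistingL2CacheMaxSize));
    context.setPersistentCacheLimit(limit);
}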

Threading#

TensorRT objects are generally not thread-safe; the client must serialize access to objects from different threads.

The expected runtime concurrency model is that different threads operate in different execution contexts. The context contains the network state (activation values and so on) during execution, so using a context concurrently in different threads results in undefined behavior.

To support this model, the following operations are thread-safe:

  • Nonmodifying operations on a runtime or engine.

  • Deserializing an engine from a TensorRT runtime.

  • Creating an execution context from an engine.

  • Registering and deregistering plugins.
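A sketch of this concurrency model, with one shared engine and one execution context per thread; per-thread streams, tensor bindings, and synchronization are elided:

#include <memory>
#include <thread>
#include <vector>

#include "NvInfer.h"

// Run inference from several threads against one engine. The engine is shared
// read-only; each thread owns its execution context.
void runConcurrently(nvinfer1::ICudaEngine& engine, int numThreads)
{
    std::vector<std::thread> workers;
    for (int i = 0; i < numThreads; ++i)
    {
        workers.emplace_back([&engine]
        {
            // Creating an execution context from an engine is thread-safe.
            auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
                engine.createExecutionContext());
            // ... set tensor addresses on this context and call enqueueV3() on a
            //     CUDA stream owned by this thread ...
        });
    }
    for (auto& w : workers)
    {
        w.join();
    }
}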

There are no thread-safety issues with using multiple builders in different threads; however, the builder uses timing to determine the fastest kernel for the parameters provided, and using multiple builders with the same GPU will perturb the timing and TensorRT’s ability to construct optimal engines. There are no such issues using multiple threads to build with different GPUs.

Determinism#

The TensorRT builder uses timing to find the fastest kernel to implement a given layer. Kernel timings are subject to noise, such as other work running on the GPU and fluctuations in GPU clock speed. Timing noise means the same implementation may not be selected on successive builder runs.

In general, different implementations will use a different order of floating point operations, resulting in small differences in the output. The impact of these differences on the final result is usually very small. However, when TensorRT is configured to optimize by tuning over multiple precisions, the difference between an FP16 and an FP32 kernel can be more significant, particularly if the network has not been well regularized or is otherwise sensitive to numerical drift.

Other configuration options that can result in a different kernel selection are different input sizes (for example, batch size) or a different optimization point for an input profile (refer to the Working With Dynamic Shapes section).

The Editable Timing Cache mechanism allows you to force the builder to pick a particular implementation for a given layer. You can use this to ensure the builder picks the same kernels from run to run. For more information, refer to the Algorithm Selection and Reproducible Builds section.
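A related, simpler way to improve run-to-run stability is to reuse a serialized timing cache across builds so that recorded tactic choices are replayed; a sketch, assuming the caller persists the blob produced by ITimingCache::serialize():

#include <memory>

#include "NvInfer.h"

// Load a previously serialized timing cache into the builder config so that a
// subsequent build reuses the recorded tactic choices. Passing nullptr/0 would
// create an empty cache instead.
std::unique_ptr<nvinfer1::ITimingCache> loadTimingCache(
    nvinfer1::IBuilderConfig& config, void const* blob, size_t size)
{
    auto cache = std::unique_ptr<nvinfer1::ITimingCache>(config.createTimingCache(blob, size));
    config.setTimingCache(*cache, /*ignoreMismatch=*/false);
    // The caller must keep the returned cache alive until the build completes.
    return cache;
}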

After an engine has been built, it is deterministic (except for IFillLayer and IScatterLayer, discussed below): providing the same input in the same runtime environment will produce the same output.

IFillLayer Determinism#

When IFillLayer is added to a network using the RANDOM_UNIFORM or RANDOM_NORMAL operations, the determinism guarantee above is no longer valid. On each invocation, these operations generate tensors based on the RNG state and then update the RNG state. This state is stored on a per-execution context basis.

IScatterLayer Determinism#

If IScatterLayer is added to a network, and the input tensor indices have duplicate entries, the determinism guarantee above is not valid for ScatterMode::kELEMENT and ScatterMode::kND modes. Additionally, one of the values from the input updates tensor will be picked arbitrarily.

Runtime Options#

TensorRT provides multiple runtime libraries to meet a variety of use cases. C++ applications that run TensorRT engines should link against one of the following:

  • The default runtime is the main library (libnvinfer.so/.dll).

  • The lean runtime library (libnvinfer_lean.so/.dll) is much smaller than the default library and contains only the code necessary to run a version-compatible engine. It has some restrictions; primarily, it cannot refit or serialize engines.

  • The dispatch runtime (libnvinfer_dispatch.so/.dll) is a small shim library that can load a lean runtime and redirect calls to it. The dispatch runtime can load older versions of the lean runtime and, together with the appropriate configuration of the builder, can be used to provide compatibility between a newer version of TensorRT and an older plan file. Using the dispatch runtime is almost the same as manually loading the lean runtime, but it checks that the loaded lean runtime implements the required APIs and performs some parameter mapping to support API changes where possible.

The lean runtime contains fewer operator implementations than the default runtime. Since TensorRT chooses operator implementations at build time, you must specify that the engine should be built for the lean runtime by enabling version compatibility. An engine built for the lean runtime may be slower than one built for the default runtime.

The lean runtime contains all the functionality of the dispatch runtime, and the default runtime contains all the functionality of the lean runtime.
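For example, an engine that the lean runtime can load is requested at build time through a builder flag; a sketch with network construction elided:

#include <memory>

#include "NvInfer.h"

// Build a version-compatible plan that can be deserialized by the lean runtime
// (possibly loaded through the dispatch runtime). 'builder' and 'network' are
// assumed to already exist.
std::unique_ptr<nvinfer1::IHostMemory> buildVersionCompatiblePlan(
    nvinfer1::IBuilder& builder, nvinfer1::INetworkDefinition& network)
{
    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder.createBuilderConfig());
    // Restrict the build to operator implementations available in the lean runtime.
    config->setFlag(nvinfer1::BuilderFlag::kVERSION_COMPATIBLE);
    return std::unique_ptr<nvinfer1::IHostMemory>(
        builder.buildSerializedNetwork(network, *config));
}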

TensorRT provides Python packages corresponding to each of the above libraries:

  • tensorrt - the Python interface for the default runtime.

  • tensorrt_lean - the Python interface for the lean runtime.

  • tensorrt_dispatch - the Python interface for the dispatch runtime.

Python applications that run TensorRT engines should import one of the above packages to load the appropriate library for their use case.

Compatibility#

By default, serialized engines are only guaranteed to work correctly when used with the same OS, CPU architectures, GPU models, and TensorRT versions used to serialize the engines. Refer to the Version Compatibility and Hardware Compatibility sections to relax the constraints on TensorRT versions and GPU models.