Version Compatibility#

By default, TensorRT engines are compatible only with the version of TensorRT used to build them. With appropriate build-time configuration, you can build engines that are compatible with later TensorRT versions.

For example, engines built with TensorRT 8 work with TensorRT 9 and TensorRT 10 runtimes. However, engines built with newer versions (TensorRT 9 or 10) do not work with older runtimes (TensorRT 8).

Note that version-compatible engines can be slower than engines built for a specific runtime version.

Version compatibility is supported starting with TensorRT 8.6: both the version used to build the plan and the runtime version must be 8.6 or later.

When using version compatibility, the API supported at runtime for an engine is the intersection of the API supported in the version used to build it and the API of the version used to run it. TensorRT removes APIs only on major version boundaries, so this is not a concern within a major version.

However, if you want to use TensorRT 8 or TensorRT 9 engines with TensorRT 10, you must migrate away from removed APIs. We recommend avoiding deprecated APIs.

The recommended approach to creating a version-compatible engine is to build as follows:

```cpp
builderConfig.setFlag(BuilderFlag::kVERSION_COMPATIBLE);
IHostMemory* plan = builder->buildSerializedNetwork(network, config);
```

```python
builder_config.set_flag(tensorrt.BuilderFlag.VERSION_COMPATIBLE)
plan = builder.build_serialized_network(network, config)
```

The request for a version-compatible engine causes a copy of the lean runtime to be added to the plan. When you deserialize the plan, TensorRT will recognize that it contains a runtime copy. It loads the runtime to deserialize and execute the rest of the plan. Because this results in code being loaded and run from the plan in the context of the owning process, you should only deserialize trusted plans this way. To indicate to TensorRT that you trust the plan, call:

```cpp
runtime->setEngineHostCodeAllowed(true);
```

```python
runtime.engine_host_code_allowed = True
```

The flag for trusted plans is also required if you are packaging plugins in the plan. For more information, refer to the Plugin Shared Libraries section.
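For quick experimentation without writing builder code, the same build-time flag is exposed by the `trtexec` tool. A hedged sketch (the model and engine file names are placeholders; `trtexec` must come from a TensorRT 8.6 or later release):

```shell
# Build a version-compatible engine; model.onnx is a placeholder path.
trtexec --onnx=model.onnx --versionCompatible --saveEngine=model.plan

# Later, load the same plan with a newer TensorRT release. Passing
# --versionCompatible at load time allows the embedded lean runtime
# (host code packaged in the plan) to be deserialized and used.
trtexec --loadEngine=model.plan --versionCompatible
```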

Manually Loading the Runtime#

The previous approach (Version Compatibility) packages a copy of the runtime with every plan, which can be prohibitive if your application uses many models. An alternative approach is to manage the runtime loading yourself. For this approach, build version-compatible plans as explained in the previous section, but also set an additional flag to exclude the lean runtime.

```cpp
builderConfig.setFlag(BuilderFlag::kVERSION_COMPATIBLE);
builderConfig.setFlag(BuilderFlag::kEXCLUDE_LEAN_RUNTIME);
IHostMemory* plan = builder->buildSerializedNetwork(network, config);
```

```python
builder_config.set_flag(tensorrt.BuilderFlag.VERSION_COMPATIBLE)
builder_config.set_flag(tensorrt.BuilderFlag.EXCLUDE_LEAN_RUNTIME)
plan = builder.build_serialized_network(network, config)
```

To run this plan, you must have access to the lean runtime for the version with which it was built. Suppose you have built the plan with TensorRT 8.6, and your application is linked against TensorRT 10. You can load the plan as follows.

```cpp
IRuntime* v10Runtime = createInferRuntime(logger);
IRuntime* v8ShimRuntime = v10Runtime->loadRuntime(v8RuntimePath);
ICudaEngine* engine = v8ShimRuntime->deserializeCudaEngine(v8plan);
```

```python
v10_runtime = tensorrt.Runtime(logger)
v8_shim_runtime = v10_runtime.load_runtime(v8_runtime_path)
engine = v8_shim_runtime.deserialize_cuda_engine(v8_plan)
```

The runtime will translate TensorRT 10 API calls for the TensorRT 8.6 runtime, checking to ensure that the call is supported and performing any necessary parameter remapping.

Loading from Storage#

TensorRT can load the shared runtime library directly from memory on most OSs. However, on Linux kernels before 3.17, a temporary directory is required. Use the IRuntime::setTempfileControlFlags and IRuntime::setTemporaryDirectory APIs to control TensorRT’s use of these mechanisms.

Using Version Compatibility with the ONNX Parser#

When building a version-compatible engine from a TensorRT network definition generated using TensorRT’s ONNX parser, you must specify that the parser use the native InstanceNormalization implementation instead of the plugin one.

To do this, use the IParser::setFlag() API.

```cpp
auto *parser = nvonnxparser::createParser(network, logger);
parser->setFlag(nvonnxparser::OnnxParserFlag::kNATIVE_INSTANCENORM);
```

```python
parser = trt.OnnxParser(network, logger)
parser.set_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)
```

In addition, the parser may require plugins to fully implement all ONNX operators used in the network. In particular, if the network is used to build a version-compatible engine, some plugins may need to be included with the engine (either serialized with it or provided externally and loaded explicitly).

To query the list of plugin libraries needed to implement a particular parsed network, use the IParser::getUsedVCPluginLibraries API:

```cpp
auto *parser = nvonnxparser::createParser(network, logger);
parser->setFlag(nvonnxparser::OnnxParserFlag::kNATIVE_INSTANCENORM);
parser->parseFromFile(filename, static_cast<int>(ILogger::Severity::kINFO));
int64_t nbPluginLibs;
char const* const* pluginLibs = parser->getUsedVCPluginLibraries(nbPluginLibs);
```

```python
parser = trt.OnnxParser(network, logger)
parser.set_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)

status = parser.parse_from_file(filename)
plugin_libs = parser.get_used_vc_plugin_libraries()
```

Refer to the Plugin Shared Libraries section for instructions on using the resulting library list to serialize the plugins or package them externally.

Hardware Compatibility#

By default, TensorRT engines are only compatible with the type of device where they were built. With build-time configuration, engines that are compatible with other types of devices can be built. Currently, hardware compatibility is not supported on NVIDIA DRIVE OS or JetPack.

NVIDIA Ampere GPU Architecture (and Later) Compatibility Level#

To build an engine compatible with all Ampere and newer architectures, configure the IBuilderConfig as follows:

```cpp
config->setHardwareCompatibilityLevel(nvinfer1::HardwareCompatibilityLevel::kAMPERE_PLUS);
```

When building in hardware compatibility mode, TensorRT excludes tactics that are not hardware compatible, such as those that use architecture-specific instructions or require more shared memory than is available on some devices. As a result, a hardware-compatible engine can have lower throughput and/or higher latency than its non-hardware-compatible counterpart. The degree of this performance impact depends on the network architecture and input sizes.

Same Compute Capability Compatibility Level#

TensorRT now supports a new kSAME_COMPUTE_CAPABILITY hardware compatibility level. This option allows you to build engines that are compatible with GPUs having the same Compute Capability as the one on which the engine was built.

To configure an engine for compatibility with GPUs of the same compute capability, use the following code:

```cpp
config->setHardwareCompatibilityLevel(nvinfer1::HardwareCompatibilityLevel::kSAME_COMPUTE_CAPABILITY);
```

This option ensures the engine is compatible with GPUs that have the same compute capability as the build device. While it can result in lower performance compared to an engine with no compatibility restrictions, it enables broader deployment across devices. Generally, kSAME_COMPUTE_CAPABILITY can provide better performance than kAMPERE_PLUS. It also allows usage of new hardware features like FP8 and FP4 precision.

Compatibility Checks#

TensorRT records the major, minor, patch, and build versions of the library used to create the plan in the plan itself. If these do not match the version of the runtime used to deserialize the plan, deserialization will fail. When using version compatibility, the check is performed by the lean runtime deserializing the plan data. By default, that lean runtime is included in the plan, so the match is guaranteed to succeed.

TensorRT also records the compute capability (major and minor versions) in the plan and checks it against the GPU on which the plan is being loaded. If they do not match, the plan will fail to deserialize. This ensures that kernels selected during the build phase are present and can run. When using hardware compatibility, the check is relaxed; with HardwareCompatibilityLevel::kAMPERE_PLUS, the check will ensure that the compute capability is greater than or equal to 8.0 rather than checking for an exact match.

TensorRT additionally checks the following properties and will issue a warning if they do not match, except when using hardware compatibility:

  • Global memory bus width

  • L2 cache size

  • Maximum shared memory per block and per multiprocessor

  • Texture alignment requirement

  • Number of multiprocessors

  • Whether the GPU device is integrated or discrete

If GPU clock speeds differ between the engine serialization and runtime systems, the tactics chosen by the serialization system may not be optimal for the runtime system and may incur some performance degradation.

If it is impractical to build a TensorRT engine for each type of GPU, you can select several representative GPUs to build engines with and run each engine on other GPUs of the same architecture. For example, among the NVIDIA RTX 40xx GPUs, you can build one engine with an RTX 4080 and another with an RTX 4060. At runtime, you can use the RTX 4080 engine on an RTX 4090 GPU and the RTX 4060 engine on an RTX 4070 GPU. In most cases, the engine will run without functional issues and with only a small performance drop compared to an engine built on the same GPU.

However, deserialization can fail if the engine requires a large amount of device memory and less memory is available at runtime than when the engine was built. In this case, it is recommended to build the engine on the smaller GPU, or to build it on the larger device with limited compute and memory resources.

In some cases, the safety runtime can deserialize engines generated in an environment where the major, minor, patch, and build versions of TensorRT do not match exactly. For more information, refer to the NVIDIA DRIVE OS Developer Guide and the NVIDIA TensorRT Safety Production Guide for DriveOS for any safety-related activities.