Precision Control

This page covers how TensorRT selects algorithms and how you control precision. It includes algorithm selection and reproducible builds, strongly typed networks, reduced precision in weakly typed networks, and control of computational precision.

Algorithm Selection and Reproducible Builds

By default, TensorRT’s optimizer chooses the algorithms that globally minimize the execution time of the engine. It does this by timing each implementation. When implementations have similar timings, system noise can determine which one is chosen on any particular run of the builder. Different implementations typically accumulate floating-point values in different orders, and two implementations can use different algorithms or even run at different precisions. As a result, different invocations of the builder typically do not produce engines that return bit-identical results.
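The effect of accumulation order can be illustrated without TensorRT. The following is a minimal sketch in plain Python (double precision), not TensorRT code: the two helper functions, `sum_sequential` and `sum_pairwise`, are hypothetical stand-ins for two implementations that reduce the same values in a different order.

```python
def sum_sequential(values):
    """Accumulate left to right, as a simple serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_pairwise(values):
    """Accumulate as a binary tree, as a parallel reduction might."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])


# Mixing very large and very small magnitudes makes rounding visible.
values = [1e16, 1.0, -1e16, 1.0]

sequential_result = sum_sequential(values)
pairwise_result = sum_pairwise(values)

# Same inputs, same mathematical sum, different floating-point results.
print(sequential_result, pairwise_result)
```

Both functions compute the same mathematical sum, yet each rounds at different intermediate steps, so the printed results differ. The same mechanism is one reason two engines built from the same network may not agree bit for bit.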

To reproduce the algorithm choices of an earlier build, you can use the editable timing cache. When building the engine for the first time, set the BuilderFlag::kEDITABLE_TIMING_CACHE flag to enable the editable cache, and enable and retain the logs and the cache file. The logs provide the name, key, available tactics, and selected tactic for each layer of the model. The cache file records the decisions made by TensorRT.

The next time you build the same engine, supply the same flags and use the ITimingCache::update interface to edit the cache, specifically to select tactics for some layers. Then pass the cache to TensorRT. During the build, TensorRT uses the newly assigned tactics. Unlike in earlier versions, only one tactic can be assigned to each layer.
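The idea behind this workflow can be sketched with a toy model. The code below does not call TensorRT and does not reflect its cache format: `pick_tactic`, `build`, the layer keys, the tactic names, and the timings are all hypothetical. It only shows why noisy timing can change a selection between builds and why pinning exactly one tactic per layer removes that variation.

```python
import random


def pick_tactic(timings_ms, noise_ms, rng):
    """Return the tactic with the lowest noisy measured time (toy model)."""
    measured = {
        tactic: ms + rng.uniform(-noise_ms, noise_ms)
        for tactic, ms in timings_ms.items()
    }
    return min(measured, key=measured.get)


def build(layers, pinned=None, noise_ms=0.05, seed=None):
    """Choose one tactic per layer; pinned entries bypass timing entirely."""
    rng = random.Random(seed)
    pinned = pinned or {}
    choices = {}
    for key, timings_ms in layers.items():
        if key in pinned:
            if pinned[key] not in timings_ms:
                raise ValueError(f"unknown tactic for layer {key!r}")
            choices[key] = pinned[key]  # exactly one tactic per layer
        else:
            choices[key] = pick_tactic(timings_ms, noise_ms, rng)
    return choices


# Hypothetical layers: conv1 has two near-identical tactics, gemm1 does not.
LAYERS = {
    "conv1": {"tactic_a": 1.00, "tactic_b": 1.01},
    "gemm1": {"tactic_c": 2.00, "tactic_d": 3.00},
}

# First build: record the decisions, as the cache file would.
first = build(LAYERS, seed=0)

# Later builds: reuse the recorded decisions regardless of timing noise.
rebuilt = [build(LAYERS, pinned=first, seed=s) for s in range(20)]
print(all(choices == first for choices in rebuilt))
```

In this model, unpinned builds can flip between the two `conv1` tactics because their timings differ by less than the noise, while pinned builds always repeat the recorded choice.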

Strongly Typed Networks

Since 11.0, TensorRT supports only strongly typed networks, meaning that the model must explicitly specify the precision of every tensor. You can do this, for example, with AutoCast or Quantization from Model-Optimizer, and TensorRT will adhere to the precision specifications. TensorRT still autotunes over different data layouts to find an optimal set of kernels for the network. Refer to the NVIDIA TensorRT Migration Guide if you are migrating from a weakly typed network.

TensorRT infers a type for each intermediate and output tensor using the rules in the operator type specification. TensorRT adheres to these inferred types while building the engine.

You can create a strongly typed network as follows:

// C++
IBuilder* builder = ...;
INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED));

# Python
builder = trt.Builder(...)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

For strongly typed networks, the layer APIs setPrecision and setOutputType are not permitted, nor are the builder precision flags kFP16, kBF16, kFP8, kINT8, kINT4, and kFP4. The builder flag kTF32 is permitted as it controls TF32 Tensor Core usage for FP32 types rather than controlling the use of TF32 data types.
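The distinction in the last sentence can be made concrete: with TF32, tensors are still stored as FP32, but multiplication inputs are reduced to TF32's 10 mantissa bits (FP32 has 23). The sketch below emulates that reduction in plain Python. It is an illustration only: `to_tf32_truncated` is a hypothetical helper, and it truncates the low bits as a simplifying assumption, which may not match the rounding performed by the hardware.

```python
import struct


def to_tf32_truncated(x):
    """Keep the sign, 8 exponent bits, and top 10 mantissa bits of an FP32 value.

    Assumption: the low 13 mantissa bits are simply cleared (truncation).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    bits &= 0xFFFFE000  # clear the 13 least significant mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 1 + 2**-10 fits in 10 mantissa bits, so it survives unchanged.
kept = to_tf32_truncated(1.0 + 2**-10)

# 1 + 2**-11 needs an 11th mantissa bit, so the extra bit is dropped.
dropped = to_tf32_truncated(1.0 + 2**-11)

print(kept, dropped)
```

The value is still held in a 32-bit float before and after the call, which mirrors the point above: the flag changes how FP32 data is computed on, not the data type of the tensors.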