Advanced Topics#

Version Compatibility#

By default, TensorRT engines are compatible only with the version of TensorRT with which they are built. With appropriate build-time configuration, engines can be built that are compatible with later TensorRT versions. TensorRT engines built with TensorRT 8 will also be compatible with TensorRT 9 and TensorRT 10 runtimes, but not vice versa. However, version-compatible engines may be slower than engines built for the default runtime.

Version compatibility is supported from version 8.6; the plan must be built with version 8.6 or later, and the runtime must be version 8.6 or later.

When using version compatibility, the API supported at runtime for an engine is the intersection of the API supported in the version with which it was built and the API of the version used to run it. TensorRT removes APIs only on major version boundaries, so this is not a concern within a major version. However, users wishing to use TensorRT 8 or TensorRT 9 engines with TensorRT 10 must migrate away from removed APIs and are advised to avoid the deprecated APIs.

The recommended approach to creating a version-compatible engine is to build as follows:

builderConfig.setFlag(BuilderFlag::kVERSION_COMPATIBLE);
IHostMemory* plan = builder->buildSerializedNetwork(network, config);

builder_config.set_flag(tensorrt.BuilderFlag.VERSION_COMPATIBLE)
plan = builder.build_serialized_network(network, config)

The request for a version-compatible engine causes a copy of the lean runtime to be added to the plan. When you deserialize the plan, TensorRT will recognize that it contains a runtime copy. It loads the runtime to deserialize and execute the rest of the plan. Because this results in code being loaded and run from the plan in the context of the owning process, you should only deserialize trusted plans this way. To indicate to TensorRT that you trust the plan, call:

runtime->setEngineHostCodeAllowed(true);

runtime.engine_host_code_allowed = True

The flag for trusted plans is also required if you are packaging plugins in the plan. For more information, refer to the Plugin Shared Libraries section.
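Putting these pieces together, a minimal Python sketch of building a version-compatible plan and later deserializing it as a trusted plan might look like the following; the ONNX path is a placeholder, and the network/config creation mirrors the snippets above:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Build a version-compatible plan ("model.onnx" is a placeholder path).
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
parser.parse_from_file("model.onnx")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)
plan = builder.build_serialized_network(network, config)

# Later, possibly with a newer TensorRT: mark the plan as trusted, then deserialize.
runtime = trt.Runtime(logger)
runtime.engine_host_code_allowed = True  # required because the plan carries the lean runtime
engine = runtime.deserialize_cuda_engine(plan)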

Manually Loading the Runtime#

The previous approach (Version Compatibility) packages a copy of the runtime with every plan, which can be prohibitive if your application uses many models. An alternative approach is to manage the runtime loading yourself. For this approach, build version-compatible plans as explained in the previous section, but also set an additional flag to exclude the lean runtime.

builderConfig.setFlag(BuilderFlag::kVERSION_COMPATIBLE);
builderConfig.setFlag(BuilderFlag::kEXCLUDE_LEAN_RUNTIME);
IHostMemory* plan = builder->buildSerializedNetwork(network, config);

builder_config.set_flag(tensorrt.BuilderFlag.VERSION_COMPATIBLE)
builder_config.set_flag(tensorrt.BuilderFlag.EXCLUDE_LEAN_RUNTIME)
plan = builder.build_serialized_network(network, config)

To run this plan, you must have access to the lean runtime for the version with which it was built. Suppose you have built the plan with TensorRT 8.6, and your application is linked against TensorRT 10. You can load the plan as follows.

IRuntime* v10Runtime = createInferRuntime(logger);
IRuntime* v8ShimRuntime = v10Runtime->loadRuntime(v8RuntimePath);
ICudaEngine* engine = v8ShimRuntime->deserializeCudaEngine(v8plan);

v10_runtime = tensorrt.Runtime(logger)
v8_shim_runtime = v10_runtime.load_runtime(v8_runtime_path)
engine = v8_shim_runtime.deserialize_cuda_engine(v8_plan)

The runtime will translate TensorRT 10 API calls for the TensorRT 8.6 runtime, checking to ensure that the call is supported and performing any necessary parameter remapping.

Loading from Storage#

TensorRT can load the shared runtime library directly from memory on most OSs. However, on Linux kernels before 3.17, a temporary directory is required. Use the IRuntime::setTempfileControlFlags and IRuntime::setTemporaryDirectory APIs to control TensorRT’s use of these mechanisms.
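A sketch of these controls in Python, assuming the snake_case properties tempfile_control_flags and temporary_directory and the TempfileControlFlag enum mirror the C++ APIs named above (treat the exact names as an assumption):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Restrict TensorRT to in-memory loading only (no temporary files on disk);
# the enum member name is assumed to mirror the C++ TempfileControlFlag value.
runtime.tempfile_control_flags = 1 << int(trt.TempfileControlFlag.ALLOW_IN_MEMORY_FILES)

# Alternatively, allow temporary files but place them in a directory you control.
# runtime.tempfile_control_flags = 1 << int(trt.TempfileControlFlag.ALLOW_TEMPORARY_FILES)
# runtime.temporary_directory = "/tmp/my_app_trt"  # placeholder path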

Using Version Compatibility with the ONNX Parser#

When building a version-compatible engine from a TensorRT network definition generated using TensorRT’s ONNX parser, you must specify that the parser use the native InstanceNormalization implementation instead of the plugin implementation.

To do this, use the IParser::setFlag() API.

auto *parser = nvonnxparser::createParser(network, logger);
parser->setFlag(nvonnxparser::OnnxParserFlag::kNATIVE_INSTANCENORM);

parser = trt.OnnxParser(network, logger)
parser.set_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)

In addition, the parser may require plugins to fully implement all ONNX operators used in the network. In particular, if the network is used to build a version-compatible engine, some plugins may need to be included (either serialized with the engine or provided externally and explicitly loaded).

To query the list of plugin libraries needed to implement a particular parsed network, use the IParser::getUsedVCPluginLibraries API:

auto *parser = nvonnxparser::createParser(network, logger);
parser->setFlag(nvonnxparser::OnnxParserFlag::kNATIVE_INSTANCENORM);
parser->parseFromFile(filename, static_cast<int>(ILogger::Severity::kINFO));
int64_t nbPluginLibs;
char const* const* pluginLibs = parser->getUsedVCPluginLibraries(nbPluginLibs);

parser = trt.OnnxParser(network, logger)
parser.set_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)
status = parser.parse_from_file(filename)
plugin_libs = parser.get_used_vc_plugin_libraries()

Refer to the Plugin Shared Libraries section for instructions on using the resulting library list to serialize the plugins or package them externally.
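For example, the returned list can be forwarded to the builder configuration so the plugins are serialized with the version-compatible plan; this sketch assumes plugins_to_serialize is the Python counterpart of IBuilderConfig::setPluginsToSerialize, and "model.onnx" is a placeholder path:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

parser = trt.OnnxParser(network, logger)
parser.set_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)
parser.parse_from_file("model.onnx")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)

# Package any VC plugin libraries the parser reports into the plan itself.
plugin_libs = parser.get_used_vc_plugin_libraries()
if plugin_libs:
    config.plugins_to_serialize = plugin_libs  # assumed Python property name

plan = builder.build_serialized_network(network, config)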

Hardware Compatibility#

By default, TensorRT engines are only compatible with the type of device where they were built. With build-time configuration, engines that are compatible with other types of devices can be built. Currently, hardware compatibility is supported only for Ampere and later device architectures and is not supported on NVIDIA DRIVE OS or JetPack.

For example, to build an engine compatible with all Ampere and newer architectures, configure the IBuilderConfig as follows:

config->setHardwareCompatibilityLevel(nvinfer1::HardwareCompatibilityLevel::kAMPERE_PLUS);
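The equivalent setting in Python, as a minimal sketch using the standard tensorrt bindings:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Allow the resulting engine to run on Ampere and all later GPU architectures.
config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS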

When building in hardware compatibility mode, TensorRT excludes tactics that are not hardware compatible, such as those that use architecture-specific instructions or require more shared memory than is available on some devices. Thus, a hardware-compatible engine may have lower throughput and/or higher latency than its non-hardware-compatible counterpart. The degree of this performance impact depends on the network architecture and input sizes.

Compatibility Checks#

TensorRT records the major, minor, patch, and build versions of the library used to create the plan in the plan. If these do not match the runtime version used to deserialize the plan, it will fail to deserialize. When using version compatibility, the check will be performed by the lean runtime deserializing the plan data. By default, that lean runtime is included in the plan, and the match is guaranteed to succeed.

TensorRT also records the compute capability (major and minor versions) in the plan and checks it against the GPU on which the plan is being loaded. If they do not match, the plan will fail to deserialize. This ensures that kernels selected during the build phase are present and can run. When using hardware compatibility, the check is relaxed; with HardwareCompatibilityLevel::kAMPERE_PLUS, the check will ensure that the compute capability is greater than or equal to 8.0 rather than checking for an exact match.

TensorRT additionally checks the following properties and will issue a warning if they do not match, except when using hardware compatibility:

  • Global memory bus width

  • L2 cache size

  • Maximum shared memory per block and multiprocessor

  • Texture alignment requirement

  • Number of multiprocessors

  • Whether the GPU device is integrated or discrete

If GPU clock speeds differ between engine serialization and runtime systems, the tactics chosen by the serialization system may not be optimal for the runtime system and may incur some performance degradation.

If it is impossible to build a TensorRT engine for each type of GPU, you can select several GPUs to build engines with and run the engine on different GPUs with the same architecture. For example, among the NVIDIA RTX 40xx GPUs, you can build an engine with RTX 4080 and an engine with RTX 4060. At runtime, you can use the RTX 4080 engine on an RTX 4090 GPU and the 4060 engine on an RTX 4070 GPU. In most cases, the engine will run without functional issues and with only a small performance drop compared to running the engine built with the same GPU.

However, deserialization may fail if the engine requires a large amount of device memory and the available memory is smaller than it was when the engine was built. In this case, it is recommended to build the engine on the smaller GPU or to build it on the larger device with limited compute resources.

The safety runtime can deserialize engines generated in an environment where the major, minor, patch, and build versions of TensorRT do not match exactly in some cases. For more information, refer to the NVIDIA DRIVE OS 6.5 Developer Guide.

Refitting an Engine#

TensorRT can refit an engine with new weights without having to rebuild it. However, the option to do so must be specified when building:

...
config->setFlag(BuilderFlag::kREFIT);
builder->buildSerializedNetwork(network, config);

Later, you can create a Refitter object:

ICudaEngine* engine = ...;
IRefitter* refitter = createInferRefitter(*engine, gLogger);

Then, update the weights. For example, to update a set of weights named Conv Layer Kernel Weights:

Weights newWeights = ...;
refitter->setNamedWeights("Conv Layer Kernel Weight",
                    newWeights);

The new weights should have the same count as the original weights used to build the engine. setNamedWeights returns false if something goes wrong, such as a wrong weights name or a change in the weights count.

You can use INetworkDefinition::setWeightsName() to name weights at build time - the ONNX parser uses this API to associate the weights with the names used in the ONNX model. Otherwise, TensorRT will name the weights internally based on the related layer names and weight roles.

You can also pass GPU weights to the refitter via:

Weights newBiasWeights = ...;
refitter->setNamedWeights("Conv Layer Bias Weight", newBiasWeights, TensorLocation::kDEVICE);

Because of how the engine is optimized, if you change some weights, you might have to supply some other weights, too. The interface can tell you what additional weights must be supplied.

This typically requires two calls to IRefitter::getMissingWeights, first to get the number of weights objects that must be supplied, and second to get their names.

int32_t const n = refitter->getMissingWeights(0, nullptr);
std::vector<const char*> weightsNames(n);
refitter->getMissingWeights(n, weightsNames.data());

You can supply the missing weights in any order:

for (int32_t i = 0; i < n; ++i)
    refitter->setNamedWeights(weightsNames[i], Weights{...});

The set of missing weights returned is complete, in the sense that supplying only the missing weights does not generate a need for any additional weights.

Once all the weights have been provided, you can update the engine:

bool success = refitter->refitCudaEngine();
assert(success);

If the refit returns false, check the log for a diagnostic; perhaps the issue is about weights that are still missing. There is also an async version, refitCudaEngineAsync, that can accept a stream parameter.

You can update the weights memory directly and then call refitCudaEngine or refitCudaEngineAsync in another iteration. If weights pointers need to be changed, call setNamedWeights to override the previous setting. Call unsetNamedWeights to unset previously set weights so that they will not be used in later refitting, and it becomes safe to release these weights.

After refitting is done, you can then delete the refitter:

delete refitter;

The engine behaves like it was built from a network updated with the new weights. After refitting the engine, the previously created execution context can continue to be used.

To view all refittable weights in an engine, use refitter->getAllWeights(...), which is similar to how getMissingWeights was used above.
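A compact Python sketch of the refit flow described above; the engine path, weight names, and NumPy arrays are placeholders, and the engine is assumed to have been built with BuilderFlag.REFIT:

import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("refittable.engine", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())

refitter = trt.Refitter(engine, logger)

# Update one set of weights; the name and shape are placeholders.
new_kernel = np.random.rand(64, 3, 3, 3).astype(np.float32)
assert refitter.set_named_weights("Conv Layer Kernel Weights", trt.Weights(new_kernel))

# Supply whatever else the optimizer now requires.
my_weight_lookup = {}  # placeholder: weight name -> numpy array with the original count
for name in refitter.get_missing_weights():
    refitter.set_named_weights(name, trt.Weights(my_weight_lookup[name]))

assert refitter.refit_cuda_engine()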

Weight-Stripping#

When refit is enabled, all the constant weights in the network can be updated after the engine is built. However, refitting the engine with new weights introduces a cost and a potential runtime impact. The inability to constant-fold weights may prevent the builder from performing some optimizations.

This cost is unavoidable when the weights with which the engine will be refitted are unknown at build time. However, in some scenarios, the weights are known. For example, you may use TensorRT as one of multiple back ends to execute an ONNX model and wish to avoid an additional copy of weights in the TensorRT plan.

The weight-stripping build configuration enables this scenario; when enabled, TensorRT enables refit only for constant weights that do not impact the builder’s ability to optimize and produce an engine with the same runtime performance as a non-refittable engine. Those weights are then omitted from the serialized engine, resulting in a small plan file that can be refitted at runtime using the weights from the ONNX model.

The trtexec tool provides the --stripWeights flag for building the weight-stripped engine. For more information, refer to the trtexec section.

The following steps show how to refit the weights for weight-stripped engines. When working with ONNX models, the ONNX parser library can perform the refit automatically. For more information, refer to the Refitting a Weight-Stripped Engine Directly from ONNX section.

  1. Set the corresponding builder flag to enable the weight-stripped build. Here, the kSTRIP_PLAN flag works with either kREFIT or kREFIT_IDENTICAL. It defaults to the latter. The kREFIT_IDENTICAL flag instructs the TensorRT builder to optimize under the assumption that the engine will be refitted with weights identical to those provided at build time. The kSTRIP_PLAN flag minimizes plan size by stripping out the refittable weights.

...
config->setFlag(BuilderFlag::kSTRIP_PLAN);
config->setFlag(BuilderFlag::kREFIT_IDENTICAL);
builder->buildSerializedNetwork(network, config);

config.flags |= 1 << int(trt.BuilderFlag.STRIP_PLAN)
config.flags |= 1 << int(trt.BuilderFlag.REFIT_IDENTICAL)
builder.build_serialized_network(network, config)
  2. After the engine is built, save the plan file and distribute it to the installer.

  3. On the client side, when you launch the network for the first time, update all the weights in the engine. Since all the weights in the engine plan were removed, use the getAllWeights API.

int32_t const n = refitter->getAllWeights(0, nullptr);

all_weights = refitter.get_all_weights()
  4. Update the weights one by one.

for (int32_t i = 0; i < n; ++i)
    refitter->setNamedWeights(weightsNames[i], Weights{...});

for name in wts_list:
    refitter.set_named_weights(name, weights[name])
  5. Save the full engine plan file.

auto serializationConfig = SampleUniquePtr<nvinfer1::ISerializationConfig>(cudaEngine->createSerializationConfig());
auto serializationFlag = serializationConfig->getFlags();
serializationFlag &= ~(1 << static_cast<uint32_t>(nvinfer1::SerializationFlag::kEXCLUDE_WEIGHTS));
serializationConfig->setFlags(serializationFlag);
auto hostMemory = SampleUniquePtr<nvinfer1::IHostMemory>(cudaEngine->serializeWithConfig(*serializationConfig));

serialization_config = engine.create_serialization_config()
serialization_config.flags &= ~(1 << int(trt.SerializationFlag.EXCLUDE_WEIGHTS))
binary = engine.serialize_with_config(serialization_config)

The application can now use the new full engine plan file for future inference.
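The client-side steps above can be combined into one short Python sketch; the file paths and the weight source are placeholders, and the stripped plan is assumed to have been built with the flags shown in step 1:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("stripped.plan", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())

refitter = trt.Refitter(engine, logger)

# weight_source is a placeholder for wherever the original weights live on the client.
weight_source = {}
for name in refitter.get_all_weights():
    refitter.set_named_weights(name, weight_source[name])
assert refitter.refit_cuda_engine()

# Re-serialize a full plan (weights included) for future runs.
serialization_config = engine.create_serialization_config()
serialization_config.flags &= ~(1 << int(trt.SerializationFlag.EXCLUDE_WEIGHTS))
with open("full.plan", "wb") as f:
    f.write(engine.serialize_with_config(serialization_config))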

Refitting a Weight-Stripped Engine Directly from ONNX#

When working with weight-stripped engines created from ONNX models, the refit process can be done automatically with the IParserRefitter class from the ONNX parser library. The following steps show how to create the class and run the refit process.

  1. Create your engine as described in Weight-Stripping, and create an IRefitter object.

IRefitter* refitter = createInferRefitter(*engine, gLogger);

refitter = trt.Refitter(engine, TRT_LOGGER)
  2. Create an IParserRefitter object.

IParserRefitter* parserRefitter = createParserRefitter(*refitter, gLogger);

parser_refitter = trt.OnnxParserRefitter(refitter, TRT_LOGGER)
  3. Call the refitFromFile() function of the IParserRefitter. Ensure that the ONNX model is identical to the one used to create the weight-stripped engine. This function will return true if all the stripped weights are found in the ONNX model; otherwise, it will return false.

bool result = parserRefitter->refitFromFile("path_to_onnx_model");

result = parser_refitter.refit_from_file("path_to_onnx_model")
  4. Call the refit function of the IRefitter to complete the refit process.

refitSuccess = refitter->refitCudaEngine();

refit_success = refitter.refit_cuda_engine()
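Putting the four steps above together, a minimal Python sketch (the engine and ONNX paths are placeholders):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("stripped.plan", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())

refitter = trt.Refitter(engine, logger)
parser_refitter = trt.OnnxParserRefitter(refitter, logger)

# The ONNX file must be identical to the one the stripped engine was built from.
assert parser_refitter.refit_from_file("model.onnx")
assert refitter.refit_cuda_engine()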

Weight-Stripping with the Lean Runtime#

Additionally, you can leverage the lean runtime to further reduce the package size for the weight-stripped engine. The lean runtime is the same runtime used in version-compatible engines. Its original purpose is to allow you to generate a TensorRT engine with version X and load it with an application built with version Y. The lean runtime library is relatively small, approximately 40 MiB. Therefore, software distributors on top of TensorRT only need to ship the weight-stripped engine along with the 40 MiB lean runtime when the weights are already available on the target customer machine.

The recommended approach to build the engine is as follows:

builderConfig.setFlag(BuilderFlag::kVERSION_COMPATIBLE);
builderConfig.setFlag(BuilderFlag::kEXCLUDE_LEAN_RUNTIME);
builderConfig.setFlag(BuilderFlag::kSTRIP_PLAN);
IHostMemory* plan = builder->buildSerializedNetwork(network, config);

builder_config.set_flag(tensorrt.BuilderFlag.VERSION_COMPATIBLE)
builder_config.set_flag(tensorrt.BuilderFlag.EXCLUDE_LEAN_RUNTIME)
builder_config.set_flag(tensorrt.BuilderFlag.STRIP_PLAN)
plan = builder.build_serialized_network(network, config)

Load the engine with the shared lean runtime library path:

runtime->loadRuntime("your_lean_runtime_full_path");

runtime.load_runtime("your_lean_runtime_full_path")

For more information about the lean runtime, refer to the Version Compatibility section.

Fine-Grained Refit Build#

When using the kREFIT builder configuration, all weights are marked as refittable. This is useful when it is difficult to distinguish between trainable and untrainable weights. However, marking all weights as refittable can lead to a performance trade-off. This is because certain optimizations are broken when weights are marked as refittable. For example, in the case of the GELU expression, TensorRT can encode all GELU coefficients in a single CUDA kernel. However, if all coefficients are marked as refittable, TensorRT may no longer be able to fuse the Conv-GELU operations into a single kernel. To address this, we have introduced the fine-grained refit API. This API provides precise control over which weights are marked as refittable, allowing for more efficient optimization.

Here is an example of marking weights as refittable in the INetworkDefinition:

...
network->setWeightsName(Weights(weights), "conv1_filter");
network->markWeightsRefittable("conv1_filter");
assert(network->areWeightsMarkedRefittable("conv1_filter"));

...
network.set_weights_name(conv_filter, "conv1_filter")
network.mark_weights_refittable("conv1_filter")
assert network.are_weights_marked_refittable("conv1_filter")

Later, we need to update the builder configuration like this:

...
config->setFlag(BuilderFlag::kREFIT_INDIVIDUAL);
builder->buildSerializedNetwork(network, config);

...
config.set_flag(trt.BuilderFlag.REFIT_INDIVIDUAL)
builder.build_serialized_network(network, config)

The remaining refit code follows the same steps as the refit-all-weights workflow.

Stripping Weights with Fine-Grained Refit Build#

The fine-grained refit build also works with the weights stripping flag. To run this, we must enable both builder flags in the code after marking the necessary weights as refittable.

Here is an example:

...
config->setFlag(BuilderFlag::kSTRIP_PLAN);
config->setFlag(BuilderFlag::kREFIT_INDIVIDUAL);
builder->buildSerializedNetwork(network, config);

config.flags |= 1 << int(trt.BuilderFlag.STRIP_PLAN)
config.flags |= 1 << int(trt.BuilderFlag.REFIT_INDIVIDUAL)
builder.build_serialized_network(network, config)

The remaining refit and inference code is the same as in the Weight-Stripping sections.

Algorithm Selection and Reproducible Builds#

The default behavior of TensorRT’s optimizer is to choose the algorithms that globally minimize the execution time of the engine. It does this by timing each implementation, and sometimes, when implementations have similar timings, system noise may determine which will be chosen on any particular run of the builder. Different implementations will typically use different orders of accumulation of floating point values, and two implementations may use different algorithms or even run at different precisions. Thus, different invocations of the builder will typically not result in engines that return bit-identical results.

Sometimes, it is important to have a deterministic build or to recreate an earlier build’s algorithm choices. In previous versions of TensorRT, these requirements were met by implementing IAlgorithmSelector. In current versions, the editable timing cache is used instead.

When the engine is being built for the first time, you supply the BuilderFlag::kEDITABLE_TIMING_CACHE flag to TensorRT to enable the editable cache. At the same time, you enable and retain the logs and cache files. The logs will provide the name, key, available tactics, and the selected tactic for each model layer. The cache file will record the decisions made by TensorRT.

The next time the same engine is built, supply the same flag to TensorRT and use the ITimingCache::update interface to edit the cache, selecting tactics for specific layers. Then, pass the cache to TensorRT. During the build, TensorRT will use the newly assigned tactics. Unlike the earlier mechanism, only one tactic can be assigned to each layer.
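A sketch of the save-and-reuse part of this workflow in Python, assuming the flag name mirrors BuilderFlag::kEDITABLE_TIMING_CACHE; the tactic-editing step itself (ITimingCache::update) is only indicated by a comment because its arguments depend on the layer keys and tactic IDs recorded in your logs:

import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)  # verbose logs record layer keys and chosen tactics
builder = trt.Builder(logger)
network = builder.create_network(0)
inp = network.add_input("x", trt.float32, (1, 3, 8, 8))
relu = network.add_activation(inp, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.EDITABLE_TIMING_CACHE)

# First build: start from an empty cache and save it afterwards.
cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)
plan = builder.build_serialized_network(network, config)
with open("timing.cache", "wb") as f:
    f.write(config.get_timing_cache().serialize())

# Later builds: reload the cache, optionally edit tactic choices for selected
# layers via the ITimingCache update interface, and build again deterministically.
with open("timing.cache", "rb") as f:
    cache = config.create_timing_cache(f.read())
config.set_timing_cache(cache, ignore_mismatch=False)
plan = builder.build_serialized_network(network, config)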

Strongly Typed Networks#

By default, TensorRT autotunes tensor types to generate the fastest engine. This can result in accuracy loss when model accuracy requires a layer to run with higher precision than TensorRT chooses. One approach is to use the ILayer::setPrecision and ILayer::setOutputType APIs to control a layer’s I/O types and, hence, its execution precision. This approach works, but figuring out which layers must be run at high precision to get the best accuracy can be challenging.

An alternative approach is to specify low precision use in the model, such as Automatic mixed precision training or quantization-aware training, and have TensorRT adhere to the precision specifications. TensorRT will still autotune over different data layouts to find an optimal set of kernels for the network.

When you specify to TensorRT that a network is strongly typed, it infers a type for each intermediate and output tensor using the rules in the operator type specification. Inferred types are adhered to while building the engine. As types are not autotuned, an engine built from a strongly typed network can be slower than one where TensorRT chooses tensor types. On the other hand, the build time may improve as fewer kernel alternatives are evaluated.

Strongly typed networks are not supported with DLA.

You can create a strongly typed network as follows:

IBuilder* builder = ...;
INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED));

builder = trt.Builder(...)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

For strongly typed networks, the layer APIs setPrecision and setOutputType are not permitted, nor are the builder precision flags kFP16, kBF16, kFP8, kINT8, kINT4, and kFP4. The builder flag kTF32 is permitted as it controls TF32 Tensor Core usage for FP32 types rather than controlling the use of TF32 data types.

Reduced Precision in Weakly-Typed Networks#

Network-Level Control of Precision#

By default, TensorRT works with 32-bit precision but can also execute operations using 16-bit and 8-bit quantized floating points. Using lower precision requires less memory and enables faster computation.

Reduced precision support depends on your hardware (refer to Hardware and Precision). You can query the builder to check the precision support on a platform:

if (builder->platformHasFastFp16()) { ... };

if builder.platform_has_fast_fp16:
    ...

Setting flags in the builder configuration informs TensorRT that it may select lower-precision implementations:

config->setFlag(BuilderFlag::kFP16);

config.set_flag(trt.BuilderFlag.FP16)

There are three precision flags: FP16, INT8, and TF32, and they may be enabled independently. TensorRT will still choose a higher-precision kernel if it results in a lower runtime or if no low-precision implementation exists.

When TensorRT chooses a precision for a layer, it automatically converts weights as necessary to run the layer.

While using FP16 and TF32 precisions is relatively straightforward, working with INT8 adds additional complexity. For more information, refer to the Working with Quantized Types section.

Note that even if the precision flags are enabled, the engine’s input/output bindings default to FP32. Refer to the I/O Formats section for information on how to set the data types and formats of the input/output bindings.

Layer-Level Control of Precision#

The builder flags provide permissive, coarse-grained control. However, sometimes, part of a network requires a higher dynamic range or is sensitive to numerical precision. You can constrain the input and output types per layer:

layer->setPrecision(DataType::kFP16);

layer.precision = trt.float16

This provides a preferred type (here, DataType::kFP16) for the inputs and outputs.

You may further set preferred types for the layer’s outputs:

layer->setOutputType(out_tensor_index, DataType::kFLOAT);

layer.set_output_type(out_tensor_index, trt.float32)

The computation will use the same floating-point type as the inputs. Most TensorRT implementations have the same floating-point types for input and output; however, Convolution, Deconvolution, and FullyConnected can support quantized INT8 input and unquantized FP16 or FP32 output, as sometimes working with higher-precision outputs from quantized inputs is necessary to preserve accuracy.

Setting the precision constraint hints to TensorRT that it should select a layer implementation whose inputs and outputs match the preferred types, inserting reformat operations if the outputs of the previous layer and the inputs to the next layer do not match the requested types. Note that TensorRT will only be able to select an implementation with these types if they are also enabled using the flags in the builder configuration.

By default, TensorRT chooses such an implementation only if it results in a higher-performance network. If another implementation is faster, TensorRT will use it and issue a warning. You can override this behavior by preferring the type constraints in the builder configuration.

config->setFlag(BuilderFlag::kPREFER_PRECISION_CONSTRAINTS);

config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)

If the constraints are preferred, TensorRT obeys them unless there is no implementation with the preferred precision constraints, in which case it issues a warning and uses the fastest available implementation.

To change the warning to an error, use OBEY instead of PREFER:

config->setFlag(BuilderFlag::kOBEY_PRECISION_CONSTRAINTS);

config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

sampleINT8API illustrates the use of reduced precision with these APIs.

Precision constraints are optional - you can query whether a constraint has been set using layer->precisionIsSet() in C++ or layer.precision_is_set in Python. If a precision constraint is not set, the result returned from layer->getPrecision() in C++ or reading the precision attribute in Python is not meaningful. Output type constraints are similarly optional.

If no constraints are set using the ILayer::setPrecision or ILayer::setOutputType APIs, then BuilderFlag::kPREFER_PRECISION_CONSTRAINTS and BuilderFlag::kOBEY_PRECISION_CONSTRAINTS are ignored, and the layer is free to choose precision and output types from the allowed builder precisions.

Note that the ITensor::setType() API does not set the precision constraint of a tensor unless it is one of the input/output tensors of the network. Also, there is a distinction between layer->setOutputType() and layer->getOutput(i)->setType(). The former is an optional type constraining the implementation TensorRT will choose for a layer. The latter specifies the type of a network’s input/output and is ignored if the tensor is not a network input/output. If they are different, TensorRT will insert a cast to ensure that both specifications are respected. Thus, if you call setOutputType() for a layer that produces a network output, you should generally configure the corresponding network output to have the same type.
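A small self-contained Python sketch tying these pieces together: a layer is constrained to FP16, the corresponding network output type is set to match, and the constraints are made mandatory (the network itself is a trivial placeholder):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)

inp = network.add_input("x", trt.float32, (1, 3, 224, 224))
relu = network.add_activation(inp, trt.ActivationType.RELU)

# Constrain the layer to run in FP16 and produce an FP16 output ...
relu.precision = trt.float16
relu.set_output_type(0, trt.float16)

# ... and keep the network output type consistent with the layer's output type.
out = relu.get_output(0)
network.mark_output(out)
out.dtype = trt.float16

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # the constrained type must also be enabled here
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
plan = builder.build_serialized_network(network, config)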

TF32#

TensorRT allows the use of TF32 Tensor Cores by default. When computing inner products, such as during convolution or matrix multiplication, TF32 execution does the following:

  • Rounds the FP32 multiplicands to FP16 precision but keeps the FP32 dynamic range.

  • Computes an exact product of the rounded multiplicands.

  • Accumulates the products in an FP32 sum.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models that require an HDR (high dynamic range) for weights or activations.

There is no guarantee that TF32 Tensor Cores are used, and there is no way to force the implementation to use them - TensorRT can fall back to FP32 at any time and always falls back if the platform does not support TF32. However, you can disable their use by clearing the TF32 builder flag.

config->clearFlag(BuilderFlag::kTF32);

config.clear_flag(trt.BuilderFlag.TF32)

Setting the environment variable NVIDIA_TF32_OVERRIDE=0 when building an engine disables the use of TF32 despite setting BuilderFlag::kTF32. When set to 0, this environment variable overrides any defaults or programmatic configuration of NVIDIA libraries, so they never accelerate FP32 computations with TF32 Tensor Cores. This is meant to be a debugging tool only, and no code outside NVIDIA libraries should change the behavior based on this environment variable. Any other setting besides 0 is reserved for future use.

Warning

Setting the environment variable NVIDIA_TF32_OVERRIDE to a different value when running the engine can cause unpredictable precision/performance effects. It is best left unset when an engine is run.

Note

Unless your application requires the higher dynamic range provided by TF32, FP16 will be a better solution since it almost always yields faster performance.

BF16#

TensorRT supports the bfloat16 (brain float) floating point format on NVIDIA Ampere and later architectures. Like other precisions, it can be enabled using the corresponding builder flag:

config->setFlag(BuilderFlag::kBF16);

config.set_flag(trt.BuilderFlag.BF16)

Note that not all layers support bfloat16. For more information, refer to the TensorRT Operator documentation.

Control of Computational Precision#

Sometimes, it is desirable to control the internal precision of the computation in addition to setting the input and output precisions for an operator. TensorRT selects the computational precision by default based on the layer input type and global performance considerations.

There are two layers where TensorRT provides additional capabilities to control computational precision:

The INormalizationLayer provides a setPrecision method to control the precision of accumulation. By default, to avoid overflow errors, TensorRT accumulates in FP32, even in mixed precision mode, regardless of builder flags. You can use this method to specify FP16 accumulation instead.

For the IMatrixMultiplyLayer, TensorRT, by default, selects accumulation precision based on the input types and performance considerations. However, the accumulation type is guaranteed to have a range at least as great as the input types. When using strongly typed mode, you can enforce FP32 precision for FP16 GEMMs by casting the inputs to FP32. TensorRT recognizes this pattern and fuses the casts with the GEMM, resulting in a single kernel with FP16 inputs and FP32 accumulation.

Creating a Graph for FP32 Accumulation Request
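A sketch of the cast pattern described for IMatrixMultiplyLayer, in a strongly typed network: FP16 inputs are cast to FP32 before the GEMM, which TensorRT can recognize and fuse into an FP16-input, FP32-accumulation kernel. Shapes and tensor names are placeholders:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))

a = network.add_input("a", trt.float16, (32, 64))
b = network.add_input("b", trt.float16, (64, 128))

# Cast the FP16 operands to FP32 so the GEMM accumulates in FP32.
a32 = network.add_cast(a, trt.float32).get_output(0)
b32 = network.add_cast(b, trt.float32).get_output(0)

mm = network.add_matrix_multiply(a32, trt.MatrixOperation.NONE,
                                 b32, trt.MatrixOperation.NONE)

# Cast back to FP16 if a half-precision result is desired downstream.
out = network.add_cast(mm.get_output(0), trt.float16).get_output(0)
network.mark_output(out)

config = builder.create_builder_config()
plan = builder.build_serialized_network(network, config)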

I/O Formats#

TensorRT optimizes a network using many different data formats. To allow efficient data passing between TensorRT and a client application, these underlying data formats are exposed at network I/O boundaries, for Tensors marked as network input or output, and when passing data to and from plugins. For other tensors, TensorRT picks formats that result in the fastest overall execution and may insert reformats to improve performance.

You can assemble an optimal data pipeline by profiling the available I/O formats in combination with the formats most efficient for the operations preceding and following TensorRT.

To specify I/O formats, set one or more formats as a bitmask.

The following example sets the input tensor format to TensorFormat::kHWC8. Note that this format only works for DataType::kHALF, so the data type must be set accordingly.

auto formats = 1U << TensorFormat::kHWC8;
network->getInput(0)->setAllowedFormats(formats);
network->getInput(0)->setType(DataType::kHALF);

formats = 1 << int(tensorrt.TensorFormat.HWC8)
network.get_input(0).allowed_formats = formats
network.get_input(0).dtype = tensorrt.DataType.HALF

Note that calling setAllowedFormats() or setType() on a tensor that is not a network input/output has no effect and is ignored by TensorRT.

sampleIOFormats illustrates how to specify I/O formats using C++.

The following table shows the supported formats.

Supported I/O Formats#

| Format | kINT32 | kFLOAT | kHALF | kINT8 | kBOOL | kUINT8 | kINT64 | BF16 | FP8 | FP4/INT4 |
|---|---|---|---|---|---|---|---|---|---|---|
| kLINEAR | Only for GPU | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| kCHW2 | No | No | Only for GPU | No | No | No | No | Yes | No | No |
| kCHW4 | No | No | Yes | Yes | No | No | No | Yes | No | No |
| kHWC8 | No | No | Only for GPU | No | No | No | No | Only for GPU | No | No |
| kCHW16 | No | No | Yes | No | No | No | No | No | No | No |
| kCHW32 | No | Only for GPU | Only for GPU | Yes | No | No | No | No | No | No |
| kDHWC8 | No | No | Only for GPU | No | No | No | No | Only for GPU | No | No |
| kCDHW32 | No | No | Only for GPU | Only for GPU | No | No | No | No | No | No |
| kHWC | No | Only for GPU | No | No | No | Yes | No | No | No | No |
| kDLA_LINEAR | No | No | Only for DLA | Only for DLA | No | No | No | No | No | No |
| kDLA_HWC4 | No | No | Only for DLA | Only for DLA | No | No | No | No | No | No |
| kHWC16 | No | No | Only for NVIDIA Ampere GPUs and later | Only for GPU | No | No | No | No | Only for GPU | No |
| kDHWC | No | Only for GPU | No | No | No | No | No | No | No | No |

Note that for the vectorized formats, the channel dimension must be zero-padded to the multiple of the vector size. For example, if an input binding has dimensions of [16,3,224,224], kHALF data type, and kHWC8 format, then the actual required size of the binding buffer would be 16*8*224*224*sizeof(half) bytes, even though the engine->getBindingDimension() API will return tensor dimensions as [16,3,224,224]. The values in the padded part (that is, where C=3,4,…,7 in this example) must be filled with zeros.
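The padding rule can be made concrete with a small helper; this sketch simply reproduces the arithmetic from the example above for vectorized formats:

import numpy as np

def vectorized_buffer_size(shape, channel_axis, vector_size, dtype):
    """Size in bytes of an I/O buffer whose channel axis is padded to the vector size."""
    padded = list(shape)
    padded[channel_axis] = -(-shape[channel_axis] // vector_size) * vector_size  # round up
    return int(np.prod(padded)) * np.dtype(dtype).itemsize

# [16,3,224,224] in kHWC8/kHALF: C is padded from 3 to 8,
# so the buffer needs 16*8*224*224*2 bytes.
print(vectorized_buffer_size((16, 3, 224, 224), channel_axis=1, vector_size=8, dtype=np.float16))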

Refer to the Data Format Descriptions section for how the data are laid out in memory for these formats.

Sparsity#

NVIDIA Ampere Architecture GPUs support Structured Sparsity. To use this feature and achieve higher inference performance, the weights must have at least two zeros in every four-entry vector. For TensorRT, the requirements are:

  • For Convolution, for each output channel and each spatial pixel in the kernel weights, every four input channels must have at least two zeros. In other words, assuming that the kernel weights have the shape [K, C, R, S] and C % 4 == 0, then the requirement is verified using the following algorithm:

    hasSparseWeights = True
    for k in range(0, K):
        for r in range(0, R):
            for s in range(0, S):
                for c_packed in range(0, C // 4):
                    if numpy.count_nonzero(weights[k, c_packed*4:(c_packed+1)*4, r, s]) > 2:
                        hasSparseWeights = False
    
  • For MatrixMultiply, of which Constant produces an input, every four elements of the reduction axis (K) must have at least two zeros.

Polygraphy (polygraphy inspect sparsity) can detect whether the operation weights in an ONNX model follow the 2:4 structured sparsity pattern.
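For illustration, here is a naive NumPy check-and-prune sketch for convolution weights: it zeros the two smallest-magnitude values in every group of four input channels so the 2:4 pattern is satisfied. This is not the Automatic Sparsity (ASP) tool and does nothing to preserve accuracy:

import numpy as np

def prune_2_4(weights):
    """Force [K, C, R, S] convolution weights into the 2:4 structured-sparsity pattern
    by zeroing the two smallest-magnitude values in every group of four input channels."""
    K, C, R, S = weights.shape
    assert C % 4 == 0
    groups = weights.transpose(0, 2, 3, 1).reshape(-1, 4).copy()  # each row is 4 input channels
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]              # two smallest per row
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(K, R, S, C).transpose(0, 3, 1, 2)

w = np.random.randn(8, 16, 3, 3).astype(np.float32)
w_sparse = prune_2_4(w)
rows = w_sparse.transpose(0, 2, 3, 1).reshape(-1, 4)
assert (np.count_nonzero(rows, axis=1) <= 2).all()  # every 4-channel group has >= 2 zeros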

To enable the sparsity feature, set the kSPARSE_WEIGHTS flag in the builder config and make sure that kFP16 or kINT8 modes are enabled. For example:

config->setFlag(BuilderFlag::kSPARSE_WEIGHTS);
config->setFlag(BuilderFlag::kFP16);
config->setFlag(BuilderFlag::kINT8);

config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)

At the end of the TensorRT logs, when the TensorRT engine is built, TensorRT reports which layers contain weights that meet the structured sparsity requirement and in which layers TensorRT selected tactics that use structured sparsity. Sometimes, tactics with structured sparsity can be slower than normal tactics, in which case TensorRT will choose the normal tactics. The following output shows an example of TensorRT logs with information about sparsity:

[03/23/2021-00:14:05] [I] [TRT] (Sparsity) Found 3 layer(s) eligible to use sparse tactics: conv1, conv2, conv3
[03/23/2021-00:14:05] [I] [TRT] (Sparsity) Chose 2 layer(s) using sparse tactics: conv2, conv3

Forcing kernel weights to have structured sparsity patterns can lead to accuracy loss. Refer to the Automatic Sparsity tool in PyTorch section to recover lost accuracy with further fine-tuning.

To measure inference performance with structured sparsity using trtexec, refer to the trtexec section.

Empty Tensors#

TensorRT supports empty tensors. A tensor is an empty tensor if it has one or more dimensions with a length of zero. Zero-length dimensions usually get no special treatment. If a rule works for a dimension of length L for an arbitrary positive value of L, it usually works for L=0, too.

For example, when concatenating two tensors with dimensions [x,y,z] and [x,y,w] along the last axis, the result has dimensions [x,y,z+w], regardless of whether x, y, z, or w is zero.

Implicit broadcast rules remain unchanged since only unit-length dimensions are special for broadcast. For example, given two tensors with dimensions [1,y,z] and [x,1,z], their sum computed by IElementWiseLayer has dimensions [x,y,z], regardless of whether x, y, or z is zero.

If an engine binding is an empty tensor, it still needs a non-null memory address, and different tensors should have different addresses. This is consistent with the C++ rule that every object has a unique address. For example, new float[0] returns a non-null pointer. If using a memory allocator that might return a null pointer for zero bytes, ask for at least one byte instead.
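For example, when sizing device allocations for bindings, you can clamp zero-byte requests to one byte so that empty tensors still get distinct, non-null addresses; a tiny sketch of that rule:

import numpy as np

def binding_alloc_size(shape, dtype):
    """Bytes to request for a binding; at least 1 so empty tensors get a non-null address."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return max(nbytes, 1)

print(binding_alloc_size((8, 0, 128), np.float32))  # zero-length dimension -> ask for 1 byte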

Refer to the TensorRT Operator documentation for any special handling of empty tensors per layer.

Reusing Input Buffers#

TensorRT allows specifying a CUDA event to be signaled once the input buffers are free to be reused. This allows the application to immediately refill the input buffer region for the next inference in parallel with finishing the current inference. For example:

context->setInputConsumedEvent(&inputReady);

context.set_input_consumed_event(inputReady)

Engine Inspector#

TensorRT provides the IEngineInspector API to inspect the information inside a TensorRT engine. Call the createEngineInspector() from a deserialized engine to create an engine inspector, and then call getLayerInformation() or getEngineInformation() inspector APIs to get the information of a specific layer in the engine or the entire engine, respectively. You can print out the information of the first layer of the given engine, as well as the overall information of the engine, as follows:

auto inspector = std::unique_ptr<IEngineInspector>(engine->createEngineInspector());
inspector->setExecutionContext(context); // OPTIONAL
std::cout << inspector->getLayerInformation(0, LayerInformationFormat::kJSON); // Print the information of the first layer in the engine.
std::cout << inspector->getEngineInformation(LayerInformationFormat::kJSON); // Print the information of the entire engine.

inspector = engine.create_engine_inspector()
inspector.execution_context = context # OPTIONAL
print(inspector.get_layer_information(0, LayerInformationFormat.JSON)) # Print the information of the first layer in the engine.
print(inspector.get_engine_information(LayerInformationFormat.JSON)) # Print the information of the entire engine.

Note that the level of detail in the engine/layer information depends on the ProfilingVerbosity builder config setting when the engine is built. By default, ProfilingVerbosity is set to kLAYER_NAMES_ONLY, so only the layer names will be printed. If ProfilingVerbosity is set to kNONE, then no information will be printed; if it is set to kDETAILED, then detailed information will be printed.
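For example, to get detailed per-layer JSON, set the verbosity at build time and iterate over the layers at inspection time; this Python sketch builds a trivial placeholder network so it is self-contained:

import json
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
inp = network.add_input("x", trt.float32, (1, 3, 8, 8))
relu = network.add_activation(inp, trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))

config = builder.create_builder_config()
config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED  # keep detailed layer info in the engine

plan = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(plan)

inspector = engine.create_engine_inspector()
for i in range(engine.num_layers):
    layer_json = json.loads(inspector.get_layer_information(i, trt.LayerInformationFormat.JSON))
    print(layer_json["Name"], layer_json.get("LayerType"))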

Below are some examples of layer information printed by getLayerInformation() API depending on the ProfilingVerbosity setting:

1"node_of_gpu_0/res4_0_branch2a_1 + node_of_gpu_0/res4_0_branch2a_bn_1 + node_of_gpu_0/res4_0_branch2a_bn_2"
 1{
 2    "Name": "node_of_gpu_0/res4_0_branch2a_1 + node_of_gpu_0/res4_0_branch2a_bn_1 + node_of_gpu_0/res4_0_branch2a_bn_2",
 3    "LayerType": "CaskConvolution",
 4    "Inputs": [
 5    {
 6        "Name": "gpu_0/res3_3_branch2c_bn_3",
 7        "Dimensions": [16,512,28,28],
 8        "Format/Datatype": "Thirty-two wide channel vectorized row major Int8 format."
 9    }],
10    "Outputs": [
11    {
12        "Name": "gpu_0/res4_0_branch2a_bn_2",
13        "Dimensions": [16,256,28,28],
14        "Format/Datatype": "Thirty-two wide channel vectorized row major Int8 format."
15    }],
16    "ParameterType": "Convolution",
17    "Kernel": [1,1],
18    "PaddingMode": "kEXPLICIT_ROUND_DOWN",
19    "PrePadding": [0,0],
20    "PostPadding": [0,0],
21    "Stride": [1,1],
22    "Dilation": [1,1],
23    "OutMaps": 256,
24    "Groups": 1,
25    "Weights": {"Type": "Int8", "Count": 131072},
26    "Bias": {"Type": "Float", "Count": 256},
27    "AllowSparse": 0,
28    "Activation": "RELU",
29    "HasBias": 1,
30    "HasReLU": 1,
31    "TacticName": "sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize256x128x64_stage4_warpsize4x2x1_g1_tensor16x8x32_simple_t1r1s1_epifadd",
32    "TacticValue": "0x11bde0e1d9f2f35d"
33    }

In addition, when the engine is built with dynamic shapes, the dynamic dimensions in the engine information will be shown as -1, and the tensor format information will not be shown because these fields depend on the actual shape at the inference phase. To get the engine information for a specific inference shape, create an IExecutionContext, set all the input dimensions to the desired shapes, and then call inspector->setExecutionContext(context). After the context is set, the inspector will print the engine information for the specific shape set in the context.

The trtexec tool provides the --profilingVerbosity, --dumpLayerInfo, and --exportLayerInfo flags for getting engine information for a given engine. Refer to the trtexec section for more details.

Currently, only binding information and layer information, including the dimensions of the intermediate tensors, precisions, formats, tactic indices, layer types, and layer parameters, are included in the engine information. In future TensorRT versions, more information may be added to the engine inspector output as new keys in the output JSON object. More specifications about the keys and the fields in the inspector output will also be provided.

In addition, some subgraphs are handled by a next-generation graph optimizer that is not yet integrated with the engine inspector. Therefore, the layer information within these layers is not yet shown. This will be improved in a future version of TensorRT.

Engine Graph Visualization with Nsight Deep Learning Designer#

When detailed TensorRT engine layer information is exported to a JSON file with the --exportLayerInfo option, the engine’s computation graph may be visualized with Nsight Deep Learning Designer. Open the application, and from the File menu, select Open File, then choose the .trt.json file containing the exported metadata.

Layers in a TensorRT Engine Generated from an Object Detection Network

The Layer Explorer window allows you to search for a particular layer or explore the layers in the network. The Parameter Editor window lets you view the selected layer’s metadata.

Optimizer Callbacks#

The optimizer callback API feature allows you to monitor the progress of the TensorRT build process, for example, to provide user feedback in interactive applications. To enable progress monitoring, create an object that implements the IProgressMonitor interface, then attach it to the IBuilderConfig, for example:

builderConfig->setProgressMonitor(&monitor);

builder_config.progress_monitor = monitor

Optimization is divided into hierarchically nested phases, each consisting of several steps. At the start of each phase, the phaseStart() method of IProgressMonitor is called, telling you the phase name and how many steps it has. The stepComplete() function is called when each step completes, and phaseFinish() is called when the phase finishes.

Returning false from stepComplete() cleanly forces the build to terminate early.
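A minimal Python sketch of such a monitor, assuming the Python bindings expose the C++ callbacks named above as snake_case methods of trt.IProgressMonitor and that IBuilderConfig exposes the progress_monitor property:

import tensorrt as trt

class ConsoleProgressMonitor(trt.IProgressMonitor):
    def __init__(self):
        trt.IProgressMonitor.__init__(self)

    def phase_start(self, phase_name, parent_phase, num_steps):
        print(f"start {phase_name} ({num_steps} steps)")

    def step_complete(self, phase_name, step):
        print(f"  {phase_name}: step {step} done")
        return True  # return False to cleanly abort the build

    def phase_finish(self, phase_name):
        print(f"finish {phase_name}")

# Attach it to the builder configuration before building.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.progress_monitor = ConsoleProgressMonitor()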

Preview Features#

The preview feature API is an extension of IBuilderConfig that allows the gradual introduction of new features to TensorRT. Selected new features are exposed under this API, allowing you to opt in or out. A preview feature remains in preview status for one or two TensorRT release cycles and is then either integrated as a mainstream feature or dropped. When a preview feature is fully integrated into TensorRT, it is no longer controllable through the preview API.

Preview features are defined using a 32-bit PreviewFeature enumeration. Feature identifiers are a concatenation of the feature name and the TensorRT version:

<FEATURE_NAME>_XXYY

XX and YY are the major and minor versions of the TensorRT release, respectively, which first introduced the feature. The major and minor versions are specified using two digits with leading-zero padding when necessary.

Suppose the semantics of a preview feature change from one TensorRT release to another. In that case, the older preview feature is deprecated, and the revised feature is assigned a new enumeration value and name.

Deprecated preview features are marked per the deprecation policy.

For more information about the C++ API, refer to nvinfer1::PreviewFeature, IBuilderConfig::setPreviewFeature, and IBuilderConfig::getPreviewFeature.

The Python API has similar semantics, using the PreviewFeature enum and the set_preview_feature and get_preview_feature functions.

Debug Tensors#

The debug tensor feature allows you to inspect intermediate tensors as the network executes. There are a few key differences between using debug tensors and marking all required tensors as outputs:

  1. Marking all tensors as outputs requires you to provide memory to store tensors in advance, while debug tensors can be turned off during runtime if unneeded.

  2. When debug tensors are turned off, the performance impact on the execution of the network is minimized.

  3. For a debug tensor in a loop, values are emitted every time it is written.

To enable this feature, perform the following steps:

  1. Mark the target tensors before the network is compiled.

networkDefinition->markDebug(&tensor);

network.mark_debug(tensor)
  2. Define a DebugListener class deriving from IDebugListener and implement the virtual function for processing the tensor.

virtual void processDebugTensor(
    void const* addr,
    TensorLocation location,
    DataType type,
    Dims const& shape,
    char const* name,
    cudaStream_t stream) = 0;

process_debug_tensor(self, addr, location, type, shape, name, stream)

When the function is invoked during execution, the debug tensor is passed via the parameters:

location: TensorLocation of the tensor
addr: pointer to the buffer
type: data type of the tensor
shape: shape of the tensor
name: name of the tensor
stream: CUDA stream object

The data will be in linear format.

  3. Attach your listener to IExecutionContext.

executionContext->setDebugListener(&debugListener);

execution_context.set_debug_listener(debug_listener)
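A sketch combining the steps above in Python; the listener signature follows the one shown in step 2, while the names of the attach call (set_debug_listener) and the per-tensor toggle (set_debug_state) are assumed to mirror the C++ API, and "my_tensor" is a placeholder tensor name:

import tensorrt as trt

class PrintingDebugListener(trt.IDebugListener):
    def __init__(self):
        trt.IDebugListener.__init__(self)

    def process_debug_tensor(self, addr, location, type, shape, name, stream):
        # The data at addr is in linear format; copy it off the given stream before use.
        print(f"debug tensor {name}: shape={shape}, dtype={type}, location={location}")

# After building an engine whose network marked "my_tensor" with mark_debug():
# context = engine.create_execution_context()
# listener = PrintingDebugListener()
# context.set_debug_listener(listener)         # assumed Python name for setDebugListener
# context.set_debug_state("my_tensor", True)   # turn this debug tensor on at runtime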

Weight Streaming#

The weight streaming feature allows you to offload some weights from device memory to host memory. During network execution, these weights are streamed from the host to the device as needed. This technique can free up device memory, enabling you to run larger models or process larger batch sizes.

To enable this feature, during engine building, create a network with the kSTRONGLY_TYPED flag and set the kWEIGHT_STREAMING flag in the builder config:

builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kSTRONGLY_TYPED));
config->setFlag(BuilderFlag::kWEIGHT_STREAMING);

builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
config.set_flag(trt.BuilderFlag.WEIGHT_STREAMING)

During runtime, deserialization allocates a host buffer to store all the weights instead of uploading them directly to the device. This can increase the host’s peak memory usage. You can use IStreamReaderV2 to deserialize directly from the engine file, avoiding the need for a temporary buffer, which helps reduce peak memory usage. IStreamReaderV2 replaces the existing IStreamReader deserialization method.

After deserializing the engine, set the device memory budget for weights by:

engine->setWeightStreamingBudgetV2(size);

engine.weight_streaming_budget_v2 = size

The following APIs can help to determine the budget:

  • getStreamableWeightsSize() returns the total size of streamable weights.

  • getWeightStreamingScratchMemorySize() returns the extra scratch memory size for a context when weight streaming is enabled.

  • getDeviceMemorySizeV2() returns the total scratch memory size required by a context. If this API is called before enabling weight streaming by setWeightStreamingBudgetV2(), the return value will not include the extra scratch memory size required by weight streaming, which can be obtained using getWeightStreamingScratchMemorySize(). Otherwise, it will include this extra memory.

Additionally, you can combine information about the current free device memory size, context number, and other allocation needs.

TensorRT can also automatically determine a memory budget by getWeightStreamingAutomaticBudget(). However, due to limited information about the user’s specific memory allocation requirements, this automatically determined budget may be suboptimal and potentially lead to out-of-memory errors.

If the budget set by setWeightStreamingBudgetV2 is larger than the total size of streamable weights obtained by getStreamableWeightsSize(), the budget will be clipped to the total size, effectively disabling weight streaming.

You can query the budget that has been set using getWeightStreamingBudgetV2().

The budget can be adjusted by setting it again when there is no active context for the engine.

After setting the budget, TensorRT will automatically determine which weights to retain on the device memory to maximize the overlap between computation and weight fetching.
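A short Python sketch of setting the budget after deserialization; streamable_weights_size is assumed to be the Python counterpart of getStreamableWeightsSize(), and the engine path is a placeholder:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("weight_streaming.plan", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())

total = engine.streamable_weights_size          # assumed property name
engine.weight_streaming_budget_v2 = total // 2  # keep roughly half the weights resident on the GPU

context = engine.create_execution_context()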

Cross-Platform Compatibility#

By default, TensorRT engines can only be executed on the same platform (operating system and CPU architecture) where they were built. With build-time configuration, engines can be built to be compatible with other types of platforms. For example, to build an engine on Linux x86_64 platforms and expect the engine to run on Windows x86_64 platforms, configure the IBuilderConfig as follows:

config->setRuntimePlatform(nvinfer1::RuntimePlatform::kWINDOWS_AMD64);

The cross-platform engine might have performance differences from the natively built engine on the target platform. Additionally, it cannot run on the host platform it was built on.

When building a cross-platform engine that also requires version forward compatibility, kEXCLUDE_LEAN_RUNTIME must be set to exclude the target platform lean runtime.
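In Python, the same configuration might look like this, assuming runtime_platform and the RuntimePlatform enum are the counterparts of the C++ setRuntimePlatform API:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Build on Linux x86_64, target Windows x86_64; the engine will not run on the build host.
config.runtime_platform = trt.RuntimePlatform.WINDOWS_AMD64  # assumed enum/property names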

Tiling Optimization#

Tiling optimization enables cross-kernel tiled inference. This technique leverages on-chip caching for continuous kernels in addition to kernel-level tiling. It can significantly enhance performance on platforms constrained by memory bandwidth.

To activate tiling optimization, perform the following steps:

  1. Set the tiling optimization level. Use the following API to specify the duration TensorRT should dedicate to searching for a more effective tiling solution that could improve performance:

    builderConfig->setTilingOptimizationLevel(level);
    

    The optimization level is set to 0 by default, which means TensorRT will not perform any tiling optimization.

    Increasing the level enables TensorRT to explore various strategies and larger search spaces for enhanced performance. However, note that this may significantly increase the engine build time.

  2. Configure the L2 cache limit for tiling. Use the following API to provide TensorRT with an estimate of the L2 cache resources that can be allocated for the current engine during runtime:

    builderConfig->setL2LimitForTiling(size);
    

    This API is a hint that tells TensorRT how much L2 cache can be considered dedicated to the current TensorRT engine at runtime. This helps TensorRT apply a better tiling solution when multiple tasks run concurrently on one GPU. Note that the usage of the L2 cache depends on the workload and heuristics; TensorRT may not apply this limit for all layers.

    TensorRT manages the default value.