GPU Fallback Mode
The GPUFallbackMode setting directs the builder to use the GPU when a layer marked to run on DLA cannot run on DLA. A layer cannot run on DLA for any of the following reasons:
The layer operation is not supported on DLA.
The parameters specified are out of the supported range for DLA.
The given batch size exceeds the maximum permissible DLA batch size. For more information, refer to the DLA Supported Layers and Restrictions section.
A combination of layers in the network causes the internal state to exceed what the DLA can support.
There are no DLA engines available on the platform.
Important
When GPU fallback is enabled, layers that cannot run on DLA silently fall back to GPU execution without an error. When GPU fallback is disabled, an error is emitted instead. Verify DLA layer assignments in the engine inspector output to confirm expected behavior.
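As a minimal sketch of how this might be configured through the TensorRT Python API: the function name `enable_dla_with_gpu_fallback` and the `dla_core` default are illustrative choices, and `config` is assumed to be an `IBuilderConfig` obtained from `builder.create_builder_config()`.

```python
def enable_dla_with_gpu_fallback(config, dla_core=0):
    """Route layers to DLA by default and let unsupported ones fall back to GPU.

    `config` is assumed to be a tensorrt.IBuilderConfig (hypothetical caller setup).
    """
    # Imported here so the sketch can be loaded on machines without TensorRT.
    import tensorrt as trt

    # Run every layer on DLA unless a per-layer device type overrides it.
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core
    # Without this flag, a layer that cannot run on DLA produces an error instead.
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
    return config
```

Because the fallback is silent, it is still worth checking the built engine's layer assignments after enabling the flag.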
I/O Formats on DLA
DLA supports formats that are unique to the device and have constraints on their layout due to vector width byte requirements.
For DLA input tensors, kDLA_LINEAR (FP16, INT8), kDLA_HWC4 (FP16, INT8), kCHW16 (FP16), and kCHW32 (INT8) are supported.
For DLA output tensors, only kDLA_LINEAR (FP16, INT8), kCHW16 (FP16), and kCHW32 (INT8) are supported.
For kCHW16 and kCHW32 formats, if C is not an integer multiple of the vector size (16 for kCHW16, 32 for kCHW32), it must be padded up to the next multiple, which corresponds to the next 32-byte boundary.
For kDLA_LINEAR format, the stride along the W dimension must be padded up to a multiple of 64 bytes. The memory format is equivalent to a C array with dimensions [N][C][H][roundUp(W, 64/elementSize)], where elementSize is 2 for FP16 and 1 for INT8, with the tensor coordinates (n, c, h, w) mapping to array subscript [n][c][h][w].
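The padding rules for kCHW16, kCHW32, and kDLA_LINEAR can be sketched in a few lines of plain Python. The helper names below are hypothetical and not part of TensorRT. The sketch assumes a vector size of 16 channels for kCHW16 and 32 for kCHW32, and an element size of 2 bytes for FP16 and 1 byte for INT8.

```python
def round_up(value, multiple):
    """Return the smallest multiple of `multiple` that is >= `value`."""
    return -(-value // multiple) * multiple


def padded_channels(c, vector_size):
    """Channel count after padding for kCHW16 (vector_size=16) or kCHW32 (32)."""
    return round_up(c, vector_size)


def dla_linear_shape(n, c, h, w, element_size):
    """Array dimensions [N][C][H][paddedW] for kDLA_LINEAR.

    element_size is 2 for FP16 and 1 for INT8.
    """
    return (n, c, h, round_up(w, 64 // element_size))


def dla_linear_offset_bytes(shape, coord, element_size):
    """Byte offset of tensor coordinate (n, c, h, w) in a kDLA_LINEAR buffer."""
    _, c_dim, h_dim, w_padded = shape
    n, c, h, w = coord
    return (((n * c_dim + c) * h_dim + h) * w_padded + w) * element_size


# FP16 tensor of shape (1, 3, 2, 100): W is padded to a multiple of 32 elements.
shape = dla_linear_shape(1, 3, 2, 100, element_size=2)
print(shape)                                            # (1, 3, 2, 128)
print(dla_linear_offset_bytes(shape, (0, 1, 1, 5), 2))  # 778
print(padded_channels(3, 16), padded_channels(40, 32))  # 16 64
```

In an application, the values computed this way are best treated as a cross-check against the strides TensorRT itself reports, not as a replacement for them.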
For kDLA_HWC4 format, the stride along the W dimension must be a multiple of 32 bytes on Xavier and 64 bytes on NVIDIA Orin.
When C == 1, TensorRT maps the format to the native grayscale image format.
When C == 3 or C == 4, it maps to the native color image format. If C == 3, the stride for stepping along the W axis must be padded to 4 elements. In this case, the padded channel is located at the 4th index. Ideally, the padding value does not matter because the DLA compiler pads the 4th channel in the weights to zero; however, it is safe for the application to allocate a zero-filled buffer of four channels and populate three valid channels.
When C is {1, 3, 4}, the padded C' is {1, 4, 4} respectively, and the memory layout is equivalent to a C array with dimensions [N][H][roundUp(W, 32/C'/elementSize)][C'], where elementSize is 2 for FP16 and 1 for INT8. The tensor coordinates (n, c, h, w) map to array subscript [n][h][w][c]. Here roundUp(W, m) calculates the smallest multiple of m greater than or equal to W; the 32 in the expression corresponds to the 32-byte stride requirement on Xavier, and 64 takes its place on NVIDIA Orin.
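The kDLA_HWC4 layout and the zero-filled fourth channel can be illustrated with a small sketch in plain Python. The helper names are hypothetical and not part of TensorRT. The `stride_bytes` parameter is an assumption that stands for the W-stride requirement stated above: 32 bytes on Xavier, 64 bytes on NVIDIA Orin.

```python
def round_up(value, multiple):
    """Return the smallest multiple of `multiple` that is >= `value`."""
    return -(-value // multiple) * multiple


def hwc4_shape(n, c, h, w, element_size, stride_bytes=32):
    """Array dimensions [N][H][paddedW][C'] for kDLA_HWC4.

    element_size is 2 for FP16 and 1 for INT8; stride_bytes is 32 or 64.
    """
    if c not in (1, 3, 4):
        raise ValueError("kDLA_HWC4 requires C to be 1, 3, or 4")
    padded_c = 1 if c == 1 else 4
    w_multiple = stride_bytes // padded_c // element_size
    return (n, h, round_up(w, w_multiple), padded_c)


def pack_int8_rgb(image, stride_bytes=32):
    """Copy a 3-channel INT8 image into a zero-filled 4-channel HWC4 buffer.

    `image` is a list of rows, each a list of 3-tuples of ints in [-128, 127].
    """
    h, w = len(image), len(image[0])
    _, _, padded_w, padded_c = hwc4_shape(1, 3, h, w, 1, stride_bytes)
    buf = bytearray(h * padded_w * padded_c)  # starts out zero-filled
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            base = (y * padded_w + x) * padded_c
            for ch, value in enumerate(pixel):
                buf[base + ch] = value & 0xFF  # store as two's-complement byte
    return buf


print(hwc4_shape(1, 3, 8, 10, element_size=2))                   # (1, 8, 12, 4)
print(hwc4_shape(1, 3, 8, 10, element_size=2, stride_bytes=64))  # (1, 8, 16, 4)
buf = pack_int8_rgb([[(1, 2, 3), (4, 5, 6)]])
print(len(buf), list(buf[:8]))  # 32 [1, 2, 3, 0, 4, 5, 6, 0]
```

Starting from a zero-filled buffer keeps the fourth channel and the W padding at zero without any extra bookkeeping.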
When kDLA_HWC4 is used as the DLA input format, the following requirements apply:
C must be 1, 3, or 4.
The first layer must be convolution.
The convolution parameters must meet DLA requirements. For more information, refer to the DLA Supported Layers and Restrictions section.
When GPU fallback is enabled, TensorRT can insert reformatting layers to meet the DLA requirements. Otherwise, the input and output formats must be compatible with DLA. In all cases, the strides that TensorRT expects data to be formatted with can be obtained by querying IExecutionContext::getStrides.