DLA Supported Layers and Restrictions

This section lists the layers supported by DLA and the constraints associated with each layer.

General Restrictions

The following restrictions apply to all layers while running on DLA:

  • The maximum supported batch size is 4096.

  • The maximum supported size for non-batch dimensions is 8192.

  • DLA does not support dynamic dimensions, so the profile’s min, max, and opt values must be equal for wildcard dimensions.

  • The runtime dimensions must be the same as the dimensions used for building.

  • TensorRT can split a network into multiple DLA loadables if any intermediate layers cannot run on DLA and GPUFallback is enabled. If GPUFallback is not enabled, TensorRT emits an error instead. For more information, refer to the GPU Fallback Mode section.

  • Due to hardware and software memory limitations, at most 16 DLA loadables can be loaded concurrently per core.

  • Each layer must have the same batch size within a single DLA loadable. Layers with different batch sizes will be partitioned into separate DLA graphs.

Note

Batch size for DLA is the product of all index dimensions except the CHW dimensions. For example, if input dimensions are NPQRS, the effective batch size is N*P.
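
The batch-size rule and the two size limits above can be checked with a few lines of Python. This is a minimal sketch, not part of TensorRT: the function names are hypothetical, and it assumes the last three dimensions are C, H, and W and that a wildcard dimension appears as -1.

```python
from math import prod

MAX_BATCH = 4096      # maximum supported batch size
MAX_NON_BATCH = 8192  # maximum size of each non-batch (C, H, W) dimension


def effective_batch_size(dims):
    """Product of all leading dimensions, excluding the trailing C, H, W."""
    if len(dims) < 3:
        raise ValueError("expected at least C, H, and W dimensions")
    return prod(dims[:-3])


def check_general_limits(dims):
    """Return a list of violated general restrictions (empty if none)."""
    if any(d < 1 for d in dims):
        # Assumption: wildcard (dynamic) dimensions appear as -1.
        return ["dimensions must be static and positive"]
    problems = []
    batch = effective_batch_size(dims)
    if batch > MAX_BATCH:
        problems.append(f"effective batch size {batch} exceeds {MAX_BATCH}")
    for name, d in zip("CHW", dims[-3:]):
        if d > MAX_NON_BATCH:
            problems.append(f"{name} dimension {d} exceeds {MAX_NON_BATCH}")
    return problems
```

For an NPQRS input of shape (2, 3, 8, 16, 16), `effective_batch_size` returns 2*3 = 6, which matches the N*P rule in the note.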

Layer Support and Restrictions

The following table summarizes layer support and key restrictions for layers running on DLA:

Table 12 DLA Layer Support Summary

| Layer | Precision | Key Limits |
|---|---|---|
| Activation | FP16, INT8 | 2D only; ReLU, Sigmoid, TanH, Clipped ReLU, Leaky ReLU |
| Comparison (Equal, Greater, Less) | INT8 | Output must be cast to FP16/INT8 immediately |
| Concatenation | FP16, INT8 | Channel axis only; ≥ 2 inputs; same spatial dims |
| Convolution / Fully Connected | FP16, INT8 | 2D only; kernel [1,32]; stride [1,8]; channels [1,8192] |
| Deconvolution | FP16, INT8 | 2D only; no grouped/dilated; padding must be 0 |
| ElementWise | FP16, INT8 | 2D only; Sum, Sub, Product, Max, Min, Div, Pow |
| LRN | FP16 (INT8→FP16) | Window 3/5/7/9; ACROSS_CHANNELS only |
| Parametric ReLU | FP16, INT8 | Slope must be build-time constant, same rank as input |
| Pooling | FP16, INT8 | 2D only; MAX, AVERAGE; window [1,8]; stride [1,16] |
| Reduce | FP16, INT8 | 4D input only; MAX operation; CHW axes |
| Resize | FP16, INT8 | Exactly 4 scales; nearest [1,32], bilinear [1,4] |
| Scale | FP16, INT8 | 2D only; Uniform, Per-Channel, ElementWise |
| Shuffle | FP16, INT8 | 4D input only; dims [1,8192]; no batch transpose |
| Slice | FP16, INT8 | 4D input; CHW dims; static slicing only |
| Softmax | FP16, INT8 | Orin only; dims [1,8192]; axis dim ≤ 1024 (optimized mode) |
| Unary | INT8 | ABS, SIN, COS, ATAN; dims [1,8192] |

Refer to each layer’s detailed restrictions below for the full set of constraints.

Activation layer
  • Only operations with two spatial dimensions are supported.

  • Both FP16 and INT8 are supported.

  • Functions supported: ReLU, Sigmoid, TanH, Clipped ReLU, and Leaky ReLU.

    • A negative slope is not supported for ReLU.

    • Clipped ReLU only supports values in the range [1, 127].

    • INT8 TanH and Sigmoid are supported by automatically upgrading to FP16.

Comparison operations (Equal, Greater, Less)
  • Only INT8 layer precision and INT8 inputs are supported, except for constant inputs, which must be of the FP32 type and filled with the same value.

  • DLA requires that the comparison operation output be FP16 or INT8 type, so the comparison layer must be immediately followed by a Cast operation (IIdentityLayer or ICastLayer) to FP16 or INT8 and should have no direct consumers other than this Cast operation.

  • For both the ElementWise comparison layer and the subsequent IIdentityLayer or ICastLayer mentioned above, explicitly set the device type to DLA and the precision to INT8. Otherwise, these layers will run on the GPU.

  • Even with GPU fallback allowed, engine construction can still fail in some cases, such as when DLA loadable compilation fails. If this happens, unset the device types and/or precisions of both the ElementWise comparison layer and the IIdentityLayer or ICastLayer so that both are offloaded to the GPU.
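
The pin-then-release procedure described above can be sketched as two small helpers. This is an illustrative sketch, not an official recipe: the helper names are hypothetical, and the device type and data type are passed in as arguments so the code runs without TensorRT installed. With the TensorRT Python API, `config` would be an `IBuilderConfig`, and the last two arguments would be `trt.DeviceType.DLA` and `trt.int8`.

```python
def pin_comparison_to_dla(config, compare_layer, cast_layer,
                          dla_device, int8_dtype):
    """Pin the comparison layer and its cast to DLA with INT8 precision."""
    for layer in (compare_layer, cast_layer):
        config.set_device_type(layer, dla_device)  # per-layer device type
        layer.precision = int8_dtype               # per-layer precision


def release_comparison_to_gpu(config, compare_layer, cast_layer):
    """Undo the pinning so both layers can be offloaded to the GPU."""
    for layer in (compare_layer, cast_layer):
        config.reset_device_type(layer)
        layer.reset_precision()
```

Both layers are always handled together because the restrictions above require the comparison and its cast to land on the same device.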

Concatenation layer
  • DLA supports concatenation only along the channel axis.

  • Concat must have at least two inputs.

  • All the inputs must have the same spatial dimensions.

  • Both FP16 and INT8 are supported.

  • In INT8 mode, all inputs must have the same dynamic range.

  • In INT8 mode, the dynamic range of the output must equal that of each input.

Convolution and Fully Connected layers
  • Only operations with two spatial dimensions are supported.

  • Both FP16 and INT8 are supported.

  • Each dimension of the kernel size must be in the range [1, 32].

  • Padding must be in the range [0, 31].

  • Dimensions of padding must be less than the corresponding kernel dimension.

  • Dimensions of stride must be in the range [1, 8].

  • The number of output maps must be in the range [1, 8192].

  • The number of input channels must be in the range [1, 8192].

  • For operations using the TensorFormat::kDLA_LINEAR, TensorFormat::kCHW16, and TensorFormat::kCHW32 formats, the number of groups must be in the range [1, 8192].

  • For operations using the TensorFormat::kDLA_HWC4 format, the number of groups must be in the range [1, 4].

  • The dilation must be in the range [1, 32].

  • Operations are not supported if the CBUF size requirement, wtBanksForOneKernel + minDataBanks, exceeds the numConvBufBankAllotted limit of 16. Here, CBUF is the internal convolution cache that stores input weights and activations before operating on them, wtBanksForOneKernel is the minimum number of banks needed to store the weight/kernel elements of one kernel for convolution, and minDataBanks is the minimum number of banks needed to store the activation data for convolution. Details are displayed in the logging output when a convolution layer fails validation due to CBUF constraints.
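
The numeric limits in this list lend themselves to a pre-flight check before building. The sketch below is a hypothetical helper, not a TensorRT API. It covers only the range rules listed above and deliberately omits the CBUF bank check, because the bank computation is not given in this section.

```python
def _in_range(values, lo, hi):
    return all(lo <= v <= hi for v in values)


def check_dla_convolution(kernel, padding, stride, dilation,
                          in_channels, out_maps, groups=1, hwc4=False):
    """Return a list of violated restrictions (empty if none).

    kernel, padding, stride, and dilation are (height, width) pairs.
    hwc4=True applies the kDLA_HWC4 group limit; otherwise the limit for
    kDLA_LINEAR, kCHW16, and kCHW32 applies.
    """
    for name, pair in (("kernel", kernel), ("padding", padding),
                       ("stride", stride), ("dilation", dilation)):
        if len(pair) != 2:
            return [f"{name} must have exactly two spatial dimensions"]
    problems = []
    if not _in_range(kernel, 1, 32):
        problems.append("kernel must be in the range [1, 32]")
    if not _in_range(padding, 0, 31):
        problems.append("padding must be in the range [0, 31]")
    if any(p >= k for p, k in zip(padding, kernel)):
        problems.append("padding must be less than the kernel dimension")
    if not _in_range(stride, 1, 8):
        problems.append("stride must be in the range [1, 8]")
    if not _in_range(dilation, 1, 32):
        problems.append("dilation must be in the range [1, 32]")
    if not 1 <= in_channels <= 8192:
        problems.append("input channels must be in the range [1, 8192]")
    if not 1 <= out_maps <= 8192:
        problems.append("output maps must be in the range [1, 8192]")
    max_groups = 4 if hwc4 else 8192
    if not 1 <= groups <= max_groups:
        problems.append(f"groups must be in the range [1, {max_groups}]")
    return problems
```

A 3x3 kernel with padding (1, 1) and stride (1, 1) passes, while the same kernel with padding (3, 3) is rejected because the padding is not smaller than the kernel.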

Deconvolution layer
  • Only two spatial dimensions are supported.

  • Both FP16 and INT8 are supported.

  • The kernel dimensions and strides must be in the range [1, 32], or must be 1x[64, 96, 128] or [64, 96, 128]x1.

  • TensorRT has disabled deconvolution square kernels and strides in the range [23, 32] on DLA because they significantly slow down compilation.

  • The padding must be 0.

  • The number of groups must be 1 (grouped deconvolution is not supported).

  • The dilation must be 1 (dilated deconvolution is not supported).

  • The number of input channels must be in the range [1, 8192].

  • The number of output channels must be in the range [1, 8192].
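
The kernel and stride rule is the least obvious item in this list, so a small sketch may help. It rests on two assumptions that the text does not state: that "square" means height equals width, and that the same shape rule applies to kernels and strides independently. The function names are hypothetical.

```python
LONG_SIDES = (64, 96, 128)


def deconv_pair_ok(pair):
    """Check one (height, width) kernel or stride pair."""
    h, w = pair
    if 1 <= h <= 32 and 1 <= w <= 32:
        # Assumption: square shapes with a side in [23, 32] are the disabled ones.
        return not (h == w and h >= 23)
    # Otherwise only 1x{64, 96, 128} and {64, 96, 128}x1 are allowed.
    return (h == 1 and w in LONG_SIDES) or (w == 1 and h in LONG_SIDES)


def check_dla_deconvolution(kernel, stride, padding=(0, 0), groups=1,
                            dilation=(1, 1), in_channels=1, out_channels=1):
    """Return a list of violated restrictions (empty if none)."""
    problems = []
    if not deconv_pair_ok(kernel):
        problems.append("unsupported kernel size")
    if not deconv_pair_ok(stride):
        problems.append("unsupported stride")
    if any(p != 0 for p in padding):
        problems.append("padding must be 0")
    if groups != 1:
        problems.append("groups must be 1")
    if any(d != 1 for d in dilation):
        problems.append("dilation must be 1")
    if not 1 <= in_channels <= 8192:
        problems.append("input channels must be in the range [1, 8192]")
    if not 1 <= out_channels <= 8192:
        problems.append("output channels must be in the range [1, 8192]")
    return problems
```

Under these assumptions, a 24x16 kernel is accepted, a 24x24 kernel is rejected, and a 1x64 kernel is accepted through the long-side exception.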

ElementWise layer
  • Only operations with two spatial dimensions are supported.

  • Both FP16 and INT8 are supported.

  • Operations supported: Sum, Sub, Product, Max, Min, Div, Pow, Equal, Greater, and Less (described separately).

  • Broadcasting is supported when one of the operands has one of the following shape configurations:

    • NCHW (that is, shapes equal)

    • NC11 (that is, N and C equal, H and W are 1)

    • N111 (that is, N equal, C, H, and W are 1)

  • Div operation

    • The first input (dividend) can be INT8, FP16, or an FP32 constant. The second input (divisor) must be INT8 or an FP32 constant.

    • If one of the inputs is constant, all values of its weights must be the same, and the other input must be a non-constant INT8 tensor.

  • Pow operation

    • One input must be an FP32 constant filled with the same value; the other must be an INT8 non-constant.
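
The three broadcast configurations listed for the ElementWise layer can be expressed as a shape predicate. This is a minimal sketch with a hypothetical function name. It assumes both operands are 4D NCHW shapes and that either operand may be the broadcast one.

```python
def dla_broadcast_ok(shape_a, shape_b):
    """True if one operand is NCHW-equal, NC11, or N111 relative to the other."""
    a, b = tuple(shape_a), tuple(shape_b)
    if len(a) != 4 or len(b) != 4:
        return False

    def fits(full, small):
        n, c, _, _ = full
        return small in (full, (n, c, 1, 1), (n, 1, 1, 1))

    return fits(a, b) or fits(b, a)
```

For example, shapes (1, 8, 4, 4) and (1, 8, 1, 1) are accepted as the NC11 case, while (1, 8, 4, 4) and (1, 1, 4, 4) are rejected because broadcasting only along C is not one of the listed configurations.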

LRN (Local Response Normalization) layer
  • Allowed window sizes are 3, 5, 7, or 9.

  • The normalization region supported is ACROSS_CHANNELS.

  • LRN INT8 is supported by auto-upgrading to FP16.

Parametric ReLU layer
  • Slope input must be a build-time constant with the same rank as the input tensor.

Pooling layer
  • Only operations with two spatial dimensions are supported.

  • Both FP16 and INT8 are supported.

  • Operations supported: kMAX, kAVERAGE.

  • Dimensions of the window must be in the range [1, 8].

  • Dimensions of padding must be in the range [0, 7].

  • Dimensions of stride must be in the range [1, 16].

  • In INT8 mode, input and output tensor scales must be the same.

Reduce layer
  • Only supports 4D input tensors.

  • All input non-batch dimensions must be in the range [1, 8192].

  • Both FP16 and INT8 are supported.

  • Only supports MAX operation type where any combination of the CHW axes is reduced.

Resize layer
  • The number of scales must be exactly 4.

  • The first two scale elements must be exactly 1 (for unchanged batch and channel dimensions).

  • The last two elements in scales, representing the scale values along height and width dimensions, respectively, must be integer values in the range of [1, 32] in nearest-neighbor mode and [1, 4] in bilinear mode.

  • In INT8 mode with bilinear resize, when the input dynamic range is larger than the output dynamic range, the layer is upgraded to FP16 to preserve accuracy. This can increase latency.
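
The scale rules for the Resize layer can be validated up front. The following sketch uses a hypothetical function name and the mode labels "nearest" and "bilinear", which are chosen here for illustration and are not TensorRT enum names.

```python
def check_dla_resize_scales(scales, mode):
    """Return a list of violated restrictions (empty if none)."""
    limits = {"nearest": 32, "bilinear": 4}
    if mode not in limits:
        raise ValueError(f"unknown mode: {mode!r}")
    if len(scales) != 4:
        return ["exactly 4 scales are required"]
    problems = []
    if scales[0] != 1 or scales[1] != 1:
        problems.append("batch and channel scales must be exactly 1")
    for name, s in zip(("height", "width"), scales[2:]):
        if not float(s).is_integer() or not 1 <= s <= limits[mode]:
            problems.append(
                f"{name} scale must be an integer in [1, {limits[mode]}]")
    return problems
```

Scales of [1, 1, 2, 2] pass in either mode, whereas [1, 1, 8, 8] passes only in nearest-neighbor mode.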

Scale layer
  • Only operations with two spatial dimensions are supported.

  • Both FP16 and INT8 are supported.

  • Modes supported: Uniform, Per-Channel, and ElementWise.

  • Only scale and shift operations are supported.

Shuffle layer
  • Only supports 4D input tensors.

  • All input non-batch dimensions must be in the range [1, 8192].

  • Note that DLA decomposes the layer into standalone transpose and reshape operations. This means that the above restrictions apply individually to each decomposed operation.

  • Batch dimensions cannot be involved in either reshapes or transposes.

Slice layer
  • Both FP16 and INT8 are supported.

  • It supports batch sizes up to the general DLA maximum.

  • All input non-batch dimensions must be in the range [1, 8192].

  • Only supports 4D inputs and slicing at CHW dimensions.

  • Only supports static slicing, so slice parameters must be provided statically using TensorRT ISliceLayer setter APIs or as constant input tensors.

Softmax layer
  • Only supported on NVIDIA Orin, not Xavier.

  • All input non-batch dimensions must be in the range [1, 8192].

  • The axis must be one of the non-batch dimensions.

  • Supports FP16 and INT8 precision.

  • Internally, there are two modes, and the mode is selected based on the given input tensor shape.

    • The accurate mode is triggered when all non-batch, non-axis dimensions are 1.

    • The optimized mode allows the non-batch, non-axis dimensions to be greater than 1 but restricts the axis dimension to at most 1024 and involves an approximation that can cause a small error in the output. The magnitude of the error increases as the size of the axis dimension approaches 1024.
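
The mode selection described above depends only on the input shape and the axis, so it can be predicted before building. This is a sketch of that rule as stated here, not DLA's actual implementation. The function name and the "unsupported" label are hypothetical, and it assumes the last three dimensions are the non-batch C, H, and W dimensions.

```python
def dla_softmax_mode(dims, axis):
    """Return 'accurate', 'optimized', or 'unsupported' for a Softmax input."""
    rank = len(dims)
    if rank < 3:
        raise ValueError("expected at least C, H, and W dimensions")
    axis %= rank  # allow negative axis indices
    non_batch = range(rank - 3, rank)
    if axis not in non_batch:
        return "unsupported"  # the axis must be a non-batch dimension
    if not all(1 <= dims[i] <= 8192 for i in non_batch):
        return "unsupported"
    others = [dims[i] for i in non_batch if i != axis]
    if all(d == 1 for d in others):
        return "accurate"
    if dims[axis] <= 1024:
        return "optimized"
    return "unsupported"
```

A classifier head of shape (1, 1000, 1, 1) with the softmax over C falls into the accurate mode, while a segmentation output of shape (1, 21, 64, 64) falls into the optimized mode and is therefore subject to the approximation error.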

Unary layer
  • DLA supports ABS, SIN, COS, and ATAN operation types.

  • For SIN, COS, and ATAN, input precision must be INT8.

  • All input non-batch dimensions must be in the range [1, 8192].

Inference on NVIDIA Orin

Due to the difference in hardware specifications between NVIDIA Orin and Xavier DLA, FP16 convolution operations on NVIDIA Orin can experience an increase in latency of up to 2x.

On NVIDIA Orin, DLA stores weights for non-convolution operations (FP16 and INT8) inside the loadable as FP19 values (which use 4-byte containers). The channel dimensions are padded to multiples of either 16 (FP16) or 32 (INT8) for those FP19 values. Especially in the case of large per-element Scale, Add, or Sub operations, this can inflate the size of the DLA loadable and, in turn, the engine that contains it. Graph optimization can unintentionally trigger this behavior by changing the type of a layer, such as when an ElementWise multiplication layer with a constant layer as weights is fused into a scale layer.
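
A rough estimate shows how quickly this padding adds up. The sketch below is a back-of-the-envelope calculation, not a description of the actual loadable layout: the function name is hypothetical, and it assumes a per-element weight tensor of shape C x H x W where only the channel dimension is padded and every element occupies one 4-byte container.

```python
def padded_fp19_weight_bytes(channels, height, width, precision):
    """Estimate stored weight size for a per-element non-convolution op."""
    align = {"fp16": 16, "int8": 32}[precision]
    padded_channels = -(-channels // align) * align  # round up to a multiple
    return padded_channels * height * width * 4      # 4-byte FP19 containers
```

Under these assumptions, per-element weights of shape 3 x 1024 x 1024 take 16 * 1024 * 1024 * 4 bytes (64 MiB) in FP16 mode and twice that in INT8 mode, compared with 6 MiB for the unpadded FP16 values.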