DLA Supported Layers and Restrictions

This section lists the layers supported by DLA and the constraints associated with each layer.

General Restrictions

The following restrictions apply to all layers while running on DLA:

The maximum supported batch size is 4096.
The maximum supported size for non-batch dimensions is 8192.
DLA does not support dynamic dimensions, so the profile’s min, max, and opt values must be equal for wildcard dimensions.
The runtime dimensions must be the same as the dimensions used for building.
TensorRT can split a network into multiple DLA loadables if any intermediate layers cannot run on DLA and GPUFallback is enabled. If GPUFallback is not enabled, TensorRT emits an error instead of falling back. For more information, refer to the GPU Fallback Mode section.
Due to hardware and software memory limitations, at most 16 DLA loadables can be loaded concurrently per core.
Each layer must have the same batch size within a single DLA loadable. Layers with different batch sizes will be partitioned into separate DLA graphs.
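
The static-shape requirements above can be checked before building. The following is a minimal sketch in plain Python, not TensorRT API code: the helper names `is_static_profile` and `matches_build_shape` are hypothetical, and shapes are assumed to be plain tuples of integers.

```python
from typing import Sequence


def is_static_profile(min_shape: Sequence[int],
                      opt_shape: Sequence[int],
                      max_shape: Sequence[int]) -> bool:
    """Return True when the min, opt, and max shapes are identical.

    DLA does not support dynamic dimensions, so a profile is only usable
    when all three shapes agree in every dimension.
    """
    return tuple(min_shape) == tuple(opt_shape) == tuple(max_shape)


def matches_build_shape(build_shape: Sequence[int],
                        runtime_shape: Sequence[int]) -> bool:
    """Return True when the runtime shape equals the shape used for building."""
    return tuple(build_shape) == tuple(runtime_shape)
```

For example, a profile with min `(1, 3, 224, 224)`, opt `(4, 3, 224, 224)`, and max `(8, 3, 224, 224)` fails the first check because the leading dimension varies.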

Note

Batch size for DLA is the product of all index dimensions except the CHW dimensions. For example, if input dimensions are NPQRS, the effective batch size is N*P.
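
As an illustration of the note, the effective batch size and the two size limits listed under General Restrictions can be computed as follows. This is a sketch under stated assumptions: the C, H, and W dimensions are taken to be the last three entries of the shape, "non-batch dimensions" is read as those three dimensions, and the function names are hypothetical.

```python
from math import prod
from typing import List, Sequence

MAX_BATCH = 4096          # maximum supported batch size
MAX_NON_BATCH_DIM = 8192  # maximum supported size of a non-batch dimension


def effective_batch_size(dims: Sequence[int]) -> int:
    """Product of every dimension that precedes the trailing C, H, W."""
    if len(dims) < 3:
        raise ValueError("expected at least the C, H, and W dimensions")
    # prod() of an empty sequence is 1, so a bare CHW shape has batch size 1.
    return prod(dims[:-3])


def check_general_limits(dims: Sequence[int]) -> List[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    batch = effective_batch_size(dims)
    if batch > MAX_BATCH:
        problems.append(f"effective batch size {batch} exceeds {MAX_BATCH}")
    for name, size in zip("CHW", dims[-3:]):
        if size > MAX_NON_BATCH_DIM:
            problems.append(f"{name} dimension {size} exceeds {MAX_NON_BATCH_DIM}")
    return problems
```

With dimensions NPQRS given as `(2, 3, 16, 32, 32)`, `effective_batch_size` returns 2 * 3 = 6, matching the N*P rule in the note.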

Layer Support and Restrictions

The following table summarizes layer support and the key restrictions that apply to each layer while running on DLA:

Table 12 DLA Layer Support Summary
Layer	Precision	Key Limits
Activation	FP16, INT8	2D only; ReLU, Sigmoid, TanH, Clipped ReLU, Leaky ReLU
Comparison (Equal, Greater, Less)	INT8	Output must be cast to FP16/INT8 immediately
Concatenation	FP16, INT8	Channel axis only; ≥ 2 inputs; same spatial dims
Convolution / Fully Connected	FP16, INT8	2D only; kernel [1,32]; stride [1,8]; channels [1,8192]
Deconvolution	FP16, INT8	2D only; no grouped/dilated; padding must be 0
ElementWise	FP16, INT8	2D only; Sum, Sub, Product, Max, Min, Div, Pow
LRN	FP16 (INT8→FP16)	Window 3/5/7/9; ACROSS_CHANNELS only
Parametric ReLU	FP16, INT8	Slope must be build-time constant, same rank as input
Pooling	FP16, INT8	2D only; MAX, AVERAGE; window [1,8]; stride [1,16]
Reduce	FP16, INT8	4D input only; MAX operation; CHW axes
Resize	FP16, INT8	Exactly 4 scales; nearest [1,32], bilinear [1,4]
Scale	FP16, INT8	2D only; Uniform, Per-Channel, ElementWise
Shuffle	FP16, INT8	4D input only; dims [1,8192]; no batch transpose
Slice	FP16, INT8	4D input; CHW dims; static slicing only
Softmax	FP16, INT8	Orin only; dims [1,8192]; axis dim ≤ 1024 (optimized mode)
Unary	INT8	ABS, SIN, COS, ATAN; dims [1,8192]

Refer to each layer’s detailed restrictions below for the full set of constraints.
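
The numeric ranges in the summary table lend themselves to a simple table-driven pre-check. The sketch below encodes only the Convolution and Pooling rows; the dictionary layout and the `out_of_range` helper are hypothetical, and passing this check does not guarantee DLA support because the detailed per-layer restrictions still apply.

```python
from typing import Dict, Tuple

# Inclusive ranges copied from the Key Limits column of the summary table.
KEY_LIMITS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "convolution": {"kernel": (1, 32), "stride": (1, 8), "channels": (1, 8192)},
    "pooling": {"window": (1, 8), "stride": (1, 16)},
}


def out_of_range(layer: str, **params) -> Dict[str, Tuple[int, int]]:
    """Return the parameters that fall outside their range.

    Each parameter may be a single int or a tuple/list of ints (for example
    a 2D kernel size). The result maps each offending parameter name to the
    allowed (low, high) range.
    """
    limits = KEY_LIMITS[layer]
    violations = {}
    for name, value in params.items():
        low, high = limits[name]
        values = value if isinstance(value, (tuple, list)) else (value,)
        if any(not (low <= v <= high) for v in values):
            violations[name] = (low, high)
    return violations
```

For instance, `out_of_range("pooling", window=(9, 9), stride=2)` reports `window` because 9 lies outside [1,8].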

Inference on NVIDIA Orin

Because the DLA hardware specifications differ between NVIDIA Orin and Xavier, FP16 convolution operations on NVIDIA Orin can experience an increase in latency of up to 2x.

On NVIDIA Orin, DLA stores weights for non-convolution operations (FP16 and INT8) inside the loadable as FP19 values, which use 4-byte containers. The channel dimensions are padded to multiples of either 16 (FP16) or 32 (INT8) for those FP19 values. This can inflate the size of the DLA loadable, and therefore of the engine that contains it, particularly for large per-element Scale, Add, or Sub operations. Graph optimization can unintentionally trigger this behavior by changing the type of a layer, such as when an ElementWise multiplication layer with a constant layer as weights is fused into a Scale layer.
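
A rough estimate of this inflation follows from the padding rule and the 4-byte container size. The sketch below assumes one stored value per element of a C x H x W weight tensor with only the channel dimension padded; the real loadable layout may differ, and the function names are hypothetical.

```python
FP19_CONTAINER_BYTES = 4                        # each FP19 value uses a 4-byte container
CHANNEL_MULTIPLE = {"fp16": 16, "int8": 32}     # channel padding per layer precision


def padded_channels(channels: int, precision: str) -> int:
    """Round the channel count up to the next multiple for the given precision."""
    multiple = CHANNEL_MULTIPLE[precision]
    return -(-channels // multiple) * multiple  # ceiling division, then scale back


def fp19_weight_bytes(channels: int, height: int, width: int, precision: str) -> int:
    """Estimated bytes for per-element weights stored as channel-padded FP19."""
    return padded_channels(channels, precision) * height * width * FP19_CONTAINER_BYTES
```

Under these assumptions, a per-element FP16 Scale over a 3 x 1080 x 1920 tensor is estimated at 16 * 1080 * 1920 * 4 = 132,710,400 bytes (roughly 127 MiB), compared with 12,441,600 bytes for the same weights as unpadded 2-byte FP16 values.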