DLA Supported Layers and Restrictions
This section lists the layers supported by DLA and the constraints associated with each layer.
General Restrictions
The following restrictions apply to all layers while running on DLA:
The maximum supported batch size is 4096.
The maximum supported size for non-batch dimensions is 8192.
DLA does not support dynamic dimensions, so the profile’s min, max, and opt values must be equal for wildcard dimensions.
The runtime dimensions must be the same as the dimensions used for building.
TensorRT can split a network into multiple DLA loadables if any intermediate layers cannot run on DLA and GPUFallback is enabled. Otherwise, TensorRT can emit an error and fall back. For more information, refer to the GPU Fallback Mode section.
Due to hardware and software memory limitations, only 16 DLA loadables can be loaded concurrently per core.
Each layer must have the same batch size within a single DLA loadable. Layers with different batch sizes will be partitioned into separate DLA graphs.
Note
Batch size for DLA is the product of all index dimensions except the CHW dimensions. For example, if input dimensions are NPQRS, the effective batch size is N*P.
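As an illustration of the note above, the following minimal Python sketch computes the effective batch size and checks it against the general size limits. The function names are hypothetical helpers, not part of the TensorRT API, and the sketch assumes the last three dimensions of a shape are C, H, and W.

```python
from math import prod

MAX_DLA_BATCH = 4096      # maximum supported batch size listed above
MAX_NON_BATCH_DIM = 8192  # maximum supported size of a non-batch dimension


def effective_batch_size(dims):
    """Product of all leading dimensions, excluding the trailing C, H, W."""
    if len(dims) < 3:
        raise ValueError("expected at least the C, H, and W dimensions")
    return prod(dims[:-3])  # prod of an empty slice is 1


def within_general_limits(dims):
    """Check the batch and non-batch size limits for one tensor shape."""
    batch_ok = 1 <= effective_batch_size(dims) <= MAX_DLA_BATCH
    chw_ok = all(1 <= d <= MAX_NON_BATCH_DIM for d in dims[-3:])
    return batch_ok and chw_ok


# NPQRS input with N=2 and P=3: the effective batch size is N*P = 6.
print(effective_batch_size((2, 3, 16, 32, 32)))
```

A shape such as (64, 65, 3, 8, 8) would fail this check because its effective batch size, 64*65 = 4160, exceeds 4096 even though no single dimension does.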
Layer Support and Restrictions
The following table summarizes layer support and the key restrictions for each layer while running on DLA:
| Layer | Precision | Key Limits |
|---|---|---|
| Activation | FP16, INT8 | 2D only; ReLU, Sigmoid, TanH, Clipped ReLU, Leaky ReLU |
| Comparison (Equal, Greater, Less) | INT8 | Output must be cast to FP16/INT8 immediately |
| Concatenation | FP16, INT8 | Channel axis only; ≥ 2 inputs; same spatial dims |
| Convolution / Fully Connected | FP16, INT8 | 2D only; kernel [1,32]; stride [1,8]; channels [1,8192] |
| Deconvolution | FP16, INT8 | 2D only; no grouped/dilated; padding must be 0 |
| ElementWise | FP16, INT8 | 2D only; Sum, Sub, Product, Max, Min, Div, Pow |
| LRN | FP16 (INT8→FP16) | Window 3/5/7/9; ACROSS_CHANNELS only |
| Parametric ReLU | FP16, INT8 | Slope must be build-time constant, same rank as input |
| Pooling | FP16, INT8 | 2D only; MAX, AVERAGE; window [1,8]; stride [1,16] |
| Reduce | FP16, INT8 | 4D input only; MAX operation; CHW axes |
| Resize | FP16, INT8 | Exactly 4 scales; nearest [1,32], bilinear [1,4] |
| Scale | FP16, INT8 | 2D only; Uniform, Per-Channel, ElementWise |
| Shuffle | FP16, INT8 | 4D input only; dims [1,8192]; no batch transpose |
| Slice | FP16, INT8 | 4D input; CHW dims; static slicing only |
| Softmax | FP16, INT8 | Orin only; dims [1,8192]; axis dim ≤ 1024 (optimized mode) |
| Unary | INT8 | ABS, SIN, COS, ATAN; dims [1,8192] |
Refer to each layer’s detailed restrictions below for the full set of constraints.
Activation layer
Only operations with two spatial dimensions are supported.
Both FP16 and INT8 are supported.
Functions supported: ReLU, Sigmoid, TanH, Clipped ReLU, and Leaky ReLU.
A negative slope is not supported for ReLU.
Clipped ReLU only supports values in the range [1, 127].
TanH and Sigmoid in INT8 are supported by auto-upgrading to FP16.
Comparison operations (Equal, Greater, Less)
Only INT8 layer precision and INT8 inputs are supported, except for constants, which should be of the FP32 type and filled with the same value.
DLA requires that the comparison operation output be FP16 or INT8 type, so the comparison layer must be immediately followed by a Cast operation (IIdentityLayer or ICastLayer) to FP16 or INT8 and should have no direct consumers other than this Cast operation.
For both the ElementWise comparison layer and the subsequent IIdentityLayer or ICastLayer mentioned above, explicitly set their device types to DLA and their precisions to INT8. Otherwise, these layers will run on the GPU.
Even with GPU fallback allowed, you should expect failures in engine construction in some cases, such as when DLA loadable compilation fails. If this is the case, unset the device types and/or precisions of both the ElementWise comparison layer and the IIdentityLayer or ICastLayer so that both are offloaded to the GPU.
Concatenation layer
DLA supports concatenation only along the channel axis.
Concat must have at least two inputs.
All the inputs must have the same spatial dimensions.
Both FP16 and INT8 are supported.
With INT8 mode, the inputs’ dynamic range must be the same.
With INT8 mode, the dynamic range of output must be equal to each of the inputs.
Convolution and Fully Connected layers
Only operations with two spatial dimensions are supported.
Both FP16 and INT8 are supported.
Each dimension of the kernel size must be in the range [1, 32].
Padding must be in the range [0, 31].
Dimensions of padding must be less than the corresponding kernel dimension.
Dimensions of stride must be in the range [1, 8].
The number of output maps must be in the range [1, 8192].
The number of input channels must be in the range [1, 8192].
For operations using the TensorFormat::kDLA_LINEAR, TensorFormat::kCHW16, and TensorFormat::kCHW32 formats, the number of groups must be in the range [1, 8192].
For operations using the TensorFormat::kDLA_HWC4 format, the number of groups must be in the range [1, 4].
The dilation of a dilated convolution must be in the range [1, 32].
Operations are not supported if the CBUF size requirement wtBanksForOneKernel + minDataBanks exceeds the numConvBufBankAllotted limit of 16. Here, CBUF is the internal convolution cache that stores input weights and activations before operating on them, wtBanksForOneKernel is the minimum number of banks needed to store the minimum weight/kernel elements for one kernel, and minDataBanks is the minimum number of banks needed to store the minimum activation data for convolution. Details are displayed in the logging output when a convolution layer fails validation due to CBUF constraints.
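The numeric convolution limits above can be pre-checked before building an engine. The following is a minimal sketch in plain Python; check_dla_conv and its parameters are hypothetical names rather than TensorRT API calls, and the CBUF bank requirement is not modeled because its computation is internal to DLA.

```python
def check_dla_conv(kernel, padding, stride, dilation,
                   in_channels, out_maps, groups, hwc4_format=False):
    """Return a list of violated limits; an empty list means none were found.

    kernel, padding, stride, and dilation are (height, width) pairs.
    hwc4_format selects the tighter group limit for TensorFormat::kDLA_HWC4.
    """
    problems = []
    for k, p in zip(kernel, padding):
        if not 1 <= k <= 32:
            problems.append(f"kernel dimension {k} is outside [1, 32]")
        if not 0 <= p <= 31:
            problems.append(f"padding {p} is outside [0, 31]")
        if p >= k:
            problems.append(f"padding {p} is not less than kernel dimension {k}")
    if any(not 1 <= s <= 8 for s in stride):
        problems.append(f"stride {stride} is outside [1, 8]")
    if any(not 1 <= d <= 32 for d in dilation):
        problems.append(f"dilation {dilation} is outside [1, 32]")
    if not 1 <= out_maps <= 8192:
        problems.append(f"output maps {out_maps} is outside [1, 8192]")
    if not 1 <= in_channels <= 8192:
        problems.append(f"input channels {in_channels} is outside [1, 8192]")
    max_groups = 4 if hwc4_format else 8192
    if not 1 <= groups <= max_groups:
        problems.append(f"groups {groups} is outside [1, {max_groups}]")
    return problems


# A 3x3 kernel with padding 3 violates the "padding < kernel" rule.
for message in check_dla_conv((3, 3), (3, 3), (1, 1), (1, 1), 64, 128, 1):
    print(message)
```

Passing this check does not guarantee DLA placement, since a layer can still fail the CBUF validation described above.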
Deconvolution layer
Only two spatial dimensions are supported.
Both FP16 and INT8 are supported.
The kernel dimensions and strides must be in the range [1, 32] or must be 1x[64, 96, 128] or [64, 96, 128]x1.
TensorRT has disabled deconvolution square kernels and strides in the range [23, 32] on DLA as they significantly slow down compilation.
The padding must be 0.
The number of groups must be 1 (grouped deconvolution is not supported).
The dilation must be 1 (dilated deconvolution is not supported).
The number of input channels must be in the range [1, 8192].
The number of output channels must be in the range [1, 8192].
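The kernel and stride shape rule for deconvolution has more cases than a single range, so a small sketch can help. The helper below is hypothetical and rests on one stated assumption: the disabled "square kernels and strides in the range [23, 32]" are read as shapes whose height and width are equal and fall in that range.

```python
LONG_SIZES = (64, 96, 128)  # allowed only as 1xK or Kx1


def deconv_shape_allowed(height, width):
    """Check one (height, width) kernel or stride shape for DLA deconvolution."""
    # Special long shapes: 1x[64, 96, 128] or [64, 96, 128]x1.
    if (height == 1 and width in LONG_SIZES) or (width == 1 and height in LONG_SIZES):
        return True
    # Otherwise both dimensions must be in [1, 32].
    if not (1 <= height <= 32 and 1 <= width <= 32):
        return False
    # Assumption: square shapes of size 23 to 32 are the disabled ones.
    if height == width and 23 <= height <= 32:
        return False
    return True


for shape in [(4, 4), (1, 64), (24, 24), (64, 64)]:
    print(shape, deconv_shape_allowed(*shape))
```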
ElementWise layer
Only operations with two spatial dimensions are supported.
Both FP16 and INT8 are supported.
Operations supported: Sum, Sub, Product, Max, Min, Div, Pow, Equal, Greater, and Less (described separately).
Broadcasting is supported when one of the operands has one of the following shape configurations:
NCHW (that is, shapes equal)
NC11 (that is, N and C equal, H and W are 1)
N111 (that is, N equal, C, H, and W are 1)
Div operation:
The first input (dividend) can be INT8, FP16, or an FP32 constant. The second input (divisor) must be INT8 or an FP32 constant.
If one of the inputs is constant, all values of its weights must be the same, and the other input must be non-constant in INT8.
Pow operation:
One input must be an FP32 constant filled with the same value; the other must be an INT8 non-constant.
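The three broadcasting configurations listed above for ElementWise can be expressed as a simple shape comparison. This is a minimal sketch with a hypothetical helper name; it assumes both operands are described as 4D NCHW shapes and that the first argument is the full-size operand.

```python
def dla_broadcast_config(full_shape, other_shape):
    """Classify other_shape against a 4D NCHW full_shape.

    Returns "NCHW", "NC11", or "N111" for the supported configurations,
    or None when the pair matches none of them.
    """
    if len(full_shape) != 4 or len(other_shape) != 4:
        return None
    n, c, h, w = full_shape
    other = tuple(other_shape)
    if other == (n, c, h, w):
        return "NCHW"
    if other == (n, c, 1, 1):
        return "NC11"
    if other == (n, 1, 1, 1):
        return "N111"
    return None


print(dla_broadcast_config((2, 8, 4, 4), (2, 8, 1, 1)))  # NC11
print(dla_broadcast_config((2, 8, 4, 4), (1, 8, 4, 4)))  # None: N differs
```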
LRN (Local Response Normalization) layer
Allowed window sizes are 3, 5, 7, or 9.
The normalization region supported is ACROSS_CHANNELS.
LRN INT8 is supported by auto-upgrading to FP16.
Parametric ReLU layer
Slope input must be a build-time constant with the same rank as the input tensor.
Pooling layer
Only operations with two spatial dimensions are supported.
Both FP16 and INT8 are supported.
Operations supported: kMAX, kAVERAGE.
Dimensions of the window must be in the range [1, 8].
Dimensions of padding must be in the range [0, 7].
Dimensions of stride must be in the range [1, 16].
With INT8 mode, input and output tensor scales must be the same.
Reduce layer
Only supports 4D input tensors.
All input non-batch dimensions must be in the range [1, 8192].
Both FP16 and INT8 are supported.
Only supports MAX operation type where any combination of the CHW axes is reduced.
Resize layer
The number of scales must be exactly 4.
The first two scale elements must be exactly 1 (for unchanged batch and channel dimensions).
The last two elements in scales, representing the scale values along height and width dimensions, respectively, must be integer values in the range of [1, 32] in nearest-neighbor mode and [1, 4] in bilinear mode.
Note that for bilinear resize INT8 mode, when the input dynamic range is larger than the output dynamic range, the layer will be upgraded to FP16 to preserve accuracy. This can negatively affect the latency.
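The Resize scale rules above can be checked with a few lines of Python. This is a sketch only: check_dla_resize_scales and the mode strings "nearest" and "bilinear" are hypothetical names chosen for illustration, not TensorRT identifiers.

```python
def check_dla_resize_scales(scales, mode):
    """Return True if the scales satisfy the DLA Resize limits for the mode."""
    max_scale = {"nearest": 32, "bilinear": 4}
    if mode not in max_scale:
        raise ValueError(f"unknown mode: {mode}")
    # Exactly four scales: batch, channel, height, width.
    if len(scales) != 4:
        return False
    # Batch and channel dimensions must be left unchanged.
    if scales[0] != 1 or scales[1] != 1:
        return False
    # Height and width scales must be integer-valued and within the mode's range.
    return all(
        float(s).is_integer() and 1 <= s <= max_scale[mode]
        for s in scales[2:]
    )


print(check_dla_resize_scales([1, 1, 2, 2], "nearest"))
print(check_dla_resize_scales([1, 1, 8, 8], "bilinear"))
```

The first call satisfies the limits, while the second does not because bilinear mode allows height and width scales only up to 4.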
Scale layer
Only operations with two spatial dimensions are supported.
Both FP16 and INT8 are supported.
Modes supported: Uniform, Per-Channel, and ElementWise.
Only scale and shift operations are supported.
Shuffle layer
Only supports 4D input tensors.
All input non-batch dimensions must be in the range [1, 8192].
Note that DLA decomposes the layer into standalone transpose and reshape operations. This means that the above restrictions apply individually to each decomposed operation.
Batch dimensions cannot be involved in either reshapes or transposes.
Slice layer
Both FP16 and INT8 are supported.
It supports batch sizes up to the general DLA maximum.
All input non-batch dimensions must be in the range [1, 8192].
Only supports 4D inputs and slicing at CHW dimensions.
Only supports static slicing, so slice parameters must be provided statically using TensorRT ISliceLayer setter APIs or as constant input tensors.
Softmax layer
Only supported on NVIDIA Orin, not Xavier.
All input non-batch dimensions must be in the range [1, 8192].
The axis must be one of the non-batch dimensions.
Supports FP16 and INT8 precision.
Internally, there are two modes, and the mode is selected based on the given input tensor shape.
The accurate mode is triggered when all non-batch, non-axis dimensions are 1.
The optimized mode allows the non-batch, non-axis dimensions to be greater than 1 but restricts the axis dimension to at most 1024 and involves an approximation that can cause a small error in the output. The magnitude of the error increases as the size of the axis dimension approaches 1024.
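The mode selection described above for Softmax can be sketched as a shape-based decision. The helper below is hypothetical and assumes a 4D NCHW input whose dimension 0 is the batch dimension; the actual selection logic inside DLA is not public and may differ in detail.

```python
def dla_softmax_mode(shape, axis):
    """Return "accurate", "optimized", or "unsupported" for a 4D NCHW shape.

    axis is the index of the softmax axis and must be 1, 2, or 3
    (one of the non-batch dimensions).
    """
    if len(shape) != 4 or axis not in (1, 2, 3):
        return "unsupported"
    # All non-batch dimensions must be in [1, 8192].
    if not all(1 <= d <= 8192 for d in shape[1:]):
        return "unsupported"
    # Non-batch, non-axis dimensions decide the mode.
    others = [d for i, d in enumerate(shape) if i not in (0, axis)]
    if all(d == 1 for d in others):
        return "accurate"
    # Optimized mode restricts the axis dimension to at most 1024.
    if shape[axis] <= 1024:
        return "optimized"
    return "unsupported"


print(dla_softmax_mode((1, 1000, 1, 1), axis=1))   # accurate
print(dla_softmax_mode((1, 21, 64, 64), axis=1))   # optimized
print(dla_softmax_mode((1, 2048, 2, 2), axis=1))   # unsupported
```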
Unary layer
DLA supports ABS, SIN, COS, and ATAN operation types.
For SIN, COS, and ATAN, input precision must be INT8.
All input non-batch dimensions must be in the range [1, 8192].
Inference on NVIDIA Orin
Due to the difference in hardware specifications between NVIDIA Orin and Xavier DLA, FP16 convolution operations on NVIDIA Orin can experience an increase in latency of up to 2x.
On NVIDIA Orin, DLA stores weights for non-convolution operations (FP16 and INT8) inside the loadable as FP19 values (which use 4-byte containers). The channel dimensions are padded to multiples of either 16 (FP16) or 32 (INT8) for those FP19 values. Especially in the case of large per-element Scale, Add, or Sub operations, this can inflate the size of the DLA loadable and, in turn, the engine that contains it. Graph optimization can unintentionally trigger this behavior by changing the type of a layer, such as when an ElementWise multiplication layer with a constant layer as weights is fused into a Scale layer.
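To get a feel for the size inflation, the following sketch estimates the stored size of per-element weights from the two facts stated above: channels are padded to a multiple of 16 (FP16) or 32 (INT8), and each value occupies a 4-byte container. The function name is hypothetical, and the estimate assumes C x H x W weights with no padding other than on the channel dimension, so the real loadable size may differ.

```python
def padded_weight_bytes(channels, height, width, precision):
    """Estimate bytes used by per-element weights stored as FP19 values."""
    multiple = {"fp16": 16, "int8": 32}[precision]
    # Round the channel count up to the next multiple.
    padded_channels = -(-channels // multiple) * multiple
    return padded_channels * height * width * 4  # 4-byte container per value


# Per-element weights for a 3 x 224 x 224 tensor in INT8 mode:
# channels are padded from 3 to 32, giving 32 * 224 * 224 * 4 bytes.
print(padded_weight_bytes(3, 224, 224, "int8"))  # 6422528
```

Under these assumptions, weights that would occupy 3 * 224 * 224 = 150528 bytes as raw INT8 values take roughly 6.4 MB in the loadable, which is why a fused per-element Scale layer can noticeably enlarge an engine.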