Customizing DLA Memory Pools#
You can customize the size of the memory pools allocated to each DLA subnetwork in a network using IBuilderConfig::setMemoryPoolLimit (C++) or IBuilderConfig.set_memory_pool_limit (Python). There are three types of DLA memory pools (refer to the MemoryPoolType enum for details):
Managed SRAM
Behaves like a cache, and larger values can improve performance.
If no managed SRAM is available, DLA can still run by falling back to local DRAM.
On Orin, each DLA core has 1 MiB of dedicated SRAM. On Xavier, 4 MiB of SRAM is shared across multiple cores, including the 2 DLA cores.
Local DRAM
Used to store intermediate tensors in the DLA subnetwork. Larger values can allow larger subnetworks to be offloaded to DLA.
Global DRAM
Used to store weights in the DLA subnetwork. Larger values can allow larger subnetworks to be offloaded to DLA.
The memory required for each subnetwork can be less than the pool size, in which case a smaller amount will be allocated. The pool size serves only as an upper bound.
Note that all DLA memory pools require sizes that are powers of 2, with a minimum of 4 KiB. Violating this requirement results in a DLA loadable compilation failure.
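The size rule above can be checked before a build is attempted. The following is a minimal sketch, not part of the TensorRT API: the helper names (is_valid_dla_pool_size, floor_to_valid_dla_pool_size, set_dla_pool_limits) are hypothetical, and the config argument is assumed to be an IBuilderConfig created elsewhere. Only set_memory_pool_limit and the MemoryPoolType DLA entries come from TensorRT itself.

```python
KIB = 1 << 10
MIB = 1 << 20
MIN_DLA_POOL_BYTES = 4 * KIB  # minimum pool size stated in the text


def is_valid_dla_pool_size(size_bytes: int) -> bool:
    """Return True if the size is a power of 2 and at least 4 KiB."""
    is_power_of_two = size_bytes > 0 and (size_bytes & (size_bytes - 1)) == 0
    return is_power_of_two and size_bytes >= MIN_DLA_POOL_BYTES


def floor_to_valid_dla_pool_size(size_bytes: int) -> int:
    """Round down to the nearest power of 2 (hypothetical convenience helper)."""
    if size_bytes < MIN_DLA_POOL_BYTES:
        raise ValueError("DLA memory pools must be at least 4 KiB")
    return 1 << (size_bytes.bit_length() - 1)


def set_dla_pool_limits(config, managed_sram: int, local_dram: int, global_dram: int) -> None:
    """Validate the three pool sizes, then apply them to an existing builder config."""
    # Imported lazily so the pure-Python helpers above work without TensorRT installed.
    import tensorrt as trt

    requested = (
        (trt.MemoryPoolType.DLA_MANAGED_SRAM, managed_sram),
        (trt.MemoryPoolType.DLA_LOCAL_DRAM, local_dram),
        (trt.MemoryPoolType.DLA_GLOBAL_DRAM, global_dram),
    )
    for pool, size in requested:
        if not is_valid_dla_pool_size(size):
            raise ValueError(f"{pool}: {size} bytes is not a power of 2 >= 4 KiB")
        config.set_memory_pool_limit(pool, size)
```

Validating up front turns a loadable compilation failure into an immediate, readable error. The sizes passed to set_dla_pool_limits are up to the caller; none are implied by this sketch.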
In multi-subnetwork situations, remember that the pool sizes apply per DLA subnetwork, not to the whole network, so you need to account for the total amount of resources consumed. In particular, your network can consume, in aggregate, at most twice the managed SRAM pool size.
The default managed SRAM pool size is 0.5 MiB on NVIDIA Orin and 1 MiB on Xavier. Orin has a strict per-core limit, whereas Xavier has some flexibility. The Orin default guarantees that the aggregate managed SRAM consumption of your engine stays within the hardware limit in all situations. However, if your engine has only a single DLA subnetwork, it consumes only half the hardware limit, so you can gain performance by increasing the pool size to 1 MiB.
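The arithmetic behind the Orin default can be sketched as follows. This is an illustration only: the function names are hypothetical, and the bound min(number of subnetworks, 2) × pool size is an assumption that combines the "at most twice the pool size" statement with the single-subnetwork case described above.

```python
MIB = 1 << 20
ORIN_SRAM_PER_CORE = 1 * MIB  # dedicated SRAM per DLA core on Orin, per the text


def aggregate_managed_sram_bound(pool_size: int, num_dla_subnetworks: int) -> int:
    """Worst-case aggregate managed SRAM use (assumed: at most two pools at once)."""
    return min(num_dla_subnetworks, 2) * pool_size


def fits_within_limit(pool_size: int, num_dla_subnetworks: int,
                      hardware_limit: int = ORIN_SRAM_PER_CORE) -> bool:
    """Return True if the worst case stays within the given hardware limit."""
    return aggregate_managed_sram_bound(pool_size, num_dla_subnetworks) <= hardware_limit
```

Under these assumptions, a 0.5 MiB pool fits for any number of subnetworks, while a 1 MiB pool fits only when the engine contains a single DLA subnetwork.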
Determining DLA Memory Pool Usage#
After successfully compiling loadables from the given network, the builder reports the number of subnetwork candidates that were compiled into loadables and the total amount of memory used per pool by those loadables. For each subnetwork candidate that failed due to insufficient memory, the builder emits a message indicating which memory pool was insufficient. In the verbose log, the builder also reports the memory pool requirements of each loadable.
Sparsity on DLA#
DLA on the NVIDIA Orin platform supports structured sparsity (SS), which can minimize latency and maximize throughput in production.
Structured Sparsity#
Structured sparsity (SS) accelerates a 2:4 sparsity pattern along the C dimension. In each contiguous block of four values, two values must be zero along C. Generally, SS provides the most benefit for INT8 convolutions that are math-bound and have a channel dimension that is a multiple of 128.
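As an illustration of the 2:4 pattern, the sketch below checks a single run of weights taken along the C dimension. The function name is hypothetical, and two points are assumptions: "two values must be zero" is read as "at least two", and a run whose length is not a multiple of four is treated as non-conforming.

```python
from typing import Sequence


def satisfies_2_4_sparsity(values_along_c: Sequence[float]) -> bool:
    """Return True if every contiguous block of four values has at least two zeros."""
    if len(values_along_c) % 4 != 0:
        # Assumption: a partial trailing block does not conform to the pattern.
        return False
    for start in range(0, len(values_along_c), 4):
        block = values_along_c[start:start + 4]
        zero_count = sum(1 for value in block if value == 0)
        if zero_count < 2:
            return False
    return True
```

For a full weight tensor, this check would be applied to each run along C, for every output channel and kernel position.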
SS has several requirements and limitations.
Requirements
Only available for INT8 convolutions in formats other than NHWC.
The channel size must be larger than 64.
Limitations
Only convolutions whose quantized INT8 weights are at most 256K can benefit from SS; in practice, the limitation can be more restrictive.
Only convolutions with K % 64 in {0, 1, 2, 4, 8, 16, 32}, where K is the number of kernels (corresponding to the number of output channels), can benefit from SS in this release.
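The requirements and limitations above can be combined into a rough pre-check for a single convolution. This is a sketch under stated assumptions, not a TensorRT API: the function and constant names are hypothetical, "256K" is read as 256 × 1024 bytes with one byte per INT8 weight, the "channel size" is taken to be the input channel count C, and the weight count is computed as K × C × R × S. Because the text notes that the weight limit can be more restrictive in practice, an empty result does not guarantee a speedup.

```python
from typing import List

SS_ALLOWED_K_REMAINDERS = frozenset({0, 1, 2, 4, 8, 16, 32})
MAX_SS_WEIGHT_BYTES = 256 * 1024  # assumption: "256K" interpreted as 256 KiB


def structured_sparsity_blockers(k: int, c: int, r: int, s: int,
                                 is_int8: bool = True,
                                 is_nhwc: bool = False) -> List[str]:
    """List the documented conditions that a K x C x R x S convolution fails."""
    blockers = []
    if not is_int8:
        blockers.append("convolution is not INT8")
    if is_nhwc:
        blockers.append("NHWC format is not supported")
    if c <= 64:
        blockers.append("channel size must be larger than 64")
    if k * c * r * s > MAX_SS_WEIGHT_BYTES:
        blockers.append("quantized INT8 weights exceed 256K")
    if k % 64 not in SS_ALLOWED_K_REMAINDERS:
        blockers.append("K % 64 is not in {0, 1, 2, 4, 8, 16, 32}")
    return blockers
```

For example, a 3x3 convolution with K = 128 and C = 128 passes every check in this sketch, whereas the same kernel with K = 67 is flagged because 67 % 64 = 3.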