DLInference#
Overview#
The DLInference operator enables users of PVA Solutions to run ONNX models.
This operator is a thin wrapper around the network function generated by pva-onnx-compiler,
an MLIR-based tool that converts ONNX models to a C function.
This C function configures all the tasks needed to run the model, encapsulated in a C struct pva_dl_lib_engine_t.
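To make the flow concrete, here is a minimal sketch of how such a generated function might be consumed. Everything below except the struct name pva_dl_lib_engine_t is a hypothetical stand-in; the real names, signatures, and struct layout are produced by pva-onnx-compiler and will differ.

```c
/* Hypothetical sketch only: illustrates the create/run/destroy lifecycle,
 * not the actual generated API. */
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the generated C struct that bundles all configured tasks;
 * the real layout is emitted by pva-onnx-compiler. */
typedef struct {
    int num_tasks; /* illustrative field, not the real layout */
} pva_dl_lib_engine_t;

/* Stand-in for the network function pva-onnx-compiler would emit:
 * it configures every PVA task the model needs and returns them
 * bundled in one engine struct. */
static pva_dl_lib_engine_t *my_model_create_engine(void)
{
    pva_dl_lib_engine_t *engine = malloc(sizeof(*engine));
    if (engine != NULL) {
        engine->num_tasks = 0; /* real code would configure the tasks here */
    }
    return engine;
}

static void my_model_destroy_engine(pva_dl_lib_engine_t *engine)
{
    free(engine);
}

int main(void)
{
    pva_dl_lib_engine_t *engine = my_model_create_engine();
    if (engine == NULL) {
        fprintf(stderr, "engine creation failed\n");
        return 1;
    }
    /* The DLInference operator would take the configured engine from here
     * and submit its tasks for execution. */
    my_model_destroy_engine(engine);
    return 0;
}
```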
Layer Support and Limitations#
| Layer | Data Type | Limitations |
|---|---|---|
| Conv2D | INT8, FP16, FP32 | Number of output channels must be a multiple of 64 |
| Gemm/Matmul | INT8, FP32 | |
| Add | INT8, FP32 | |
| MaxPool2D | INT8, FP16 | Kernel size = 3x3 or 5x5 |
| AveragePool2D | INT8, FP32 | |
| Quantize | FP32 | Scale must be scalar |
| Dequantize | INT8 | Scale must be scalar |
| Layernorm | FP32 | |
| Self-attention | INT8 | Head size must be 32 |
The INT8 Conv2D/Gemm kernels use TensorRT’s INT8 quantization scheme, with FP32 scales.
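For reference, TensorRT-style INT8 quantization is symmetric: the zero point is fixed at 0 and a single FP32 scale maps INT8 codes back to real values, which also matches the scalar-scale requirement on the Quantize/Dequantize layers above. The sketch below is a standalone illustration of that scheme, not code taken from the generated kernels.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Symmetric, scale-only INT8 quantization: zero point is fixed at 0
 * and the scale is a single FP32 value, so real_value ~= scale * q. */
static int8_t quantize_int8(float x, float scale)
{
    float q = roundf(x / scale);
    if (q > 127.0f)  q = 127.0f;  /* clamp to the signed 8-bit range */
    if (q < -128.0f) q = -128.0f;
    return (int8_t)q;
}

static float dequantize_int8(int8_t q, float scale)
{
    return scale * (float)q;
}

int main(void)
{
    const float scale = 0.05f;          /* scalar FP32 scale, as required above */
    float x = 1.337f;
    int8_t q = quantize_int8(x, scale); /* 1.337 / 0.05 = 26.74 -> 27 */
    printf("%f -> %d -> %f\n", x, q, dequantize_int8(q, scale));
    return 0;
}
```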
Beyond these compute kernels, pva-onnx-compiler also handles memory operations such as slice, transpose, and concat, and eliminates them on a best-effort basis when fusion is possible.
Performance#
Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.
| Network | InputSize | InputDataType | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|---|
| swin_t | 3x224x224 | S8 | 30.701 ms | 2.345 ms | 12.705 W |
Compatibility#
Supported only on the Orin platform.