DLInference#

Overview#

The DLInference operator enables users of PVA Solutions to run ONNX models. It is a thin wrapper around the network function generated by pva-onnx-compiler, an MLIR-based tool that converts an ONNX model into a C function. This function configures all of the tasks needed to run the model, which are encapsulated in the C struct pva_dl_lib_engine_t.
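The generated entry points depend on the model, so the sketch below only illustrates the general flow, assuming hypothetical symbols: the application holds a pva_dl_lib_engine_t, asks the generated code to configure it, and then submits it for execution. The header and function names (my_model_generated.h, my_model_configure, my_model_execute) are illustrative placeholders, not part of the actual PVA API; consult the code emitted by pva-onnx-compiler for the real symbols.

```c
/* Minimal usage sketch. Only pva_dl_lib_engine_t is named in the text above;
 * everything prefixed with my_model_ is a hypothetical placeholder. */
#include <stdint.h>
#include "my_model_generated.h"  /* hypothetical header emitted by pva-onnx-compiler */

int run_inference(const int8_t *input, int8_t *output)
{
    pva_dl_lib_engine_t engine;  /* holds all tasks needed to run the model */

    /* Hypothetical generated call that configures the engine's tasks
     * for the given input/output buffers. */
    if (my_model_configure(&engine, input, output) != 0) {
        return -1;
    }

    /* Hypothetical call that runs the configured tasks on the PVA. */
    return my_model_execute(&engine);
}
```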

Layer Support and Limitations#

Compute Kernels#

Layer           | Data Type        | Limitations
----------------|------------------|----------------------------------------------------
Conv2D          | INT8, FP16, FP32 | Number of output channels must be a multiple of 64
Gemm/Matmul     | INT8, FP32       |
Add             | INT8, FP32       |
MaxPool2D       | INT8, FP16       | Kernel size = 3x3 or 5x5
AveragePool2D   | INT8, FP32       |
Quantize        | FP32             | Scale must be scalar
Dequantize      | INT8             | Scale must be scalar
Layernorm       | FP32             |
Self-attention  | INT8             | Head size must be 32

The INT8 Conv2D/Gemm kernels use TensorRT’s INT8 quantization scheme, with FP32 scales.
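TensorRT-style INT8 quantization is symmetric: a real value maps to int8 through a single FP32 scale with no zero point. The sketch below illustrates that mapping; the exact rounding mode and saturation bounds used by the kernels are not specified here, so treat it as illustrative rather than as the kernels' implementation.

```c
#include <math.h>
#include <stdint.h>

/* Illustrative symmetric INT8 quantization with an FP32 scale and no
 * zero point; rounding and saturation conventions may differ from the
 * actual kernels. */
static int8_t quantize_int8(float x, float scale)
{
    float q = roundf(x / scale);
    if (q >  127.0f) q =  127.0f;
    if (q < -128.0f) q = -128.0f;
    return (int8_t)q;
}

static float dequantize_int8(int8_t q, float scale)
{
    return (float)q * scale;
}
```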

In addition to these compute kernels, pva-onnx-compiler also handles memory operations such as Slice, Transpose, and Concat. Where fusion is possible, it eliminates these memory operations on a best-effort basis.

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Each PVA contains two VPU cores that can operate in parallel, either processing two streams simultaneously or reducing execution time by approximately half by splitting the workload between them.
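As a rough illustration using the table below: the swin_t measurement of 30.701 ms on a single VPU core would drop to roughly 15.4 ms if the workload were split evenly across both cores.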

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

Network | Input Size | Input Data Type | Execution Time | Submit Latency | Total Power
--------|------------|-----------------|----------------|----------------|------------
swin_t  | 3x224x224  | S8              | 30.701 ms      | 2.345 ms       | 12.705 W

Compatibility#

Supported only on the Orin platform.