DLInference#
Overview#
The DLInference operator enables users of PVA Solutions to run ONNX models.
This operator is a thin wrapper around the network function generated by pva-onnx-compiler,
an MLIR-based tool that converts ONNX models to a C function.
This C function configures all the tasks needed to run the model, encapsulated in a C struct pva_dl_lib_engine_t.
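To make the flow concrete, here is a minimal sketch of how such a generated function might be consumed. Everything below except the struct name pva_dl_lib_engine_t is a hypothetical stand-in; the real names, signatures, and struct layout are produced by pva-onnx-compiler and will differ.

```c
/* Hypothetical sketch only: illustrates the create/run/destroy lifecycle,
 * not the actual generated API. */
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the generated C struct that bundles all configured tasks;
 * the real layout is emitted by pva-onnx-compiler. */
typedef struct {
    int num_tasks; /* illustrative field, not the real layout */
} pva_dl_lib_engine_t;

/* Stand-in for the network function pva-onnx-compiler would emit:
 * it configures every PVA task the model needs and returns them
 * bundled in one engine struct. */
static pva_dl_lib_engine_t *my_model_create_engine(void)
{
    pva_dl_lib_engine_t *engine = malloc(sizeof(*engine));
    if (engine != NULL) {
        engine->num_tasks = 0; /* real code would configure the tasks here */
    }
    return engine;
}

static void my_model_destroy_engine(pva_dl_lib_engine_t *engine)
{
    free(engine);
}

int main(void)
{
    pva_dl_lib_engine_t *engine = my_model_create_engine();
    if (engine == NULL) {
        fprintf(stderr, "engine creation failed\n");
        return 1;
    }
    /* The DLInference operator would take the configured engine from here
     * and submit its tasks for execution. */
    my_model_destroy_engine(engine);
    return 0;
}
```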
Layer Support and Limitations#
| Layer | Data Type | Limitations |
|---|---|---|
| Conv2D | INT8, FP16, FP32 | Number of output channels must be a multiple of 64 |
| Gemm/Matmul | INT8, FP32 | |
| Add | INT8, FP32 | |
| MaxPool2D | INT8, FP16 | Kernel size = 3x3 or 5x5 |
| AveragePool2D | INT8, FP32 | |
| Quantize | FP32 | Scale must be scalar |
| Dequantize | INT8 | Scale must be scalar |
| Layernorm | FP32 | |
| Self-attention | INT8 | Head size must be 32 |
The INT8 Conv2D/Gemm kernels use TensorRT’s INT8 quantization scheme, with FP32 scales.
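For reference, TensorRT-style INT8 quantization is symmetric: the zero point is fixed at 0 and a single FP32 scale maps INT8 codes back to real values, which also matches the scalar-scale requirement on the Quantize/Dequantize layers above. The sketch below is a standalone illustration of that scheme, not code taken from the generated kernels.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Symmetric, scale-only INT8 quantization: zero point is fixed at 0
 * and the scale is a single FP32 value, so real_value ~= scale * q. */
static int8_t quantize_int8(float x, float scale)
{
    float q = roundf(x / scale);
    if (q > 127.0f)  q = 127.0f;  /* clamp to the signed 8-bit range */
    if (q < -128.0f) q = -128.0f;
    return (int8_t)q;
}

static float dequantize_int8(int8_t q, float scale)
{
    return scale * (float)q;
}

int main(void)
{
    const float scale = 0.05f;          /* scalar FP32 scale, as required above */
    float x = 1.337f;
    int8_t q = quantize_int8(x, scale); /* 1.337 / 0.05 = 26.74 -> 27 */
    printf("%f -> %d -> %f\n", x, q, dequantize_int8(q, scale));
    return 0;
}
```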
Beyond these compute kernels, pva-onnx-compiler also handles memory operations such as slice, transpose, and concat, and eliminates them on a best-effort basis when fusion is possible.
Performance#
Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.
| Network | InputSize | InputDataType | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|---|
| swin_t | 3x224x224 | S8 | 30.701 ms | 2.345 ms | 12.705 W |
Compatibility#
Supported only on the Orin platform.