Conv2d
Overview
The Conv2d operator performs general 2D convolution operations with a user-defined filter kernel. It applies a custom kernel to an input image, enabling a wide range of image processing operations such as edge detection, sharpening, blurring, and custom filtering effects. The current implementation supports 3x3, 5x5, and 7x7 kernels.
Algorithm Description
Conv2d is implemented as a convolution of the input image with a kernel of user-defined weights, where \(n\) is the width and height of the kernel (3, 5, or 7), and \(kernel[i, j]\) represents the user-defined kernel coefficients.
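As a reference for the computation described above, the following is a minimal pure-Python sketch of an n x n filter, not the operator's implementation. It assumes the kernel is applied without flipping, that pixels outside the image take the value of the nearest edge pixel, and that no rounding, scaling, or saturation is applied; none of these choices is stated in this document. The function name `conv2d_reference` and the sample kernel are illustrative only.

```python
def conv2d_reference(image, kernel):
    """Apply an n x n kernel to a 2D image given as a list of rows.

    Assumptions (not taken from the operator documentation):
    - the kernel is applied without flipping (correlation form),
    - pixels outside the image are replaced by the nearest edge pixel,
    - no rounding, scaling, or saturation is applied to the result.
    """
    n = len(kernel)
    if n not in (3, 5, 7) or any(len(row) != n for row in kernel):
        raise ValueError("kernel must be 3x3, 5x5, or 7x7")
    height, width = len(image), len(image[0])
    radius = n // 2
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0
            for i in range(n):
                # Clamp the source row index to replicate the border.
                sy = min(max(y + i - radius, 0), height - 1)
                for j in range(n):
                    # Clamp the source column index in the same way.
                    sx = min(max(x + j - radius, 0), width - 1)
                    acc += kernel[i][j] * image[sy][sx]
            output[y][x] = acc
    return output


# A common 3x3 sharpening kernel; its coefficients sum to 1,
# so a constant image passes through unchanged.
SHARPEN_3X3 = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]

flat = [[10] * 4 for _ in range(4)]
result = conv2d_reference(flat, SHARPEN_3X3)
```

Swapping in a different coefficient matrix (for example a box blur or an edge-detection kernel) changes the effect without changing the loop structure, which is the sense in which the operator is "general".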
Implementation Details
Kernel Reformatting
The Conv2d operator internally reformats the kernel coefficients into a format optimized for the hardware convolution instructions. This reformatting is automatically performed by the operator and is transparent to the user.
The reformatting process involves:
- Aligning the kernel width to a multiple of 4 (padding with zeros)
- Creating two kernel variants with different row padding
- Interleaving the kernel data for optimal memory access patterns
For uint8/int8 data types, the kernel is reformatted into structures suitable for the vfilt4x2x2 instruction.
For uint16/int16 data types, the kernel is reformatted into structures suitable for the vfilt4x2 instruction.
Refer to the Conv2d primitive documentation for detailed information on how the reformatted kernel is used in the convolution computation.
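The first reformatting step, aligning the kernel width to a multiple of 4, can be illustrated with a short sketch. The helper name `pad_kernel_width` is hypothetical, and placing the zeros at the end of each row is an assumption; the row-padded variants and the interleaved layout are defined by the Conv2d primitive and are not sketched here.

```python
def pad_kernel_width(kernel, multiple=4):
    """Zero-pad each kernel row so its length is a multiple of `multiple`.

    Assumption: the zeros are appended on the right of each row. The actual
    layout expected by the hardware instructions may differ.
    """
    width = len(kernel[0])
    padded_width = -(-width // multiple) * multiple  # ceiling to next multiple
    return [list(row) + [0] * (padded_width - len(row)) for row in kernel]


# Padded row widths for the three supported kernel sizes.
padded_widths = {}
for n in (3, 5, 7):
    kernel = [[1] * n for _ in range(n)]
    padded_widths[n] = len(pad_kernel_width(kernel)[0])

print(padded_widths)  # {3: 4, 5: 8, 7: 8}
```

Under this assumption, a 3x3 kernel occupies one group of 4 coefficients per row, while 5x5 and 7x7 kernels both occupy two groups.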
Dataflow Configuration
The Conv2d operator uses the same dataflow configuration as the GaussianFilter operator. Refer to the GaussianFilter operator documentation for more details.
Buffer Allocation
The Conv2d operator uses the same buffer allocation as the GaussianFilter operator. Refer to the GaussianFilter operator documentation for more details.
Kernel Implementation
The Conv2d operator uses the Conv2d primitive to perform convolution operations. Refer to the Conv2d primitive documentation for more details on the implementation, including:
- Hardware instruction usage (vfilt4x2x2 for uint8/int8, vfilt4x2 for uint16/int16)
- A detailed explanation of the convolution computation with reformatted kernels
Performance
Execution Time is the average time required to execute the operator on a single VPU core.
Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.
Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.
Idle power is approximately 7W when the PVA is not processing data.
For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.
| ImageSize | DataType | KernelSize | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|---|
| 1920x1080 | U8 | 3x3 | 0.159ms | 0.024ms | 15.369W |
| 1920x1080 | U8 | 5x5 | 0.161ms | 0.023ms | 16.571W |
| 1920x1080 | U8 | 7x7 | 0.167ms | 0.023ms | 17.774W |
| 1920x1080 | S8 | 3x3 | 0.158ms | 0.024ms | 15.772W |
| 1920x1080 | S8 | 5x5 | 0.161ms | 0.024ms | 16.972W |
| 1920x1080 | S8 | 7x7 | 0.166ms | 0.025ms | 17.771W |
| 1920x1080 | U16 | 3x3 | 0.301ms | 0.023ms | 16.173W |
| 1920x1080 | U16 | 5x5 | 0.356ms | 0.023ms | 17.07W |
| 1920x1080 | U16 | 7x7 | 0.457ms | 0.023ms | 16.557W |
| 1920x1080 | S16 | 3x3 | 0.301ms | 0.023ms | 16.574W |
| 1920x1080 | S16 | 5x5 | 0.356ms | 0.025ms | 17.973W |
| 1920x1080 | S16 | 7x7 | 0.457ms | 0.023ms | 16.958W |
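The figures above can be turned into derived quantities with some simple arithmetic. The sketch below is illustrative only: the function and key names are hypothetical, the halved execution time follows the "approximately half" statement for a workload split across both cores, and the energy estimate assumes that each of the two cores processes one full frame during the measured execution time while the module draws the listed total power.

```python
IDLE_POWER_W = 7.0  # approximate idle power stated in this section


def summarize(width, height, exec_time_ms, total_power_w):
    """Derive rough throughput and energy estimates from one table row."""
    pixels = width * height
    exec_time_s = exec_time_ms * 1e-3
    active_power_w = total_power_w - IDLE_POWER_W
    return {
        # Pixels processed per second by a single VPU core.
        "single_core_gpix_per_s": pixels / exec_time_s / 1e9,
        # Approximate time if one frame is split across both cores.
        "split_time_ms": exec_time_ms / 2,
        # Power above idle with both cores busy.
        "active_power_w": active_power_w,
        # Assumption: two frames (one per core) complete per execution time.
        "active_energy_per_frame_mj": active_power_w * exec_time_s / 2 * 1e3,
    }


# Row: 1920x1080, U8, 3x3.
u8_3x3 = summarize(1920, 1080, 0.159, 15.369)
for name, value in u8_3x3.items():
    print(f"{name}: {value:.4f}")
```

These are back-of-the-envelope estimates; the Performance Benchmark page remains the reference for how the measurements were taken.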
Compatibility
Requires PVA SDK 2.6.0 or later.