Conv2d#

Overview#

The Conv2d operator performs general 2D convolution on an input image with a user-defined filter kernel. This enables a wide range of image processing operations, such as edge detection, sharpening, blurring, and custom filtering effects. The current implementation supports 3x3, 5x5, and 7x7 kernels.

Algorithm Description#

Conv2d is implemented as a convolution operation on the input image where the kernel has user-defined weights:

\[dst[x, y] = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} src[x+i-\frac{n-1}{2}, y+j-\frac{n-1}{2}] \cdot kernel[i, j]\]

where \(n\) is the width and height of the kernel (3, 5, or 7), and \(kernel[i, j]\) represents the user-defined kernel coefficients.
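As an illustration of the formula above, here is a minimal pure-Python reference sketch. It is not the operator's implementation. The function name `conv2d_reference` is hypothetical, and several details are assumptions that the text above does not specify: out-of-range source coordinates are clamped to the nearest edge pixel, the first kernel index is treated as the horizontal offset (matching `src[x, y]` in the formula), and no rounding or saturation to the output data type is applied.

```python
def conv2d_reference(src, kernel):
    """Evaluate the Conv2d formula on a 2D list stored as src[y][x].

    Assumptions (not taken from the operator documentation):
    - border pixels are handled by clamping coordinates to the image edge;
    - kernel[i][j] uses i as the horizontal offset and j as the vertical one;
    - the result is returned as plain Python integers, without saturation.
    """
    n = len(kernel)
    if n not in (3, 5, 7) or any(len(row) != n for row in kernel):
        raise ValueError("kernel must be 3x3, 5x5, or 7x7")

    height, width = len(src), len(src[0])
    r = (n - 1) // 2  # kernel radius, the (n-1)/2 term in the formula
    dst = [[0] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            acc = 0
            for i in range(n):
                sx = min(max(x + i - r, 0), width - 1)   # clamp x+i-(n-1)/2
                for j in range(n):
                    sy = min(max(y + j - r, 0), height - 1)  # clamp y+j-(n-1)/2
                    acc += src[sy][sx] * kernel[i][j]
            dst[y][x] = acc
    return dst


# Usage: a kernel with a single 1 at its center leaves the image unchanged.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
result = conv2d_reference(image, identity)
```

A sketch like this is mainly useful for checking kernel coefficients on small images before running them on hardware.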

Implementation Details#

Kernel Reformatting#

The Conv2d operator internally reformats the kernel coefficients into a format optimized for the hardware convolution instructions. This reformatting is automatically performed by the operator and is transparent to the user.

The reformatting process involves:

Aligning kernel width to a multiple of 4 (padding with zeros)
Creating two kernel variants with different row padding
Interleaving the kernel data for optimal memory access patterns

For uint8/int8 data types, the kernel is reformatted into structures suitable for the vfilt4x2x2 instruction.

For uint16/int16 data types, the kernel is reformatted into structures suitable for the vfilt4x2 instruction.
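The first reformatting step, aligning the kernel width to a multiple of 4, can be sketched as follows. This is only an illustration under stated assumptions: the helper name `pad_kernel_width` is hypothetical, and the zeros are appended at the end of each row, which may not match where the operator places its padding. The two row-padded variants and the interleaving layout are not described in enough detail here to sketch, so they are left out.

```python
def pad_kernel_width(kernel, multiple=4):
    """Return a copy of kernel with each row zero-padded to a multiple of `multiple`.

    Assumption: padding zeros are appended on the right of each row. The
    operator's actual padding position is not documented in this section.
    """
    n = len(kernel[0])
    padded_width = -(-n // multiple) * multiple  # round n up to the next multiple
    return [list(row) + [0] * (padded_width - len(row)) for row in kernel]


# Usage: a 3x3 kernel becomes 3 rows of 4 coefficients.
sharpen = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
padded = pad_kernel_width(sharpen)
```

Under this scheme a 3x3 kernel is padded to a width of 4, and 5x5 and 7x7 kernels are both padded to a width of 8.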

Refer to the Conv2d primitive documentation for detailed information on how the reformatted kernel is used in the convolution computation.

Dataflow Configuration#

The Conv2d operator uses the same dataflow configuration as the GaussianFilter operator. Refer to the GaussianFilter operator documentation for details.

Buffer Allocation#

The Conv2d operator uses the same buffer allocation as the GaussianFilter operator. Refer to the GaussianFilter operator documentation for details.

Kernel Implementation#

The Conv2d operator uses the Conv2d primitive to perform the convolution. Refer to the Conv2d primitive documentation for more details on the implementation, including:

Hardware instruction usage (vfilt4x2x2 for uint8/int8, vfilt4x2 for uint16/int16)
Detailed explanation of the convolution computation with reformatted kernels

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores. They can operate in parallel to process two streams simultaneously, or split a single workload between them to reduce execution time by approximately half.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

ImageSize	DataType	KernelSize	Execution Time	Submit Latency	Total Power
1920x1080	U8	3x3	0.159ms	0.024ms	15.369W
1920x1080	U8	5x5	0.161ms	0.023ms	16.571W
1920x1080	U8	7x7	0.167ms	0.023ms	17.774W
1920x1080	S8	3x3	0.158ms	0.024ms	15.772W
1920x1080	S8	5x5	0.161ms	0.024ms	16.972W
1920x1080	S8	7x7	0.166ms	0.025ms	17.771W
1920x1080	U16	3x3	0.301ms	0.023ms	16.173W
1920x1080	U16	5x5	0.356ms	0.023ms	17.07W
1920x1080	U16	7x7	0.457ms	0.023ms	16.557W
1920x1080	S16	3x3	0.301ms	0.023ms	16.574W
1920x1080	S16	5x5	0.356ms	0.025ms	17.973W
1920x1080	S16	7x7	0.457ms	0.023ms	16.958W
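To compare rows of the table, the execution times can be converted into pixel throughput. The helpers below are a small sketch of that arithmetic; the function names are hypothetical and not part of the SDK. The two-core figure simply applies the "approximately half" rule of thumb stated above, so it is an estimate rather than a measurement.

```python
def pixel_throughput_gpix_per_s(width, height, execution_time_ms):
    """Pixels processed per second on one VPU core, in gigapixels per second."""
    pixels = width * height
    seconds = execution_time_ms * 1e-3
    return pixels / seconds / 1e9


def estimated_two_core_time_ms(execution_time_ms):
    """Rough estimate when the workload is split across both VPU cores.

    Assumption: execution time is exactly halved; the text only says
    "approximately half", so real results may differ.
    """
    return execution_time_ms / 2


# Usage with the U8 3x3 and U16 7x7 rows of the table (single-core times).
u8_3x3 = pixel_throughput_gpix_per_s(1920, 1080, 0.159)
u16_7x7 = pixel_throughput_gpix_per_s(1920, 1080, 0.457)
u16_7x7_split_ms = estimated_two_core_time_ms(0.457)
```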

Compatibility#

Requires PVA SDK 2.6.0 or later.