Morphology

Overview

Morphology algorithms perform two-dimensional (2D) filtering operations on input images using 2D boolean kernels. The current implementation supports 3x3 and 5x5 kernels.

Erode Filter: Returns the minimum value of the pixels masked by the boolean kernel. This operation is typically used to reduce the size of foreground regions in binary images.
Dilate Filter: Returns the maximum value of the pixels masked by the boolean kernel. This operation is commonly used to increase the size of foreground regions in binary images.

The following kernel types are supported:

Rectangle Kernel: All elements are set to 1. Available for both 3x3 and 5x5 kernel sizes.
Cross Kernel: The four corner elements are set to 0 and the remaining elements are set to 1. Available for the 3x3 kernel size.
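
The filter and kernel definitions above can be sketched as a scalar reference. This is an illustrative model only, not the VPU implementation: the function names (`rectangle_kernel`, `cross_kernel_3x3`, `morph`, `erode`, `dilate`) are hypothetical, only a constant border is modeled, and the default border values assume U8 data.

```python
def rectangle_kernel(size):
    """All-ones boolean kernel (the text above lists 3x3 and 5x5)."""
    return [[1] * size for _ in range(size)]


def cross_kernel_3x3():
    """3x3 boolean kernel with the four corners cleared."""
    return [[0, 1, 0],
            [1, 1, 1],
            [0, 1, 0]]


def morph(image, kernel, reduce_fn, border_value):
    """Apply reduce_fn (min or max) over the pixels selected by the kernel.

    Pixels outside the image are replaced by border_value (constant border).
    """
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            selected = []
            for ky, kernel_row in enumerate(kernel):
                for kx, enabled in enumerate(kernel_row):
                    if not enabled:
                        continue  # masked-out kernel element
                    sy, sx = y + ky - radius, x + kx - radius
                    inside = 0 <= sy < height and 0 <= sx < width
                    selected.append(image[sy][sx] if inside else border_value)
            out[y][x] = reduce_fn(selected)
    return out


def erode(image, kernel, border_value=255):
    # Assumption: 255 is used as the border so it never lowers a U8 minimum.
    return morph(image, kernel, min, border_value)


def dilate(image, kernel, border_value=0):
    # Assumption: 0 is used as the border so it never raises a U8 maximum.
    return morph(image, kernel, max, border_value)
```

Dilating a single bright pixel with the rectangle kernel produces a 3x3 block, while the cross kernel produces a plus shape of five pixels, which makes the difference between the two kernel types easy to see.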

Multiple back-to-back morphological operations can be performed to achieve larger kernel sizes. For example, applying a 3x3 rectangle kernel followed by a 5x5 rectangle kernel has the same effect as a single 7x7 rectangle kernel. Additionally, applying the dilate filter after the erode filter performs a morphological opening, which removes small objects from the foreground.
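
The chaining behaviour can be checked with a small scalar model. The sketch below assumes rectangle kernels, U8 data, and a constant border that is neutral for each operation (255 for erode, 0 for dilate); all function names are hypothetical and the code does not reflect the operator's API.

```python
def morph(image, size, reduce_fn, border_value):
    """Rectangle-kernel morphology with a constant border (scalar reference)."""
    height, width = len(image), len(image[0])
    radius = size // 2

    def pixel(y, x):
        inside = 0 <= y < height and 0 <= x < width
        return image[y][x] if inside else border_value

    return [[reduce_fn(pixel(y + dy, x + dx)
                       for dy in range(-radius, radius + 1)
                       for dx in range(-radius, radius + 1))
             for x in range(width)]
            for y in range(height)]


def erode(image, size):
    return morph(image, size, min, 255)  # 255 is neutral for min on U8


def dilate(image, size):
    return morph(image, size, max, 0)    # 0 is neutral for max on U8


def opening(image, size):
    """Erode, then dilate with the same kernel size."""
    return dilate(erode(image, size), size)


# A 10x10 image with one isolated bright pixel and a 4x4 bright square.
image = [[0] * 10 for _ in range(10)]
image[1][1] = 255
for y in range(4, 8):
    for x in range(4, 8):
        image[y][x] = 255

chained = erode(erode(image, 3), 5)  # 3x3 followed by 5x5
direct = erode(image, 7)             # single 7x7 pass, for comparison
opened = opening(image, 3)           # isolated pixel removed, square kept
```

In this model `chained` and `direct` are identical, and `opened` differs from `image` only at the isolated pixel.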

Implementation Details

Dataflow Configuration

The RasterDataFlow (RDF) with halo and circular buffer is an ideal fit for the Morphology operator, as it processes pixels in 2D sliding windows. The RDF halo API is used to configure the boundary padding and the number of overlapped pixels for the source RDF. The overlap is half of the mask width, rounded down: 1 pixel for a 3x3 mask and 2 pixels for a 5x5 mask. RDF automatically handles the padding of border pixels based on the borderMode and borderValue parameters.
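
To make the halo arithmetic concrete, the following host-side sketch models what a tile with halo looks like. It is not the RDF API: `halo_size`, `pad_constant`, and `tile_with_halo` are hypothetical helpers, and only a constant border (a single border value) is modeled.

```python
def halo_size(mask_width):
    """Overlap needed on each side of a tile: half the mask width, rounded down."""
    return mask_width // 2


def pad_constant(image, halo, border_value):
    """Surround the image with `halo` pixels of border_value on every side."""
    padded_width = len(image[0]) + 2 * halo
    top = [[border_value] * padded_width for _ in range(halo)]
    body = [[border_value] * halo + list(row) + [border_value] * halo
            for row in image]
    bottom = [[border_value] * padded_width for _ in range(halo)]
    return top + body + bottom


def tile_with_halo(padded, tile_x, tile_y, tile_w, tile_h, halo):
    """Return the input window that feeds one tile_w x tile_h output tile.

    tile_x and tile_y are the output tile origin in unpadded image coordinates,
    so the window is (tile_h + 2*halo) rows by (tile_w + 2*halo) columns.
    """
    return [row[tile_x:tile_x + tile_w + 2 * halo]
            for row in padded[tile_y:tile_y + tile_h + 2 * halo]]
```

Horizontally adjacent tiles produced this way share `2 * halo` input columns, which is the overlap the halo configuration accounts for.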

Loop Configuration

The VPU features Address Generator (AGEN) hardware for multi-dimensional address generation, enabling efficient looping and address updates with zero overhead. Load and store AGENs are configured once per task, and the base address is updated for each tile. The Morphology operator uses 4-dimensional AGENs for loading pixels and 3-dimensional AGENs for storing pixels. The RDF Acquire/Release semantics conveniently provide the base address of the transferred tile for AGEN updates. The input AGEN is configured to load data from the circular RDF buffer. The cupvaRasterDataFlowGetCbLen and cupvaRasterDataFlowGetLinePitch APIs are used to get the circular buffer length and line pitch, respectively, for configuring the input AGEN.
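
As a conceptual illustration of multi-dimensional address generation over a circular buffer, the sketch below enumerates addresses for a nested loop and wraps them at the buffer length. It is a simplified model under stated assumptions, not the AGEN register interface: `agen_addresses` and its parameters are hypothetical, addresses are in element units, and the line pitch and circular buffer length are plain integers standing in for the values the APIs named above would return. A 2-dimensional case is shown for brevity; the same pattern extends to more dimensions.

```python
from itertools import product


def agen_addresses(base, counts, strides, cb_start=None, cb_len=None):
    """Yield addresses for a nested loop.

    counts[0] and strides[0] describe the innermost dimension. If cb_len is
    given, addresses wrap inside [cb_start, cb_start + cb_len).
    """
    # product() varies its last range fastest, so list dimensions outermost first.
    for idx in product(*(range(n) for n in reversed(counts))):
        # idx is ordered outermost..innermost; reverse it to line up with strides.
        offset = sum(i * s for i, s in zip(reversed(idx), strides))
        addr = base + offset
        if cb_len is not None:
            addr = cb_start + (addr - cb_start) % cb_len
        yield addr


# Hypothetical numbers: a 4-line circular buffer with a line pitch of 8 elements.
line_pitch = 8
cb_len = 4 * line_pitch
tile_base = 3 * line_pitch  # tile starts on the last line of the buffer

addresses = list(agen_addresses(tile_base,
                                counts=(4, 3),            # 4 pixels, 3 rows
                                strides=(1, line_pitch),
                                cb_start=0, cb_len=cb_len))
```

Here the first row is read from the last buffer line and the next two rows wrap around to the start of the buffer, which is why the circular buffer length is needed in addition to the line pitch.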

Vectorization

The VPU features vector min3/max3 instructions that are highly efficient for morphological operations. These instructions allow the minimum or maximum of three pixel values to be computed in parallel across all lanes of the source vector inputs. This translates to 32 parallel 3-way min/max operations for 8-bit data and 16 parallel 3-way min/max operations for 16-bit data per cycle.
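
The lane-wise behaviour of these instructions can be emulated on the host as follows. This is a scalar emulation for illustration only: `vmin3`, `vmax3`, and `horizontal_min3` are hypothetical names, not VPU intrinsics, and the lane counts are the ones quoted above.

```python
LANES_8BIT = 32   # lanes per vector for 8-bit data, as quoted in the text
LANES_16BIT = 16  # lanes per vector for 16-bit data, as quoted in the text


def vmin3(a, b, c):
    """Lane-wise minimum of three equal-length vectors."""
    return [min(x, y, z) for x, y, z in zip(a, b, c)]


def vmax3(a, b, c):
    """Lane-wise maximum of three equal-length vectors."""
    return [max(x, y, z) for x, y, z in zip(a, b, c)]


def horizontal_min3(row, lanes=LANES_8BIT):
    """3-tap horizontal minimum for `lanes` outputs.

    Uses three views of the same row shifted by 0, 1, and 2 pixels, so one
    vmin3 call produces every output in the vector at once.
    """
    left, center, right = row[0:lanes], row[1:lanes + 1], row[2:lanes + 2]
    return vmin3(left, center, right)


row = [(37 * i) % 251 for i in range(LANES_8BIT + 2)]
out = horizontal_min3(row)
```

Each element of `out` is the minimum of three neighbouring pixels, which is the horizontal half of a 3x3 rectangle erode.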

Each iteration of a 3x3 rectangle kernel involves deinterleaving double vector loads to obtain a 4x4 block of single vector input data. This is used to compute a 2x2 block of single vector output data. Loading 4x4 blocks instead of 3x3 blocks helps reduce the number of load and math operations. First, min3/max3 values are computed in the horizontal direction. Then, min3/max3 of horizontal outputs are computed for the vertical direction. When computing the second output row, the horizontal min3/max3 computations for the 2nd and 3rd input rows are not repeated. These values are already computed for the first output row, thus reducing the number of operations.
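
The reuse of horizontal results can be sketched with scalars standing in for vectors. This is a simplified model: each element of the 4x4 block below is a single value, whereas the text describes each element as a vector, and the deinterleaved load layout is not modeled. `Min3Counter` and `erode_block_2x2` are hypothetical names.

```python
class Min3Counter:
    """min3 stand-in that counts how many times it is invoked."""

    def __init__(self):
        self.calls = 0

    def __call__(self, a, b, c):
        self.calls += 1
        return min(a, b, c)


def erode_block_2x2(block, min3):
    """Compute the 2x2 erode outputs of a 3x3 rectangle kernel from a 4x4 block."""
    # Horizontal pass: two 3-wide minima per input row (8 min3 calls in total).
    horiz = [[min3(row[0], row[1], row[2]), min3(row[1], row[2], row[3])]
             for row in block]
    # Vertical pass: output row 0 combines input rows 0..2, output row 1
    # combines rows 1..3. The horizontal results for rows 1 and 2 are shared
    # by both output rows instead of being recomputed (4 min3 calls).
    return [[min3(horiz[r][c], horiz[r + 1][c], horiz[r + 2][c])
             for c in range(2)]
            for r in range(2)]


block = [[9, 4, 7, 8],
         [6, 5, 3, 9],
         [8, 7, 6, 5],
         [2, 9, 9, 9]]
counter = Min3Counter()
result = erode_block_2x2(block, counter)
```

In this model the 2x2 block costs 12 min3 operations. Computing each of the four outputs independently with the same separable scheme would take 3 horizontal and 1 vertical min3 per output, or 16 in total.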

The chess_loop_range() and chess_prepare_for_pipelining() pragmas provide the compiler with additional information about the loop’s behavior, enabling it to generate a more optimal schedule.

To avoid the need for separate kernel implementations for different data types, we use macros to select vector types at compile time based on the data type.

Performance

Image Size	Operation	Format	Mask Size	Execution Time	Submit Latency	Total Power
1920x1080	ERODE	U8	3x3	0.160ms	0.019ms	16.042W
1920x1080	ERODE	U8	5x5	0.264ms	0.018ms	14.848W
1920x1080	ERODE	S16	3x3	0.306ms	0.018ms	16.704W
1920x1080	ERODE	S16	5x5	0.493ms	0.018ms	15.039W
1920x1080	DILATE	U8	3x3	0.160ms	0.019ms	16.423W
1920x1080	DILATE	U8	5x5	0.264ms	0.020ms	14.757W
1920x1080	DILATE	S16	3x3	0.307ms	0.023ms	16.321W
1920x1080	DILATE	S16	5x5	0.493ms	0.018ms	15.42W

For detailed information on interpreting the performance table above and understanding the benchmarking setup, see Performance Benchmark.