CCL (Connected Component Labeling)#

Overview#

Connected Component Labeling (CCL) scans a binary or thresholded image and assigns a unique label ID to each group of connected foreground pixels. The output is a label map where all pixels belonging to the same component share the same integer label.

  • Treats all pixels with value > threshold (or nonzero) as foreground and others as background.

  • Groups foreground pixels into connected components according to the chosen connectivity and assigns each component a positive label. Output labels are not required to be sequentially packed, only consistent within each component.

  • Supports 4-connectivity and 8-connectivity:

    • 4-connectivity: A pixel is connected to its horizontally or vertically adjacent pixels.

    • 8-connectivity: A pixel is connected to its horizontally, vertically, or diagonally adjacent pixels.

  • Background pixels are labeled as 0.

Implementation Details#

Limitations#

  1. This implementation only supports U8 images. Pixels with values greater than the threshold are treated as foreground, and pixels not greater than the threshold are treated as background. If the input image is already binary, set the threshold to 0.

  2. This implementation only supports U16 output label maps. It allocates one 128KB super bank in VMEM to store the label map, thus supporting up to 65536 labels, where label 0 is reserved for background. If the intermediate label count exceeds 65536, the output number of labels returns -1.

  3. Output labels are not required to be sequentially packed, and the number of labels (excluding label 0) is returned.

Processing Steps#

The implementation is organized into three memory passes (steps).

In the first step, provisional labels are assigned to each pixel. The input image is reformatted by setting the line pitch to the real line pitch minus 2, introducing a skew of -2 pixels per row. After reformatting, pixels are loaded in a transposed fashion and labeled using vector operations according to either 4-connectivity or 8-connectivity. During this process, label equivalences are recorded in a label map stored in a dedicated super bank.

The second step collapses the label-equivalence tree in the label map. The label map is initially filled with an ascending sequence 0, 1, 2, …, 65535, i.e., each entry is initialized to its own index. A DLUT (Decoupled Lookup Unit) is then used to iteratively flatten the equivalence tree until every label points directly to its root label. The number of root labels is counted, which corresponds to the number of final labels.

The third step is relabeling. It reformats the label map to remove the -2 skew, then uses the DLUT again to look up each label’s root label and assign that root label as the final output.

Dataflows#

Four RasterDataFlows are used for the input image, output label map, and intermediate results. RasterDataFlow supports images whose dimensions are not exact multiples of the tile size. The input raster dataflows write into circular buffers in VMEM, while the output raster dataflows read from double buffers in VMEM.

The input raster dataflow applies horizontal padding of 64 pixels on each side with a padding value of 0, and no vertical padding. Because a -2 row skew is applied to the input image, the valid width of the intermediate label image becomes the original image width plus 64. The tile buffer in VMEM is accessed with transposed load/store, so the output intermediate label image dataflow sets trans_mode_1.

For the intermediate label image’s input raster dataflow, the region of interest (ROI) starts at an x-offset of 64 to exclude the 64-pixel horizontal halo. After reformatting, the final output label map has the same dimensions as the original input image.

A single SequenceDataFlow is used to output the number of labels; it transfers just one element.

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores. Idle power is approximately 7W when the PVA is not processing data.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

ImageSize

Layout

Connectivity

Execution Time

Submit Latency

Total Power

768x512

CHW

4

1.812ms

0.026ms

9.132W

768x512

CHW

8

1.700ms

0.027ms

9.134W