ImageHistogram

Overview

The image histogram operator calculates the histogram of a grayscale image, where the histogram represents the frequency distribution of pixel intensity values. The operator accepts the start and end of the intensity range and the number of bins as parameters, so the histogram can be restricted to a sub-range of intensities and computed at a chosen bin resolution.

Algorithm Description

The algorithm partitions the intensity range [start, end) into numBins equal-width bins and counts the pixels that fall into each bin. A pixel with intensity \(i\) increments histogram bin \(b_i\), where

\[b_i = \left\lfloor \frac{(i - \text{start}) \times \text{numBins}}{\text{end} - \text{start}} \right\rfloor\]
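As a point of reference, the bin formula can be sketched in a few lines of plain Python. This is a minimal illustration, not the PVA kernel: the function name `image_histogram` is hypothetical, the bin index is taken as the integer part of the expression, and intensities outside [start, end) are skipped, matching the range check described under Kernel Implementation.

```python
def image_histogram(pixels, start, end, num_bins):
    """Count pixels per bin; intensities outside [start, end) are ignored."""
    bins = [0] * num_bins
    for i in pixels:
        if start <= i < end:
            # Integer (floor) form of (i - start) * numBins / (end - start).
            bins[(i - start) * num_bins // (end - start)] += 1
    return bins


# Example: 4 bins over [0, 256) are each 64 intensity levels wide.
hist = image_histogram([0, 63, 64, 128, 255, 255], 0, 256, 4)
print(hist)  # [2, 1, 1, 2]
```

Using integer floor division keeps the reference exact for any combination of range and bin count, which makes it a convenient baseline when comparing implementations.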

Implementation Details

Parameters

  • Input data type: u8 or u16.

  • Start: The lower bound of the pixel intensity range to consider for the histogram.

  • End: The upper bound of the pixel intensity range to consider for the histogram.

  • NumBins: The number of bins to divide the intensity range into.

  • Output data type: u32 or s32.

Dataflow Configuration

One RasterDataflow splits the whole image into 128x128 tiles and transfers them from DRAM to VMEM one at a time.

One SequenceDataflow transfers the final histogram bins from VMEM to DRAM at the end of the VPU execution.
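The tiling can be pictured with a small sketch of the index arithmetic. It does not use the RasterDataflow API; `raster_tiles` is a hypothetical helper, and row-major tile order is an assumption. It also reports how many columns and rows of each tile actually lie inside the image, which is the information the horizontal mask in the kernel needs when the width is not a multiple of 128.

```python
TILE = 128  # tile width and height from the dataflow description


def raster_tiles(width, height, tile=TILE):
    """Yield (x0, y0, valid_w, valid_h) for each tile, row by row."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            # Edge tiles may cover fewer than `tile` valid columns or rows.
            yield x0, y0, min(tile, width - x0), min(tile, height - y0)


tiles = list(raster_tiles(1920, 1080))
print(len(tiles))  # 135 tiles: 15 columns x 9 rows
print(tiles[-1])   # (1792, 1024, 128, 56)
```

For a 1920x1080 image the width divides evenly into 15 tiles, while the last row of tiles holds only 56 valid lines.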

Buffer Allocation

Five VMEM buffers are needed:

  • 1 input buffer with double buffering for each tile of the input image pixel intensity.

  • 1 histogram buffer for the output of the histogram operation.

  • 1 output buffer for the final histogram bins result.

  • 1 valid buffer that records whether each pixel intensity is valid.

  • 1 mask buffer that records whether each pixel lies within the image in the horizontal direction.

Kernel Implementation

For each pixel intensity:

  1. Check whether it is inside the range of [start, end).

    • Store the validation value in the valid buffer.

  2. Mask the pixel off if it lies outside the image width in the horizontal direction.

    • Store the validation value in the valid buffer.

    • This step is optional; it handles images whose width is not a multiple of the tile width.

  3. Calculate the bin index according to the pixel intensity value.

    Note

    PVA computes the bin index in fixed point for better performance, whereas the CPU and CUDA implementations use floating point. The final histogram can therefore differ from theirs in some corner cases, for example when (end - start) or numBins takes certain odd values. To avoid mismatches, choose well-behaved values for (end - start) and numBins, such as powers of 2.

  4. Execute the histogram operation with the bin index and the validation value.

    • If a pixel intensity is invalid, its bin index is set to 0 to handle the out-of-range case.

    • The kernel uses the VPU histogram intrinsics vhist_simple_*. The instruction takes indices and weights as sources for a weighted histogram and updates each indexed entry by adding the corresponding weight.
      • For input data type u8, 16-way histogram vhist_simple_16w is used.

      • For input data type u16, the number of ways is limited by VMEM capacity and depends on the number of bins: 16-way histogram vhist_simple_16w for [1, 2048] bins, 8-way histogram vhist_simple_8w for [2049, 4096] bins, 4-way histogram vhist_simple_4w for [4097, 8192] bins, and 2-way histogram vhist_simple_2w for [8193, 16384] bins.

      Note

      A K-way histogram can perform up to K histogram updates per cycle, so bin numbers in [1, 2048] give the best performance. The kernels for bin numbers in [2049, 16384] trade performance for functionality: they support more bins but run slower.

  5. Reduce the output of the histogram operation to get the final histogram bins result.

    • The parallelism granularity of the reduction kernel follows that of the preceding histogram operation.
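The steps above can be sketched in plain Python to show how they fit together. This is an illustration under several assumptions, not the VPU kernel: all function names are hypothetical, each way is assumed to keep its own partial histogram, the validation value is assumed to be used as the weight, and the Q16 scale in `bin_index_q16` is an assumed fixed-point format chosen only to show how a truncated scale can shift a pixel into a neighboring bin. Step 2 (horizontal masking) is omitted.

```python
def select_ways(num_bins):
    """Way count for u16 input, following the bin-number ranges listed above."""
    for ways, max_bins in ((16, 2048), (8, 4096), (4, 8192), (2, 16384)):
        if 1 <= num_bins <= max_bins:
            return ways
    raise ValueError("num_bins must be in [1, 16384]")


def bin_index_q16(i, start, end, num_bins):
    """Bin index via a truncated Q16 scale (assumed format, not PVA's)."""
    scale = (num_bins << 16) // (end - start)
    return ((i - start) * scale) >> 16


def kway_histogram(pixels, start, end, num_bins, ways, index_fn=None):
    """Steps 1, 3, 4 and 5 for a flat list of pixel intensities."""
    if index_fn is None:
        # Exact integer floor of the bin formula.
        index_fn = lambda i: (i - start) * num_bins // (end - start)
    # Assumption: one partial histogram per way.
    partial = [[0] * num_bins for _ in range(ways)]
    for pos, i in enumerate(pixels):
        valid = 1 if start <= i < end else 0   # step 1: range check
        idx = index_fn(i) if valid else 0      # steps 3 and 4: invalid -> bin 0
        partial[pos % ways][idx] += valid      # weight 0 leaves bin 0 unchanged
    # Step 5: reduce the partial histograms into the final bins.
    return [sum(column) for column in zip(*partial)]


pixels = list(range(9)) + [9, 200]  # 9 and 200 fall outside [0, 9)
exact = kway_histogram(pixels, 0, 9, 3, ways=4)
fixed = kway_histogram(pixels, 0, 9, 3, ways=4,
                       index_fn=lambda i: bin_index_q16(i, 0, 9, 3))
print(exact, fixed)  # [3, 3, 3] [4, 3, 2]
```

With a range of 9 and 3 bins the truncated scale moves the pixels with intensity 3 and 6 one bin lower, while a power-of-2 configuration such as a range of 256 with 64 bins gives identical results in both variants. For the u16 ranges listed above, the product of way count and maximum bin count is 32768 in every case, which is consistent with a fixed-size histogram buffer shared among the ways.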

Performance

ImageSize   DataType   Start/End   NumBins   Execution Time   Submit Latency   Total Power
1920x1080   U8         0/256       64        0.217ms          0.017ms          12.616W
1920x1080   U8         0/256       128       0.218ms          0.017ms          12.232W
1920x1080   U8         0/256       256       0.219ms          0.014ms          12.716W
1920x1080   U16        0/65536     64        0.317ms          0.016ms          12.915W
1920x1080   U16        0/65536     128       0.318ms          0.015ms          12.915W
1920x1080   U16        0/65536     2048      0.334ms          0.014ms          12.915W

For detailed information on interpreting the performance table above and understanding the benchmarking setup, see Performance Benchmark.
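As a rough way to compare rows, the reported execution times can be converted into pixel throughput. The sketch below only restates the table values for a 1920x1080 image; the helper name `throughput_gpix` is hypothetical, and the result ignores submit latency.

```python
PIXELS = 1920 * 1080  # pixels per image in the table

# (data type, number of bins, execution time in ms), copied from the table.
ROWS = [
    ("U8", 64, 0.217), ("U8", 128, 0.218), ("U8", 256, 0.219),
    ("U16", 64, 0.317), ("U16", 128, 0.318), ("U16", 2048, 0.334),
]


def throughput_gpix(exec_ms, pixels=PIXELS):
    """Pixels processed per second, in units of 1e9 pixels."""
    return pixels / (exec_ms * 1e-3) / 1e9


for dtype, bins, ms in ROWS:
    print(f"{dtype:>3} {bins:>5} bins: {throughput_gpix(ms):.2f} Gpixel/s")
```

Read this way, the U8 rows come out at roughly 9.5 Gpixel/s and the U16 rows at roughly 6.2 to 6.5 Gpixel/s, with the bin count having only a small effect within each data type.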