ImageStats#

Overview#

The ImageStats operator performs a two-dimensional (2D) operation on the input image to compute various statistics. Supported statistics are the pixel count and the per-channel sum, mean, variance, and covariance. Variance and covariance are stored together in a single 4x4 covariance matrix, sized for the maximum of 4 channels: the diagonal elements hold the variances and the off-diagonal elements hold the covariances.

A flag parameter specifies which statistics are computed. An optional mask image can be provided to select the pixels that contribute to the statistics. Supported image formats are U8, RGB8, BGR8, NV12, and NV12_ER.

Implementation#

Limitations#

For image formats with subsampled chroma channels, such as NV12 and NV12_ER, the image width and height must be even numbers.

This is because the chroma statistics are computed by upsampling the UV plane to the Y plane resolution with a scale factor of 2. With an odd image size, the upsampled UV plane would be 1 pixel wider and taller than the Y plane, which leads to inaccurate statistics.

The product of image width and height must be less than 2^24 (4096 * 4096).

This limit comes from the uint64_t arithmetic in the post-processing stage.

Values in the mask image must be either 0 or 1.

Only binary masks are supported.
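The three limitations above can be checked on the host before an image is submitted. The following is a minimal sketch, not part of the operator's API: the function names `image_stats_dims_ok` and `mask_is_binary` and the `chroma_subsampled` parameter are hypothetical, introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PVA API): checks the documented size limits.
 * chroma_subsampled should be true for NV12 and NV12_ER inputs. */
static bool image_stats_dims_ok(uint32_t width, uint32_t height, bool chroma_subsampled)
{
    /* width * height must be less than 2^24; 64-bit math keeps the check itself safe. */
    if ((uint64_t)width * height >= (1ull << 24)) {
        return false;
    }
    /* Formats with subsampled chroma need even width and height. */
    if (chroma_subsampled && ((width % 2u) != 0u || (height % 2u) != 0u)) {
        return false;
    }
    return true;
}

/* Hypothetical helper (not a PVA API): true if every mask value is 0 or 1. */
static bool mask_is_binary(const uint8_t *mask, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (mask[i] > 1u) {
            return false;
        }
    }
    return true;
}
```

For example, a 1920x1080 NV12 image passes the size check, while a 4096x4096 image does not, because its pixel count equals 2^24 rather than being less than it.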

Flags#

Each statistic has a flag on its own bit, defined in the enum type PVAImageStatFlag:

IMAGE_STAT_FLAG_PIXEL_COUNT = 1u << 0
IMAGE_STAT_FLAG_SUM = 1u << 1
IMAGE_STAT_FLAG_MEAN = 1u << 2 | IMAGE_STAT_FLAG_PIXEL_COUNT | IMAGE_STAT_FLAG_SUM
IMAGE_STAT_FLAG_VARIANCE = 1u << 3 | IMAGE_STAT_FLAG_MEAN
IMAGE_STAT_FLAG_COVARIANCE = 1u << 4 | IMAGE_STAT_FLAG_VARIANCE

As the definition shows, the individual flags have dependencies. The mean flag includes the pixel count and sum flags, the variance flag includes the mean flag, and the covariance flag includes the variance flag. The flags parameter passed to the operator interface can combine these individual flags with bitwise OR.

Because of these dependencies, there are only 6 distinct combinations of individual flags, instead of \(2^5 = 32\). Equivalently, there are only 6 combinations of statistics to be computed:

pixel count
sum
pixel count, sum
pixel count, sum, mean
pixel count, sum, mean, variance
pixel count, sum, mean, variance, covariance
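The count of 6 can be confirmed by enumeration. The sketch below copies the flag values from the definition above; the helper names `stat_enabled` and `count_flag_combinations` are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the definition above. */
typedef enum {
    IMAGE_STAT_FLAG_PIXEL_COUNT = 1u << 0,
    IMAGE_STAT_FLAG_SUM         = 1u << 1,
    IMAGE_STAT_FLAG_MEAN        = 1u << 2 | IMAGE_STAT_FLAG_PIXEL_COUNT | IMAGE_STAT_FLAG_SUM,
    IMAGE_STAT_FLAG_VARIANCE    = 1u << 3 | IMAGE_STAT_FLAG_MEAN,
    IMAGE_STAT_FLAG_COVARIANCE  = 1u << 4 | IMAGE_STAT_FLAG_VARIANCE
} PVAImageStatFlag;

/* Hypothetical helper: a statistic is enabled when all of its bits are set. */
static bool stat_enabled(uint32_t flags, uint32_t stat)
{
    return (flags & stat) == stat;
}

/* Hypothetical helper: ORs every non-empty subset of the five flags and
 * counts how many distinct values result. */
static int count_flag_combinations(void)
{
    static const uint32_t all_flags[5] = {
        IMAGE_STAT_FLAG_PIXEL_COUNT, IMAGE_STAT_FLAG_SUM, IMAGE_STAT_FLAG_MEAN,
        IMAGE_STAT_FLAG_VARIANCE, IMAGE_STAT_FLAG_COVARIANCE
    };
    uint8_t seen[32] = {0};
    int distinct = 0;

    for (uint32_t subset = 1u; subset < 32u; ++subset) {
        uint32_t combined = 0u;
        for (int bit = 0; bit < 5; ++bit) {
            if ((subset & (1u << bit)) != 0u) {
                combined |= all_flags[bit];
            }
        }
        if (seen[combined] == 0u) {
            seen[combined] = 1u;
            ++distinct;
        }
    }
    return distinct;
}
```

The distinct combined values are 1, 2, 3, 7, 15, and 31, which correspond to the six lines listed above. Requesting IMAGE_STAT_FLAG_COVARIANCE alone therefore enables every other statistic as well.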

To avoid code duplication, a single device kernel function calculates all the statistics. The flags for the individual statistics are passed to the device kernel function as CFLAGS when building the device executables. The computation for each statistic is enabled only when the corresponding flag is set.

Data Flows#

RasterDataFlow (RDF) is used to read the input image and mask image into the VPU. The tile size is 128x128 pixels. If the image width or height is not a multiple of 128, RDF pads the boundary tiles with 0, so the padded pixels do not contribute to the statistics. No extra handling is therefore needed for the boundary tiles.

SequenceDataFlow (SQDF) is used to move the computed statistics out of the VPU. The output structure for the statistics is defined in the struct PVAImageStatOutput. A 1D tensor of data type U8 and length sizeof(PVAImageStatOutput) stores the statistics. The SQDF is configured with 1 transfer, which moves the 1D tensor to the host.
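The effect of zero padding can be emulated on the host. The sketch below is not the RDF API: it is a plain C loop, with hypothetical function names, that always visits full 128x128 tiles and reads out-of-image pixels as 0.

```c
#include <stddef.h>
#include <stdint.h>

#define TILE_SIZE 128u

/* Number of tiles needed to cover `extent` pixels (ceiling division). */
static uint32_t tiles_along(uint32_t extent)
{
    return (extent + TILE_SIZE - 1u) / TILE_SIZE;
}

/* Host-side emulation of zero padding: pixels outside the image read as 0. */
static uint8_t read_padded(const uint8_t *img, uint32_t width, uint32_t height,
                           uint32_t x, uint32_t y)
{
    if (x < width && y < height) {
        return img[(size_t)y * width + x];
    }
    return 0;
}

/* Sums a single-channel image tile by tile, always visiting full tiles. */
static uint64_t sum_by_tiles(const uint8_t *img, uint32_t width, uint32_t height)
{
    uint64_t sum = 0;
    for (uint32_t ty = 0; ty < tiles_along(height); ++ty) {
        for (uint32_t tx = 0; tx < tiles_along(width); ++tx) {
            for (uint32_t y = 0; y < TILE_SIZE; ++y) {
                for (uint32_t x = 0; x < TILE_SIZE; ++x) {
                    sum += read_padded(img, width, height,
                                       tx * TILE_SIZE + x, ty * TILE_SIZE + y);
                }
            }
        }
    }
    return sum;
}
```

A 1920x1080 image is covered by 15 x 9 tiles. The bottom row of tiles is only partly filled, yet the tiled sum equals the sum over the real pixels because every padded pixel adds 0.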
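On the host, a byte tensor of this kind is typically turned back into a struct by copying the bytes. The sketch below uses a hypothetical stand-in struct, `ExampleStatOutput`, because the fields of PVAImageStatOutput are not listed here; only the copy pattern is the point.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the real output struct; the field layout is
 * an assumption made for this example only. */
typedef struct {
    uint32_t pixel_count;
    float    mean[4];
} ExampleStatOutput;

/* Rebuilds the struct from a U8 tensor of length sizeof(ExampleStatOutput).
 * memcpy avoids alignment and aliasing problems that a pointer cast could cause. */
static ExampleStatOutput unpack_stats(const uint8_t *tensor_bytes)
{
    ExampleStatOutput out;
    memcpy(&out, tensor_bytes, sizeof out);
    return out;
}
```

This assumes the producer and consumer agree on struct layout and byte order, which holds when both sides are built from the same struct definition.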

Mask#

A binary mask can be provided to exclude some pixels from the statistics. The mask image must have the same resolution as the input image. Because conditional branch instructions are costly in VPU kernel functions, we apply the mask to the input image data by vector point-wise multiplication. Pixels with mask value 1 keep their value, while pixels with mask value 0 are set to 0. The statistics can then be calculated in the same way as without a mask, since masked pixels with value 0 have no effect on the accumulated results.

The only exception is the pixel count. Without a mask, the pixel count is obtained directly from the image resolution. With a mask, the number of pixels with mask value 1 must be counted by summing the mask image.
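A scalar sketch of this branch-free masking is shown below. The real kernel uses VPU vector instructions; the struct `MaskedSums` and the function `masked_sums` are hypothetical names used only to show the arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t pixel_count; /* number of pixels with mask value 1 */
    uint64_t sum;         /* sum of the unmasked pixel values */
    uint64_t sum_sq;      /* sum of squares of the unmasked pixel values */
} MaskedSums;

/* Applies a binary mask by multiplication instead of branching. */
static MaskedSums masked_sums(const uint8_t *pixels, const uint8_t *mask, size_t count)
{
    MaskedSums s = {0, 0, 0};
    for (size_t i = 0; i < count; ++i) {
        /* mask[i] is 0 or 1, so the product is either 0 or the pixel itself. */
        uint32_t p = (uint32_t)pixels[i] * mask[i];
        s.pixel_count += mask[i];      /* pixel count is the sum of the mask */
        s.sum         += p;
        s.sum_sq      += (uint64_t)p * p;
    }
    return s;
}
```

With pixels {10, 20, 30, 40} and mask {1, 0, 1, 1}, the pixel count is 3, the sum is 80, and the sum of squares is 2600, the same as if the second pixel had been skipped.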

For NV12 and NV12_ER input images, the UV plane is effectively upsampled by a scale factor of 2, so each U/V pixel corresponds to 4 mask pixels. Multiplying the U/V pixel by each of the 4 mask pixels and processing the results as 4 individual pixels would require many math instructions. To improve performance, we first sum the 4 mask pixels, multiply the U/V pixel by that sum, and then use the result to calculate the UV-only statistics, including the U/V channel sums, variances, and covariance. This way, the UV-only statistics are calculated at the UV plane resolution instead of the Y plane resolution. Because the number of pixels processed drops to 1/4, the number of math instructions is greatly reduced.
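The shortcut works because the mask is binary: for a mask value m in {0, 1}, m * m = m, so the 4 masked copies of a chroma sample contribute w * u to the sum and w * u * v to the sum of products, where w is the sum of the 4 mask values. The scalar sketch below compares this with the brute-force computation at Y plane resolution. It assumes an interleaved UV plane, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t sum_u, sum_v, sum_uu, sum_uv, sum_vv;
} ChromaSums;

/* Fast path: one weighted update per chroma sample.
 * uv is interleaved (U, V) of size cw x ch; mask is (2*cw) x (2*ch). */
static ChromaSums chroma_sums_fast(const uint8_t *uv, const uint8_t *mask,
                                   uint32_t cw, uint32_t ch)
{
    ChromaSums s = {0, 0, 0, 0, 0};
    const size_t mw = (size_t)2u * cw;
    for (uint32_t cy = 0; cy < ch; ++cy) {
        for (uint32_t cx = 0; cx < cw; ++cx) {
            const uint8_t *m0 = mask + (size_t)2u * cy * mw + (size_t)2u * cx;
            const uint8_t *m1 = m0 + mw;
            uint32_t w = (uint32_t)m0[0] + m0[1] + m1[0] + m1[1]; /* 0..4 */
            uint32_t u = uv[2u * ((size_t)cy * cw + cx)];
            uint32_t v = uv[2u * ((size_t)cy * cw + cx) + 1u];
            uint32_t wu = w * u;
            uint32_t wv = w * v;
            s.sum_u  += wu;
            s.sum_v  += wv;
            s.sum_uu += (uint64_t)wu * u;
            s.sum_uv += (uint64_t)wu * v;
            s.sum_vv += (uint64_t)wv * v;
        }
    }
    return s;
}

/* Reference path: upsample U/V to mask resolution and mask every pixel. */
static ChromaSums chroma_sums_reference(const uint8_t *uv, const uint8_t *mask,
                                        uint32_t cw, uint32_t ch)
{
    ChromaSums s = {0, 0, 0, 0, 0};
    const size_t mw = (size_t)2u * cw;
    for (uint32_t y = 0; y < 2u * ch; ++y) {
        for (uint32_t x = 0; x < 2u * cw; ++x) {
            size_t c = (size_t)(y / 2u) * cw + (x / 2u);
            uint32_t m  = mask[(size_t)y * mw + x];
            uint32_t mu = m * uv[2u * c];
            uint32_t mv = m * uv[2u * c + 1u];
            s.sum_u  += mu;
            s.sum_v  += mv;
            s.sum_uu += (uint64_t)mu * mu;
            s.sum_uv += (uint64_t)mu * mv;
            s.sum_vv += (uint64_t)mv * mv;
        }
    }
    return s;
}
```

Both functions return identical sums for any binary mask, but the fast path runs its inner loop once per chroma sample instead of once per luma-resolution pixel.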

Integer Only Loops#

We avoid floating-point arithmetic inside the critical loop for better performance and precision. The loop performs only integer summation and multiplication, and the floating-point operations needed for the mean and the variance/covariance are left to the post-processing stage, which runs only once for the whole image.

Explicitly, for per channel sum and mean, the VPU kernel functions accumulate the per channel sums:

\[\text{Sum}\left(c\right) = \sum_{i, j} \text{Pixel}\left(i, j, c\right)\]

For per channel variance and covariance, the VPU kernel functions accumulate the per channel sum of pixel product:

\[\text{SumProd}\left(c, c'\right) = \sum_{i, j} \text{Pixel}\left(i, j, c\right) * \text{Pixel}\left(i, j, c'\right)\]

Note that for the covariance matrix, we only need to calculate the upper triangular elements; the lower triangular part is obtained by symmetry. In other words, we only compute for \(c \leq c'\) in the above formula.

Within each tile, int32_t registers are used to accumulate these sums, which are then added to the uint64_t global accumulators.
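The two-level accumulation can be sketched in scalar C as follows. This is an illustration under stated assumptions rather than the VPU kernel: pixels are 8-bit and channel-interleaved, and the names `GlobalAccum` and `accumulate_tile` are hypothetical. For 8-bit pixels, a 128x128 tile contributes at most 128 * 128 * 255 * 255 = 1,065,369,600 to any sum of products, which fits in an int32_t.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CHANNELS 4

typedef struct {
    uint64_t sum[MAX_CHANNELS];
    /* Only entries with row <= column (upper triangle) are filled. */
    uint64_t sum_prod[MAX_CHANNELS][MAX_CHANNELS];
} GlobalAccum;

/* Accumulates one tile of interleaved 8-bit pixels.
 * pixel_count must not exceed 128 * 128 so the int32_t tile sums cannot overflow. */
static void accumulate_tile(GlobalAccum *g, const uint8_t *tile,
                            size_t pixel_count, int channels)
{
    int32_t tile_sum[MAX_CHANNELS] = {0};
    int32_t tile_prod[MAX_CHANNELS][MAX_CHANNELS] = {{0}};

    /* Integer-only inner loop. */
    for (size_t i = 0; i < pixel_count; ++i) {
        const uint8_t *p = tile + i * (size_t)channels;
        for (int a = 0; a < channels; ++a) {
            tile_sum[a] += p[a];
            for (int b = a; b < channels; ++b) {
                tile_prod[a][b] += p[a] * p[b];
            }
        }
    }

    /* Fold the per-tile results into the 64-bit global accumulators. */
    for (int a = 0; a < channels; ++a) {
        g->sum[a] += (uint64_t)tile_sum[a];
        for (int b = a; b < channels; ++b) {
            g->sum_prod[a][b] += (uint64_t)tile_prod[a][b];
        }
    }
}
```

Calling `accumulate_tile` once per tile on a zero-initialized `GlobalAccum` leaves the image-wide Sum and SumProd values ready for post-processing.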

Floating-point Post Processing#

After looping over the whole image, the post processing function is called to carry out floating-point operations, including division for mean and normalization for variance/covariance.

Consider covariance between channel \(c\) and \(c'\), where pixels are already aligned to the same resolution, and N represents the pixel count:

\[\begin{split}\begin{align*} \text{Cov}\left(c, c'\right) &= \frac{1}{N-1}\left(\sum_{i, j} \left(\text{Pixel}\left(i, j, c\right) - \text{Mean}\left(c\right)\right) * \left(\text{Pixel}\left(i, j, c'\right) - \text{Mean}\left(c'\right)\right)\right)\\ &= \frac{1}{N-1}\left(\sum_{i, j} \text{Pixel}\left(i, j, c\right) * \text{Pixel}\left(i, j, c'\right) - N * \text{Mean}\left(c\right) * \text{Mean}\left(c'\right)\right) \\ &= \frac{1}{N-1}\left(\text{SumProd}\left(c, c'\right) - \text{Sum}\left(c\right) * \text{Sum}\left(c'\right)/N\right) \end{align*}\end{split}\]
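The equivalence of the first and last lines of this derivation can be checked numerically. The sketch below computes the covariance both ways in plain double arithmetic; it does not use the fixed-point scheme of the operator, and the function names are hypothetical. Both functions require at least 2 samples.

```c
#include <stddef.h>
#include <stdint.h>

/* Definition: average product of deviations from the mean (n >= 2). */
static double cov_two_pass(const uint8_t *a, const uint8_t *b, size_t n)
{
    double mean_a = 0.0, mean_b = 0.0, acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        mean_a += a[i];
        mean_b += b[i];
    }
    mean_a /= (double)n;
    mean_b /= (double)n;
    for (size_t i = 0; i < n; ++i) {
        acc += ((double)a[i] - mean_a) * ((double)b[i] - mean_b);
    }
    return acc / (double)(n - 1);
}

/* Single pass: only integer Sum and SumProd are accumulated in the loop (n >= 2). */
static double cov_from_sums(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint64_t sum_a = 0, sum_b = 0, sum_prod = 0;
    for (size_t i = 0; i < n; ++i) {
        sum_a    += a[i];
        sum_b    += b[i];
        sum_prod += (uint64_t)a[i] * b[i];
    }
    return ((double)sum_prod - (double)sum_a * (double)sum_b / (double)n)
           / (double)(n - 1);
}
```

For a = {1, 2, 3, 4} and b = {2, 4, 6, 9}, the sums are Sum(a) = 10, Sum(b) = 21, and SumProd = 64, so the covariance is (64 - 10 * 21 / 4) / 3 = 11.5 / 3, matching the two-pass result.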

The two floating-point divisions in the above formula involve very large uint64_t numerators. To achieve better precision, we replace the floating-point divisions with the following two steps:

integer division, record the quotient and remainder;
calculate the fractional part by fixed-point representation Q.24 for the remainder.
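The two steps above can be sketched as follows. This is a minimal illustration, assuming the denominator is below 2^24 and that the final value is assembled in a double; the name `divide_q24` is hypothetical, and the actual implementation may store or convert the result differently. Because the remainder is smaller than the denominator, shifting it left by 24 bits stays below 2^48 and cannot overflow a uint64_t.

```c
#include <stdint.h>

#define Q24_FRAC_BITS 24

/* Divides num by den (0 < den < 2^24) without a large floating-point division. */
static double divide_q24(uint64_t num, uint32_t den)
{
    /* Step 1: integer division, keeping the quotient and remainder. */
    uint64_t quotient  = num / den;
    uint64_t remainder = num % den;

    /* Step 2: fractional part as a Q.24 fixed-point value, in [0, 2^24). */
    uint64_t frac_q24 = (remainder << Q24_FRAC_BITS) / den;

    return (double)quotient + (double)frac_q24 / (double)(1u << Q24_FRAC_BITS);
}
```

For example, `divide_q24(7, 2)` yields exactly 3.5, and `divide_q24(1, 3)` differs from 1/3 by less than 2^-24, which is the resolution of the Q.24 fraction.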

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

ImageSize	ImageFormat	Stats	Mask	Execution Time	Submit Latency	Total Power
1920x1080	U8	COVARIANCE	OFF	0.103ms	0.014ms	15.388W
1920x1080	U8	COVARIANCE	ON	0.182ms	0.015ms	15.69W
1920x1080	RGB	COVARIANCE	OFF	0.828ms	0.015ms	12.04W
1920x1080	BGR	COVARIANCE	ON	1.018ms	0.016ms	11.757W
1920x1080	NV12_ER	COVARIANCE	ON	0.377ms	0.016ms	14.201W
1920x1080	NV12	COVARIANCE	OFF	0.240ms	0.015ms	14.584W