ImageResize#

Overview#

The ImageResize operator changes the image width and height by stretching or squeezing the image pixels. A wide range of combinations of input and output image sizes is supported; the resize scale factors can be any real numbers within the limits described in Limitations below.

Two kinds of interpolation methods are available: nearest neighbor (NN), and bilinear interpolation (BL).

The image formats supported in the current implementation are RGB, RGBA, NV12, U8, and U16. The implementation is general enough to support other image formats with minor changes in parameters.

Implementation#

The ImageResize operation places pixel centers at half-integer coordinates, i.e., the pixel with index \(i\) has its center at coordinate \(i + 0.5\). Specifically, denote the input image width and height by \(W_0\) and \(H_0\), and the output image width and height by \(W_1\) and \(H_1\). The output pixel at indices \((j_x, j_y)\) then has coordinates, in the scale of the input image, of \(\left((j_x + 0.5) * \frac{W_0}{W_1}, (j_y+0.5)*\frac{H_0}{H_1}\right)\).

For nearest neighbor, the nearest input pixel indices are calculated as:

(1)#\[i_x = NN\left[(j_x + 0.5) * \frac{W_0}{W_1} - 0.5\right]\]
(2)#\[i_y = NN\left[(j_y+0.5)*\frac{H_0}{H_1} - 0.5\right]\]

Here, \(NN\left[\cdot\right]\) rounds a real number to the nearest integer. The output pixel value at indices \((j_x, j_y)\) equals the input pixel value at indices \((i_x, i_y)\).
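
The mapping in (1), (2) can be summarized in a short Python reference model. This is only a sketch of the math, not the PVA kernel; `np.rint` stands in for \(NN[\cdot]\), and its handling of exact halves may differ from the device rounding mode.

```python
import numpy as np

def resize_nn(src, out_h, out_w):
    """Reference nearest-neighbor resize following Eqs. (1)-(2).

    src: 2D array (H0 x W0) holding a single-channel plane.
    """
    h0, w0 = src.shape
    jy, jx = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    ix = np.rint((jx + 0.5) * (w0 / out_w) - 0.5).astype(int)
    iy = np.rint((jy + 0.5) * (h0 / out_h) - 0.5).astype(int)
    # Clamp to the valid range (boundary-pixel extension for out-of-range indices).
    ix = np.clip(ix, 0, w0 - 1)
    iy = np.clip(iy, 0, h0 - 1)
    return src[iy, ix]
```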

For bilinear interpolation, in each of the X and Y directions we determine the two adjacent indices whose pixel centers bracket the fractional coordinate:

(3)#\[i_x = \text{floor}\left[(j_x + 0.5) * \frac{W_0}{W_1}-0.5\right]\]
(4)#\[i_y = \text{floor}\left[(j_y+0.5)*\frac{H_0}{H_1}-0.5\right]\]

where \(\text{floor}\left[\cdot\right]\) returns the largest integer no greater than the real number. The values at the 4 input indices \((i_x, i_y)\), \((i_x + 1, i_y)\), \((i_x, i_y + 1)\), and \((i_x + 1, i_y + 1)\) are then used to perform the bilinear interpolation.

For upsampling, the involved input indices can exceed the image range, in which case we extend the boundary values to out-of-range pixels.
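
A single-plane reference of the bilinear path, using the floor-based indices of (3), (4); the boundary extension is modeled here by clamping the neighbor indices, whereas the kernel realizes it through GSDF padding.

```python
import numpy as np

def resize_bilinear(src, out_h, out_w):
    """Reference bilinear resize following Eqs. (3)-(4) with boundary extension."""
    h0, w0 = src.shape
    jy, jx = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    fx = (jx + 0.5) * (w0 / out_w) - 0.5   # fractional x coordinate
    fy = (jy + 0.5) * (h0 / out_h) - 0.5   # fractional y coordinate
    ix, iy = np.floor(fx).astype(int), np.floor(fy).astype(int)
    wx, wy = fx - ix, fy - iy              # interpolation weights
    # Extend boundary values to out-of-range neighbors by clamping the indices.
    x0, x1 = np.clip(ix, 0, w0 - 1), np.clip(ix + 1, 0, w0 - 1)
    y0, y1 = np.clip(iy, 0, h0 - 1), np.clip(iy + 1, 0, h0 - 1)
    top = src[y0, x0] * (1 - wx) + src[y0, x1] * wx
    bot = src[y1, x0] * (1 - wx) + src[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```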

For image formats with two planes like NV12, the resize operation is applied to each plane independently.

Limitations#

The scale factors for resizing, defined as \(R_x = W_0/W_1\) for the width and \(R_y = H_0/H_1\) for the height, must not exceed 3. This limit applies only to downsampling, where the original dimensions are reduced; for upsampling, where the dimensions are increased, there is no limit on how much the image can be enlarged. The restriction arises from the limited VMEM buffer capacity. Nevertheless, in most applications the downsampling and upsampling ratios do not exceed 3, so practical use cases fall within the scope of the current implementation.

Another limitation is that the input and output image sizes must be larger than 64x64. The implementation tiles the output image, as explained in detail in the next section, and the smallest tile size allowed is 32x32. For upsampling, the input GatherScatterDataFlow (GSDF) achieves boundary-pixel extension (BPE) for out-of-range pixels by padding, but padding on both sides of one direction (i.e., both left and right, or both top and bottom) is not supported, so there must be at least 2 tiles in each direction. For the NV12 format, the UV plane of a 64x64 image is only 32x32, which yields a single tile and does not meet this requirement; consequently, the output image is required to be larger than 64x64. Likewise, when the input image is too small, a tile can also require padding on both sides of one direction. We therefore impose the universal condition that both the input and the output image be larger than 64x64. Again, this condition is satisfied by most practical usages.
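
A hedged sketch of how these two limitations might be validated up front; the function name and the strictness of the 64x64 comparison are illustrative, the text above is the authoritative statement.

```python
def check_resize_limits(in_w, in_h, out_w, out_h, max_scale=3.0, min_size=64):
    """Illustrative check of the documented limitations (not the product API)."""
    rx, ry = in_w / out_w, in_h / out_h        # scale factors R_x, R_y
    if rx > max_scale or ry > max_scale:
        raise ValueError("downsampling scale factor exceeds 3")
    if min(in_w, in_h, out_w, out_h) <= min_size:
        raise ValueError("input and output images must be larger than 64x64")
```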

Tiling on Output#

Tiling is applied to the output image by RasterDataFlow (RDF). For each output tile, the range of input pixel indices required for interpolation is calculated, and GSDF then reads only the necessary pixels using flexible offsets and tile sizes, adding padding by BPE where needed.

For image formats with multiple planes, the resize operation is done independently for each plane, and each plane has its own GSDF read and RDF write data flows. For multi-channel input/output in each plane, the pixel values are interleaved in the channel dimension (i.e. HWC layout).
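
The input window of a tile follows directly from the coordinate mapping. Below is a sketch for the bilinear case (one extra neighbor on each side); the variable names are illustrative and the clamping/padding split mirrors what BPE would have to supply.

```python
import math

def input_window(tile_x, tile_y, tile_w, tile_h, w0, h0, w1, h1):
    """Input pixel range needed to interpolate one output tile (bilinear case).

    (tile_x, tile_y) is the top-left output index of the tile. Returns the
    clamped input window plus the left/right/top/bottom padding that boundary
    extension must provide when the window falls outside the image.
    """
    rx, ry = w0 / w1, h0 / h1
    # First and last input indices touched by the tile (Eqs. (3)-(4), +1 neighbor).
    x_lo = math.floor((tile_x + 0.5) * rx - 0.5)
    y_lo = math.floor((tile_y + 0.5) * ry - 0.5)
    x_hi = math.floor((tile_x + tile_w - 0.5) * rx - 0.5) + 1
    y_hi = math.floor((tile_y + tile_h - 0.5) * ry - 0.5) + 1
    pad_l, pad_t = max(0, -x_lo), max(0, -y_lo)
    pad_r, pad_b = max(0, x_hi - (w0 - 1)), max(0, y_hi - (h0 - 1))
    x0, y0 = max(0, x_lo), max(0, y_lo)
    x1, y1 = min(w0 - 1, x_hi), min(h0 - 1, y_hi)
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1), (pad_l, pad_r, pad_t, pad_b)
```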

CupvaSampler Configuration#

The resize operation of each tile is carried out by CupvaSampler.

Nearest Neighbor by CupvaSamplerIndices2D#

For nearest neighbor, the CupvaSamplerIndices2D interface is utilized for 2D lookup with rounding for fractional indices. The indices must be precomputed and provided.

Bilinear Interpolation by CupvaSamplerTiles#

For bilinear interpolation, the CupvaSamplerTiles interface is applied for 2D linear interpolation with auto-index generation. The fractional indices are generated internally.

Reduce Bank Conflicts#

The performance of CupvaSampler depends on the number of bank conflicts, which is closely tied to the line pitch of its input. We align the input line pitch to \(32k + 2\) for bilinear interpolation, so that there is no bank conflict among the 4 adjoining points used to interpolate one pixel. This line pitch has empirically shown good performance for most scales.
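
A small helper illustrating one way to realize "align to \(32k+2\)": pick the smallest pitch of that form that still holds a line of the given width. The exact rule used by the implementation is an assumption here.

```python
import math

def aligned_pitch(width, step=32, extra=2):
    """Smallest pitch >= width of the form step*k + extra, e.g. 32k + 2."""
    return step * math.ceil(max(width - extra, 0) / step) + extra
```

The same idea with `step=64` would produce the \(64k+2\) alignment mentioned later for transpose load/store buffers.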

Fractional Coordinate Generation#

Within each tile, the fractional coordinates of output pixels to be interpolated are represented as:

(5)#\[x_i = \text{offset}_x + i * \text{step}_x\]
(6)#\[y_j = \text{offset}_y + j * \text{step}_y\]

In the above, the step sizes equal the scale factors, \(\text{step}_x = R_x\) and \(\text{step}_y = R_y\), and the offsets are the relative coordinates of the first point in the tile.

For the CupvaSamplerIndices2D interface, the fractional coordinates above are generated manually. It is clear from (5), (6) that the coordinates of different tiles differ only in the offsets. Since 2D lookup supports shifting indices by constant offsets, we generate the fractional coordinates of one tile with zero offsets in advance, and only update the offsets for each tile.

For the CupvaSamplerTiles interface, we only need to provide the transformation matrix \(\left[\text{offset}_x, \text{offset}_y, \text{step}_x, \text{step}_y\right]\), and the fractional coordinates are generated internally.
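
A sketch of (5), (6) for one tile. The offsets are shown here as absolute input coordinates; in the implementation they are relative to the tile's input window, which only shifts them by the window origin. The 4-tuple below merely mirrors the \(\left[\text{offset}_x, \text{offset}_y, \text{step}_x, \text{step}_y\right]\) ordering from the text, not the actual matrix layout of the sampler API.

```python
def tile_transform(tile_x, tile_y, w0, h0, w1, h1):
    """Offsets and steps of Eqs. (5)-(6) for the tile starting at (tile_x, tile_y)."""
    step_x, step_y = w0 / w1, h0 / h1           # step sizes equal the scale factors
    offset_x = (tile_x + 0.5) * step_x - 0.5    # coordinate of the first point in the tile
    offset_y = (tile_y + 0.5) * step_y - 0.5
    return offset_x, offset_y, step_x, step_y

def tile_coordinates(tile_w, tile_h, transform):
    """Fractional coordinates x_i, y_j inside one tile."""
    off_x, off_y, step_x, step_y = transform
    xs = [off_x + i * step_x for i in range(tile_w)]
    ys = [off_y + j * step_y for j in range(tile_h)]
    return xs, ys
```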

Two Levels of Fixed-Point Representation#

Fixed-point representation is employed for the fractional coordinates. Within each tile we use Q.16, the largest number of fractional bits supported by CupvaSampler. However, if Q.16 were used throughout the computations in (1), (2), (3), (4), the error in the fractional coordinate would accumulate to the order of \(\max(W_1, H_1) * 2^{-16}\) and propagate through the interpolation step to \(\max(W_1, H_1) * 2^{-16}*V_{max}\), where \(V_{max}\) is the maximum pixel value; this total error can become significant for large image sizes. To tackle this, we compute the starting indices of each tile in Q.32. Q.32 is chosen because it keeps the tile-start error below \(2^{-16}\), so that the error from the tile start is negligible compared with the Q.16 error inside CupvaSampler.
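
A short numeric illustration of the accumulation argument, assuming a 4K-wide output and 8-bit pixels:

```python
W1, V_max = 3840, 255           # example output width and 8-bit maximum value
q16_step_err = 2 ** -16         # worst-case quantization error of one Q.16 step
coord_err = W1 * q16_step_err   # accumulated coordinate error: ~0.059 pixel
value_err = coord_err * V_max   # propagated into the interpolated value: ~15 levels

# Computing each tile start in Q.32 keeps the tile-start error below 2^-16:
tile_start_err = W1 * 2 ** -32  # ~8.9e-7, negligible vs. the in-tile Q.16 error
```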

Pre/Post Processing#

Multi-channel Inputs#

For multi-channel inputs (RGB, RGBA, the UV plane of NV12), pixel values are interleaved in the channel dimension, but the interpolation has to be done for a single channel. Therefore, we introduce pre and post processing steps to split and merge channels. The procedure for resizing a multi-channel input is as follows:

  1. Split channels of the input image by transpose load, resulting in transposed single-channel images;

  2. For each channel, do sampling on the transposed input to get the transposed output;

  3. Merge the transposed single-channel outputs by transpose store.

When conducting sampling on the transposed image, parameters for X and Y directions in the transformation matrix have to be switched: \(\left[\text{offset}_y, \text{offset}_x, \text{step}_y, \text{step}_x\right]\).

Buffers with transpose load/store must have line pitch aligned to \(64k+2\).
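
A NumPy sketch of the split / transposed-sample / merge procedure. The transpose load/store and the per-plane sampler call are modeled by plain array operations (`resize_bilinear` is the reference function from the bilinear sketch above); pitch alignment and data-type handling are omitted.

```python
import numpy as np

def resize_interleaved(src_hwc, out_h, out_w):
    """Resize an HWC-interleaved image channel by channel via transposed planes."""
    channels = []
    for c in range(src_hwc.shape[2]):
        plane_t = src_hwc[:, :, c].T                    # 1. split + transpose load (W0 x H0)
        out_t = resize_bilinear(plane_t, out_w, out_h)  # 2. sample with X/Y parameters swapped
        channels.append(out_t.T)                        # 3. transpose store back to H1 x W1
    return np.stack(channels, axis=-1)                  # merge into the HWC output
```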

Fix the Padding of Multi-channel Inputs#

When pixels outside the image range are involved in interpolation, padding by BPE is applied in GSDF. But for multi-channel inputs interleaved in the channel dimension, BPE padding on the left and right borders only repeats the outermost channel value, instead of padding whole pixels in the channel-interleaved pattern as desired. For example, for the RGB8 format, padding the left border by 1 pixel adds the channels RRR, with R taken from the first column of the image, and padding the right border by 1 pixel adds the channels BBB, with B taken from the last column of the image.

To fix this issue, we introduce an extra pre-processing kernel function that corrects the padded pixels. Only the channels that are incorrect are rewritten: for padding on the left border, we update every channel of each padded pixel except the first; for padding on the right border, we update every channel except the last.
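
A sketch of that fix on a single channel-interleaved row with BPE padding on both sides; the buffer layout and the pad width are illustrative, not the actual kernel interface.

```python
import numpy as np

def fix_border_padding(row, channels, pad_px=1):
    """Correct BPE padding of a channel-interleaved row.

    row: 1D buffer = [left pad | image row | right pad], pad_px pixels per side,
    each pixel occupying `channels` interleaved elements.
    """
    c = channels
    first_pixel = row[pad_px * c : pad_px * c + c]    # first real pixel, e.g. R G B
    last_pixel = row[-pad_px * c - c : -pad_px * c]   # last real pixel
    for p in range(pad_px):
        # Left pad: channel 0 (the repeated outermost value) is already correct;
        # overwrite channels 1..c-1 with the first pixel's values.
        row[p * c + 1 : p * c + c] = first_pixel[1:]
        # Right pad: the last channel is already correct; fix channels 0..c-2.
        start = len(row) - (pad_px - p) * c
        row[start : start + c - 1] = last_pixel[:-1]
    return row
```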

Promotion/Demotion for Auto-index Generation#

For bilinear interpolation with auto-index generation, the CupvaSamplerTiles interface does not support U8 inputs (U8, RGB, RGBA, NV12). For these formats, we promote the input pixel values from U8 to U16 before calling the sampler and demote the sampling results from U16 back to U8.

For multi-channel inputs, the promotion/demotion is combined with the channel splitting/merging in the pre/post-processing kernel functions.
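
The conversion itself is a plain widen-then-narrow; a minimal sketch, where saturation on the way back down is an assumption about the desired behavior:

```python
import numpy as np

def promote_u8(plane_u8):
    """Widen U8 pixels to U16 before the auto-index bilinear sampling."""
    return plane_u8.astype(np.uint16)

def demote_u16(plane_u16):
    """Narrow the U16 sampling result back to U8 with saturation."""
    return np.clip(plane_u16, 0, 255).astype(np.uint8)
```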

Details for the pre- and post-processing steps are illustrated in the following figure:

../../_images/u8-sampler-nn-bl.png

Figure 1: flow chart for U8 data type images, bilinear interpolation and nearest neighbor#

Dynamic Tile Size#

Since the implementation covers multiple image formats with different channel counts (so the buffer proportions differ), a wide range of scale factors, and two interpolation methods with different auxiliary buffers, it is almost impossible to find a single VMEM allocation that gives high DMA efficiency in all scenarios. Moreover, downsampling performance is bounded by DMA latency, so better DMA efficiency greatly improves the overall performance.

Accordingly, we design an optimization mechanism that calculates the best output tile size dynamically. The function first obtains the optimal VMEM buffer allocation, determined in advance for the given image format, and then checks all conditions for a feasible tile size:

  1. Tile width and height must be less than 256 pixels, as required by the auto-index generation mode of CupvaSampler;

  2. Tile width is preferably 64-byte aligned, to improve VDB efficiency in the DMA write data flow;

  3. Tile height is aligned to 32 rows, to facilitate the pre and post processing kernels;

  4. Check the input/output buffer size: for multi-channel input/output, line pitches are aligned to \(64k+2\) for transpose load/store;

  5. Check the pre/post processing buffer size if pre/post processing exists: line pitch of the pre-processing buffer for bilinear interpolation is aligned to \(32k+2\);

  6. Check the RDF padding size: if the output image size is not a multiple of the output tile size, padding size in the last column/row tile must be smaller than 256.

The optimization goal is to minimize the number of tiles, in order to get a relatively large tile size and reduce the GSDF trigger overhead. This goal is justified by satisfactory performance in practice.

The smallest output tile size allowed is 32x32, in terms of pixels. Since the input/output tiles are interleaved in the channel dimension, the actual tile width in elements is the calculated tile width times the number of channels.
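
A simplified sketch of that search: enumerate candidate output tile sizes that satisfy constraints 1-3 above and keep the one minimizing the tile count. The VMEM check is reduced to a single illustrative byte budget, the width candidates are coarsened to multiples of 32 pixels, and the pitch and RDF padding checks (items 4-6) are omitted; the real allocator is format-specific.

```python
import math

def best_tile_size(out_w, out_h, channels, bpp, rx, ry, vmem_budget=384 * 1024):
    """Pick an output tile size (in pixels) minimizing the tile count (sketch only)."""
    best, best_cost = None, None
    for th in range(32, 256, 32):                  # height: multiple of 32 rows, < 256
        for tw in range(32, 256, 32):              # width: simplified candidate grid, < 256
            in_w = math.ceil(tw * rx) + 2          # input tile incl. bilinear neighbors (approx.)
            in_h = math.ceil(th * ry) + 2
            vmem = (in_w * in_h + tw * th) * channels * bpp   # crude input + output buffer estimate
            if vmem > vmem_budget:
                continue
            tiles = math.ceil(out_w / tw) * math.ceil(out_h / th)
            if best_cost is None or tiles < best_cost:
                best, best_cost = (tw, th), tiles
    return best
```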

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

| ResizeType | ImageFormat | InputImageSize | OutputImageSize | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|---|---|
| NN | U8 | 480x320 | 1920x1080 | 0.320ms | 0.020ms | 12.141W |
| NN | U8 | 1920x960 | 3840x2160 | 1.100ms | 0.022ms | 13.034W |
| NN | U8 | 1920x1080 | 1920x1080 | 0.320ms | 0.021ms | 13.519W |
| NN | U8 | 2880x1620 | 960x540 | 0.302ms | 0.016ms | 13.053W |
| NN | U8 | 3840x2160 | 1920x960 | 0.506ms | 0.019ms | 13.738W |
| NN | U16 | 480x320 | 1920x1080 | 0.353ms | 0.020ms | 13.317W |
| NN | U16 | 1920x960 | 3840x2160 | 1.193ms | 0.023ms | 14.302W |
| NN | U16 | 1920x1080 | 1920x1080 | 0.357ms | 0.021ms | 15.47W |
| NN | U16 | 2880x1620 | 960x540 | 0.591ms | 0.018ms | 12.953W |
| NN | U16 | 3840x2160 | 1920x960 | 1.087ms | 0.021ms | 13.538W |
| NN | RGB8 | 480x320 | 1920x1080 | 0.968ms | 0.022ms | 11.94W |
| NN | RGB8 | 1920x960 | 3840x2160 | 3.844ms | 0.021ms | 12.139W |
| NN | RGB8 | 1920x1080 | 1920x1080 | 1.151ms | 0.021ms | 12.239W |
| NN | RGB8 | 2880x1620 | 960x540 | 1.127ms | 0.020ms | 14.403W |
| NN | RGB8 | 3840x2160 | 1920x960 | 2.261ms | 0.025ms | 13.135W |
| NN | RGBA8 | 480x320 | 1920x1080 | 1.251ms | 0.023ms | 11.94W |
| NN | RGBA8 | 1920x960 | 3840x2160 | 5.017ms | 0.023ms | 12.139W |
| NN | RGBA8 | 1920x1080 | 1920x1080 | 1.487ms | 0.022ms | 12.622W |
| NN | RGBA8 | 2880x1620 | 960x540 | 1.398ms | 0.021ms | 13.337W |
| NN | RGBA8 | 3840x2160 | 1920x960 | 2.847ms | 0.025ms | 13.025W |
| NN | NV12 | 480x320 | 1920x1080 | 0.563ms | 0.022ms | 12.04W |
| NN | NV12 | 1920x960 | 3840x2160 | 2.032ms | 0.025ms | 12.241W |
| NN | NV12 | 1920x1080 | 1920x1080 | 0.596ms | 0.022ms | 12.825W |
| NN | NV12 | 2880x1620 | 960x540 | 0.558ms | 0.018ms | 13.135W |
| NN | NV12 | 3840x2160 | 1920x960 | 0.964ms | 0.021ms | 13.337W |
| LINEAR | U8 | 480x320 | 1920x1080 | 0.319ms | 0.020ms | 12.04W |
| LINEAR | U8 | 1920x960 | 3840x2160 | 1.210ms | 0.023ms | 12.723W |
| LINEAR | U8 | 1920x1080 | 1920x1080 | 0.391ms | 0.021ms | 13.208W |
| LINEAR | U8 | 2880x1620 | 960x540 | 0.367ms | 0.016ms | 13.819W |
| LINEAR | U8 | 3840x2160 | 1920x960 | 0.625ms | 0.019ms | 14.304W |
| LINEAR | U16 | 480x320 | 1920x1080 | 0.335ms | 0.021ms | 13.592W |
| LINEAR | U16 | 1920x960 | 3840x2160 | 1.226ms | 0.023ms | 14.001W |
| LINEAR | U16 | 1920x1080 | 1920x1080 | 0.338ms | 0.022ms | 15.469W |
| LINEAR | U16 | 2880x1620 | 960x540 | 0.572ms | 0.017ms | 13.536W |
| LINEAR | U16 | 3840x2160 | 1920x960 | 0.909ms | 0.021ms | 13.738W |
| LINEAR | RGB8 | 480x320 | 1920x1080 | 0.895ms | 0.025ms | 11.839W |
| LINEAR | RGB8 | 1920x960 | 3840x2160 | 3.702ms | 0.021ms | 12.04W |
| LINEAR | RGB8 | 1920x1080 | 1920x1080 | 1.118ms | 0.022ms | 13.006W |
| LINEAR | RGB8 | 2880x1620 | 960x540 | 0.931ms | 0.020ms | 14.302W |
| LINEAR | RGB8 | 3840x2160 | 1920x960 | 2.584ms | 0.025ms | 12.924W |
| LINEAR | RGBA8 | 480x320 | 1920x1080 | 1.238ms | 0.024ms | 11.94W |
| LINEAR | RGBA8 | 1920x960 | 3840x2160 | 5.005ms | 0.022ms | 12.038W |
| LINEAR | RGBA8 | 1920x1080 | 1920x1080 | 1.479ms | 0.023ms | 13.006W |
| LINEAR | RGBA8 | 2880x1620 | 960x540 | 1.483ms | 0.021ms | 13.236W |
| LINEAR | RGBA8 | 3840x2160 | 1920x960 | 3.280ms | 0.024ms | 12.824W |
| LINEAR | NV12 | 480x320 | 1920x1080 | 0.480ms | 0.023ms | 12.324W |
| LINEAR | NV12 | 1920x960 | 3840x2160 | 1.846ms | 0.025ms | 12.239W |
| LINEAR | NV12 | 1920x1080 | 1920x1080 | 0.619ms | 0.024ms | 12.723W |
| LINEAR | NV12 | 2880x1620 | 960x540 | 0.608ms | 0.018ms | 13.236W |
| LINEAR | NV12 | 3840x2160 | 1920x960 | 1.025ms | 0.021ms | 13.721W |