ImageResize#

Overview#

The ImageResize operator changes the image width and height by stretching or squeezing the image pixels. A wide range of combinations of input and output image sizes is supported; the resize scale factors can be any real numbers within the limits described in Limitations below.

Two kinds of interpolation methods are available: nearest neighbor (NN), and bilinear interpolation (BL).

The image formats supported in the current implementation are RGB, RGBA, NV12, U8, and U16. The implementation is general enough to support other image formats with minor changes in parameters.

Implementation#

The ImageResize operation places pixel centers at half-integer coordinates, i.e., the pixel with index \(i\) has its center at coordinate \(i + 0.5\). Specifically, denote the input image width and height by \(W_0\) and \(H_0\), and the output image width and height by \(W_1\) and \(H_1\). The output pixel at indices \((j_x, j_y)\) then has coordinates, in the scale of the input image, of \(\left((j_x + 0.5) * \frac{W_0}{W_1}, (j_y+0.5)*\frac{H_0}{H_1}\right)\).

For nearest neighbor, the nearest input pixel indices are calculated as:

(1)#\[i_x = NN\left[(j_x + 0.5) * \frac{W_0}{W_1} - 0.5\right]\]
(2)#\[i_y = NN\left[(j_y+0.5)*\frac{H_0}{H_1} - 0.5\right]\]

Here, \(NN\left[\cdot\right]\) rounds a real number to the nearest integer. The output pixel value at indices \((j_x, j_y)\) equals the input pixel value at indices \((i_x, i_y)\).
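
The mapping in (1), (2) can be summarized in a short Python reference model. This is only a sketch of the math, not the PVA kernel; `np.rint` stands in for \(NN[\cdot]\), and its handling of exact halves may differ from the device rounding mode.

```python
import numpy as np

def resize_nn(src, out_h, out_w):
    """Reference nearest-neighbor resize following Eqs. (1)-(2).

    src: 2D array (H0 x W0) holding a single-channel plane.
    """
    h0, w0 = src.shape
    jy, jx = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    ix = np.rint((jx + 0.5) * (w0 / out_w) - 0.5).astype(int)
    iy = np.rint((jy + 0.5) * (h0 / out_h) - 0.5).astype(int)
    # Clamp to the valid range (boundary-pixel extension for out-of-range indices).
    ix = np.clip(ix, 0, w0 - 1)
    iy = np.clip(iy, 0, h0 - 1)
    return src[iy, ix]
```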

For bilinear interpolation, in each of the X and Y directions we determine the two adjacent indices whose pixel centers bracket the fractional coordinate:

(3)#\[i_x = \text{floor}\left[(j_x + 0.5) * \frac{W_0}{W_1}-0.5\right]\]
(4)#\[i_y = \text{floor}\left[(j_y+0.5)*\frac{H_0}{H_1}-0.5\right]\]

where \(\text{floor}\left[\cdot\right]\) returns the largest integer no greater than the real number. The values at the 4 input indices \((i_x, i_y)\), \((i_x + 1, i_y)\), \((i_x, i_y + 1)\), and \((i_x + 1, i_y + 1)\) are then used to perform the bilinear interpolation.

For upsampling, the involved input indices can exceed the image range, in which case we extend the boundary values to out-of-range pixels.
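
A single-plane reference of the bilinear path, using the floor-based indices of (3), (4); the boundary extension is modeled here by clamping the neighbor indices, whereas the kernel realizes it through GSDF padding.

```python
import numpy as np

def resize_bilinear(src, out_h, out_w):
    """Reference bilinear resize following Eqs. (3)-(4) with boundary extension."""
    h0, w0 = src.shape
    jy, jx = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    fx = (jx + 0.5) * (w0 / out_w) - 0.5   # fractional x coordinate
    fy = (jy + 0.5) * (h0 / out_h) - 0.5   # fractional y coordinate
    ix, iy = np.floor(fx).astype(int), np.floor(fy).astype(int)
    wx, wy = fx - ix, fy - iy              # interpolation weights
    # Extend boundary values to out-of-range neighbors by clamping the indices.
    x0, x1 = np.clip(ix, 0, w0 - 1), np.clip(ix + 1, 0, w0 - 1)
    y0, y1 = np.clip(iy, 0, h0 - 1), np.clip(iy + 1, 0, h0 - 1)
    top = src[y0, x0] * (1 - wx) + src[y0, x1] * wx
    bot = src[y1, x0] * (1 - wx) + src[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```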

For image formats with two planes like NV12, the resize operation is applied to each plane independently.

Limitations#

The scale factors for resizing, defined as \(R_x = W_0/W_1\) for the width and \(R_y = H_0/H_1\) for the height, must not exceed 3. This limit applies only to downsampling, where the original dimensions are reduced; for upsampling, where the dimensions are increased, there is no limit on how much the image can be enlarged. The restriction arises from the limited VMEM buffer capacity. Nevertheless, in most applications the downsampling and upsampling ratios do not exceed 3, so practical use cases fall within the scope of the current implementation.

Another limitation is that the input and output image sizes must be larger than 64x64. The implementation tiles the output image, as explained in detail in the next section, and the smallest tile size allowed is 32x32. For upsampling, the input GatherScatterDataFlow (GSDF) achieves boundary-pixel extension (BPE) for out-of-range pixels by padding, but padding on both sides of one direction (i.e., both left and right, or both top and bottom) is not supported, so there must be at least 2 tiles in each direction. For the NV12 format, the UV plane of a 64x64 image is only 32x32, which yields a single tile and does not meet this requirement; consequently, the output image is required to be larger than 64x64. Likewise, when the input image is too small, a tile can also require padding on both sides of one direction. We therefore impose the universal condition that both the input and the output image be larger than 64x64. Again, this condition is satisfied by most practical usages.
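
A hedged sketch of how these two limitations might be validated up front; the function name and the strictness of the 64x64 comparison are illustrative, the text above is the authoritative statement.

```python
def check_resize_limits(in_w, in_h, out_w, out_h, max_scale=3.0, min_size=64):
    """Illustrative check of the documented limitations (not the product API)."""
    rx, ry = in_w / out_w, in_h / out_h        # scale factors R_x, R_y
    if rx > max_scale or ry > max_scale:
        raise ValueError("downsampling scale factor exceeds 3")
    if min(in_w, in_h, out_w, out_h) <= min_size:
        raise ValueError("input and output images must be larger than 64x64")
```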

Tiling on Output#

Tiling is applied to the output image by RasterDataFlow (RDF). For each output tile, the range of input pixel indices required for interpolation is calculated, and GSDF then reads only the necessary pixels using flexible offsets and tile sizes, adding padding by BPE where needed.

For image formats with multiple planes, the resize operation is done independently for each plane, and each plane has its own GSDF read and RDF write data flows. For multi-channel input/output in each plane, the pixel values are interleaved in the channel dimension (i.e. HWC layout).
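
The input window of a tile follows directly from the coordinate mapping. Below is a sketch for the bilinear case (one extra neighbor on each side); the variable names are illustrative and the clamping/padding split mirrors what BPE would have to supply.

```python
import math

def input_window(tile_x, tile_y, tile_w, tile_h, w0, h0, w1, h1):
    """Input pixel range needed to interpolate one output tile (bilinear case).

    (tile_x, tile_y) is the top-left output index of the tile. Returns the
    clamped input window plus the left/right/top/bottom padding that boundary
    extension must provide when the window falls outside the image.
    """
    rx, ry = w0 / w1, h0 / h1
    # First and last input indices touched by the tile (Eqs. (3)-(4), +1 neighbor).
    x_lo = math.floor((tile_x + 0.5) * rx - 0.5)
    y_lo = math.floor((tile_y + 0.5) * ry - 0.5)
    x_hi = math.floor((tile_x + tile_w - 0.5) * rx - 0.5) + 1
    y_hi = math.floor((tile_y + tile_h - 0.5) * ry - 0.5) + 1
    pad_l, pad_t = max(0, -x_lo), max(0, -y_lo)
    pad_r, pad_b = max(0, x_hi - (w0 - 1)), max(0, y_hi - (h0 - 1))
    x0, y0 = max(0, x_lo), max(0, y_lo)
    x1, y1 = min(w0 - 1, x_hi), min(h0 - 1, y_hi)
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1), (pad_l, pad_r, pad_t, pad_b)
```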

CupvaSampler Configuration#

The resize operation of each tile is carried out by CupvaSampler.

Nearest Neighbor by CupvaSamplerIndices2D#

For nearest neighbor, the CupvaSamplerIndices2D interface is utilized for 2D lookup with rounding for fractional indices. The indices must be precomputed and provided.

Bilinear Interpolation by CupvaSamplerTiles#

For bilinear interpolation, the CupvaSamplerTiles interface is applied for 2D linear interpolation with auto-index generation. The fractional indices are generated internally.

Reduce Bank Conflicts#

The performance of CupvaSampler depends on the number of bank conflicts, which is closely tied to the line pitch of its input. We align the input line pitch to \(32k + 2\) for bilinear interpolation, so that there is no bank conflict among the 4 adjoining points used to interpolate one pixel. This line pitch has empirically shown good performance for most scales.
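
A small helper illustrating one way to realize "align to \(32k+2\)": pick the smallest pitch of that form that still holds a line of the given width. The exact rule used by the implementation is an assumption here.

```python
import math

def aligned_pitch(width, step=32, extra=2):
    """Smallest pitch >= width of the form step*k + extra, e.g. 32k + 2."""
    return step * math.ceil(max(width - extra, 0) / step) + extra
```

The same idea with `step=64` would produce the \(64k+2\) alignment mentioned later for transpose load/store buffers.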

Fractional Coordinate Generation#

Within each tile, the fractional coordinates of output pixels to be interpolated are represented as:

(5)#\[x_i = \text{offset}_x + i * \text{step}_x\]
(6)#\[y_j = \text{offset}_y + j * \text{step}_y\]

In the above, the step sizes equal the scale factors, \(\text{step}_x = R_x\) and \(\text{step}_y = R_y\), and the offsets are the relative coordinates of the first point in the tile.

For the CupvaSamplerIndices2D interface, the fractional coordinates above are generated manually. It is clear from (5), (6) that the coordinates of different tiles differ only in the offsets. Since 2D lookup supports shifting indices by constant offsets, we generate the fractional coordinates of one tile with zero offsets in advance, and only update the offsets for each tile.

For the CupvaSamplerTiles interface, we only need to provide the transformation matrix \(\left[\text{offset}_x, \text{offset}_y, \text{step}_x, \text{step}_y\right]\), and the fractional coordinates are generated internally.
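
A sketch of (5), (6) for one tile. The offsets are shown here as absolute input coordinates; in the implementation they are relative to the tile's input window, which only shifts them by the window origin. The 4-tuple below merely mirrors the \(\left[\text{offset}_x, \text{offset}_y, \text{step}_x, \text{step}_y\right]\) ordering from the text, not the actual matrix layout of the sampler API.

```python
def tile_transform(tile_x, tile_y, w0, h0, w1, h1):
    """Offsets and steps of Eqs. (5)-(6) for the tile starting at (tile_x, tile_y)."""
    step_x, step_y = w0 / w1, h0 / h1           # step sizes equal the scale factors
    offset_x = (tile_x + 0.5) * step_x - 0.5    # coordinate of the first point in the tile
    offset_y = (tile_y + 0.5) * step_y - 0.5
    return offset_x, offset_y, step_x, step_y

def tile_coordinates(tile_w, tile_h, transform):
    """Fractional coordinates x_i, y_j inside one tile."""
    off_x, off_y, step_x, step_y = transform
    xs = [off_x + i * step_x for i in range(tile_w)]
    ys = [off_y + j * step_y for j in range(tile_h)]
    return xs, ys
```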

Two Levels of Fixed-Point Representation#

Fixed-point representation is employed for the fractional coordinates. Within each tile we use Q.16, the largest number of fractional bits supported by CupvaSampler. However, if Q.16 were used throughout the computations in (1), (2), (3), (4), the error in the fractional coordinate would accumulate to the order of \(\max(W_1, H_1) * 2^{-16}\) and propagate through the interpolation step to \(\max(W_1, H_1) * 2^{-16}*V_{max}\), where \(V_{max}\) is the maximum pixel value; this total error can become significant for large image sizes. To tackle this, we compute the starting indices of each tile in Q.32. Q.32 is chosen because it keeps the tile-start error below \(2^{-16}\), so that the error from the tile start is negligible compared with the Q.16 error inside CupvaSampler.
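
A short numeric illustration of the accumulation argument, assuming a 4K-wide output and 8-bit pixels:

```python
W1, V_max = 3840, 255           # example output width and 8-bit maximum value
q16_step_err = 2 ** -16         # worst-case quantization error of one Q.16 step
coord_err = W1 * q16_step_err   # accumulated coordinate error: ~0.059 pixel
value_err = coord_err * V_max   # propagated into the interpolated value: ~15 levels

# Computing each tile start in Q.32 keeps the tile-start error below 2^-16:
tile_start_err = W1 * 2 ** -32  # ~8.9e-7, negligible vs. the in-tile Q.16 error
```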

Pre/Post Processing#

Multi-channel Inputs#

For multi-channel inputs (RGB, RGBA, the UV plane of NV12), pixel values are interleaved in the channel dimension, but the interpolation has to be done for a single channel. Therefore, we introduce pre and post processing steps to split and merge channels. The procedure for resizing a multi-channel input is as follows:

  1. Split channels of the input image by transpose load, resulting in transposed single-channel images;

  2. For each channel, do sampling on the transposed input to get the transposed output;

  3. Merge the transposed single-channel outputs by transpose store.

When conducting sampling on the transposed image, parameters for X and Y directions in the transformation matrix have to be switched: \(\left[\text{offset}_y, \text{offset}_x, \text{step}_y, \text{step}_x\right]\).

Buffers with transpose load/store must have line pitch aligned to \(64k+2\).
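
A NumPy sketch of the split / transposed-sample / merge procedure. The transpose load/store and the per-plane sampler call are modeled by plain array operations (`resize_bilinear` is the reference function from the bilinear sketch above); pitch alignment and data-type handling are omitted.

```python
import numpy as np

def resize_interleaved(src_hwc, out_h, out_w):
    """Resize an HWC-interleaved image channel by channel via transposed planes."""
    channels = []
    for c in range(src_hwc.shape[2]):
        plane_t = src_hwc[:, :, c].T                    # 1. split + transpose load (W0 x H0)
        out_t = resize_bilinear(plane_t, out_w, out_h)  # 2. sample with X/Y parameters swapped
        channels.append(out_t.T)                        # 3. transpose store back to H1 x W1
    return np.stack(channels, axis=-1)                  # merge into the HWC output
```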

Fix the Padding of Multi-channel Inputs#

When pixels outside the image range are involved in interpolation, padding by BPE is applied in GSDF. But for multi-channel inputs interleaved in the channel dimension, BPE padding on the left and right borders only repeats the outermost channel value, instead of padding whole pixels in the channel-interleaved pattern as desired. For example, for the RGB8 format, padding the left border by 1 pixel adds the channels RRR, with R taken from the first column of the image, and padding the right border by 1 pixel adds the channels BBB, with B taken from the last column of the image.

To fix this issue, we introduce an extra pre-processing kernel function that corrects the padded pixels. Only the channels that are incorrect are rewritten: for padding on the left border, we update every channel of each padded pixel except the first; for padding on the right border, we update every channel except the last.
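
A sketch of that fix on a single channel-interleaved row with BPE padding on both sides; the buffer layout and the pad width are illustrative, not the actual kernel interface.

```python
import numpy as np

def fix_border_padding(row, channels, pad_px=1):
    """Correct BPE padding of a channel-interleaved row.

    row: 1D buffer = [left pad | image row | right pad], pad_px pixels per side,
    each pixel occupying `channels` interleaved elements.
    """
    c = channels
    first_pixel = row[pad_px * c : pad_px * c + c]    # first real pixel, e.g. R G B
    last_pixel = row[-pad_px * c - c : -pad_px * c]   # last real pixel
    for p in range(pad_px):
        # Left pad: channel 0 (the repeated outermost value) is already correct;
        # overwrite channels 1..c-1 with the first pixel's values.
        row[p * c + 1 : p * c + c] = first_pixel[1:]
        # Right pad: the last channel is already correct; fix channels 0..c-2.
        start = len(row) - (pad_px - p) * c
        row[start : start + c - 1] = last_pixel[:-1]
    return row
```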

Promotion/Demotion for Auto-index Generation#

For bilinear interpolation with auto-index generation, the CupvaSamplerTiles interface does not support U8 inputs (U8, RGB, RGBA, NV12). For these formats, we promote the input pixel values from U8 to U16 before calling the sampler and demote the sampling results from U16 back to U8.

For multi-channel inputs, the promotion/demotion is combined with the channel splitting/merging in the pre/post-processing kernel functions.
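
The conversion itself is a plain widen-then-narrow; a minimal sketch, where saturation on the way back down is an assumption about the desired behavior:

```python
import numpy as np

def promote_u8(plane_u8):
    """Widen U8 pixels to U16 before the auto-index bilinear sampling."""
    return plane_u8.astype(np.uint16)

def demote_u16(plane_u16):
    """Narrow the U16 sampling result back to U8 with saturation."""
    return np.clip(plane_u16, 0, 255).astype(np.uint8)
```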

Details for the pre- and post-processing steps are illustrated in the following figure:

../../_images/u8-sampler-nn-bl.png

Figure 1: flow chart for U8 data type images, bilinear interpolation and nearest neighbor#

Dynamic Tile Size#

Since the implementation covers multiple image formats with different channel counts (so the buffer proportions differ), a wide range of scale factors, and two interpolation methods with different auxiliary buffers, it is almost impossible to find a single VMEM allocation that gives high DMA efficiency in all scenarios. Moreover, downsampling performance is bounded by DMA latency, so better DMA efficiency greatly improves the overall performance.

Accordingly, we design an optimization mechanism that calculates the best output tile size dynamically. The function first obtains the optimal VMEM buffer allocation, determined in advance for the given image format, and then checks all conditions for a feasible tile size:

  1. Tile width and height must be less than 256 pixels, as required by the auto-index generation mode of CupvaSampler;

  2. Tile width is preferably 64-byte aligned, to improve VDB efficiency in the DMA write data flow;

  3. Tile height is aligned to 32 rows, to facilitate the pre and post processing kernels;

  4. Check the input/output buffer size: for multi-channel input/output, line pitches are aligned to \(64k+2\) for transpose load/store;

  5. Check the pre/post processing buffer size if pre/post processing exists: line pitch of the pre-processing buffer for bilinear interpolation is aligned to \(32k+2\);

  6. Check the RDF padding size: if the output image size is not a multiple of the output tile size, padding size in the last column/row tile must be smaller than 256.

The optimization goal is to minimize the number of tiles, in order to get a relatively large tile size and reduce the GSDF trigger overhead. This goal is justified by satisfactory performance in practice.

The smallest output tile size allowed is 32x32, in terms of pixels. Since the input/output tiles are interleaved in the channel dimension, the actual tile width in elements is the calculated tile width times the number of channels.
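
A simplified sketch of that search: enumerate candidate output tile sizes that satisfy constraints 1-3 above and keep the one minimizing the tile count. The VMEM check is reduced to a single illustrative byte budget, the width candidates are coarsened to multiples of 32 pixels, and the pitch and RDF padding checks (items 4-6) are omitted; the real allocator is format-specific.

```python
import math

def best_tile_size(out_w, out_h, channels, bpp, rx, ry, vmem_budget=384 * 1024):
    """Pick an output tile size (in pixels) minimizing the tile count (sketch only)."""
    best, best_cost = None, None
    for th in range(32, 256, 32):                  # height: multiple of 32 rows, < 256
        for tw in range(32, 256, 32):              # width: simplified candidate grid, < 256
            in_w = math.ceil(tw * rx) + 2          # input tile incl. bilinear neighbors (approx.)
            in_h = math.ceil(th * ry) + 2
            vmem = (in_w * in_h + tw * th) * channels * bpp   # crude input + output buffer estimate
            if vmem > vmem_budget:
                continue
            tiles = math.ceil(out_w / tw) * math.ceil(out_h / th)
            if best_cost is None or tiles < best_cost:
                best, best_cost = (tw, th), tiles
    return best
```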

Performance#

Execution Time is the average time required to execute the operator on a single VPU core. Note that each PVA contains two VPU cores, which can operate in parallel to process two streams simultaneously, or reduce execution time by approximately half by splitting the workload between the two cores.

Total Power represents the average total power consumed by the module when the operator is executed concurrently on both VPU cores.

For detailed information on interpreting the performance table below and understanding the benchmarking setup, see Performance Benchmark.

| ResizeType | ImageFormat | InputImageSize | OutputImageSize | Execution Time | Submit Latency | Total Power |
|---|---|---|---|---|---|---|
| NN | U8 | 480x320 | 1920x1080 | 0.320ms | 0.020ms | 12.141W |
| NN | U8 | 1920x960 | 3840x2160 | 1.100ms | 0.022ms | 13.034W |
| NN | U8 | 1920x1080 | 1920x1080 | 0.320ms | 0.021ms | 13.519W |
| NN | U8 | 2880x1620 | 960x540 | 0.302ms | 0.016ms | 13.053W |
| NN | U8 | 3840x2160 | 1920x960 | 0.506ms | 0.019ms | 13.738W |
| NN | U16 | 480x320 | 1920x1080 | 0.353ms | 0.020ms | 13.317W |
| NN | U16 | 1920x960 | 3840x2160 | 1.193ms | 0.023ms | 14.302W |
| NN | U16 | 1920x1080 | 1920x1080 | 0.357ms | 0.021ms | 15.47W |
| NN | U16 | 2880x1620 | 960x540 | 0.591ms | 0.018ms | 12.953W |
| NN | U16 | 3840x2160 | 1920x960 | 1.087ms | 0.021ms | 13.538W |
| NN | RGB8 | 480x320 | 1920x1080 | 0.968ms | 0.022ms | 11.94W |
| NN | RGB8 | 1920x960 | 3840x2160 | 3.844ms | 0.021ms | 12.139W |
| NN | RGB8 | 1920x1080 | 1920x1080 | 1.151ms | 0.021ms | 12.239W |
| NN | RGB8 | 2880x1620 | 960x540 | 1.127ms | 0.020ms | 14.403W |
| NN | RGB8 | 3840x2160 | 1920x960 | 2.261ms | 0.025ms | 13.135W |
| NN | RGBA8 | 480x320 | 1920x1080 | 1.251ms | 0.023ms | 11.94W |
| NN | RGBA8 | 1920x960 | 3840x2160 | 5.017ms | 0.023ms | 12.139W |
| NN | RGBA8 | 1920x1080 | 1920x1080 | 1.487ms | 0.022ms | 12.622W |
| NN | RGBA8 | 2880x1620 | 960x540 | 1.398ms | 0.021ms | 13.337W |
| NN | RGBA8 | 3840x2160 | 1920x960 | 2.847ms | 0.025ms | 13.025W |
| NN | NV12 | 480x320 | 1920x1080 | 0.563ms | 0.022ms | 12.04W |
| NN | NV12 | 1920x960 | 3840x2160 | 2.032ms | 0.025ms | 12.241W |
| NN | NV12 | 1920x1080 | 1920x1080 | 0.596ms | 0.022ms | 12.825W |
| NN | NV12 | 2880x1620 | 960x540 | 0.558ms | 0.018ms | 13.135W |
| NN | NV12 | 3840x2160 | 1920x960 | 0.964ms | 0.021ms | 13.337W |
| LINEAR | U8 | 480x320 | 1920x1080 | 0.319ms | 0.020ms | 12.04W |
| LINEAR | U8 | 1920x960 | 3840x2160 | 1.210ms | 0.023ms | 12.723W |
| LINEAR | U8 | 1920x1080 | 1920x1080 | 0.391ms | 0.021ms | 13.208W |
| LINEAR | U8 | 2880x1620 | 960x540 | 0.367ms | 0.016ms | 13.819W |
| LINEAR | U8 | 3840x2160 | 1920x960 | 0.625ms | 0.019ms | 14.304W |
| LINEAR | U16 | 480x320 | 1920x1080 | 0.335ms | 0.021ms | 13.592W |
| LINEAR | U16 | 1920x960 | 3840x2160 | 1.226ms | 0.023ms | 14.001W |
| LINEAR | U16 | 1920x1080 | 1920x1080 | 0.338ms | 0.022ms | 15.469W |
| LINEAR | U16 | 2880x1620 | 960x540 | 0.572ms | 0.017ms | 13.536W |
| LINEAR | U16 | 3840x2160 | 1920x960 | 0.909ms | 0.021ms | 13.738W |
| LINEAR | RGB8 | 480x320 | 1920x1080 | 0.895ms | 0.025ms | 11.839W |
| LINEAR | RGB8 | 1920x960 | 3840x2160 | 3.702ms | 0.021ms | 12.04W |
| LINEAR | RGB8 | 1920x1080 | 1920x1080 | 1.118ms | 0.022ms | 13.006W |
| LINEAR | RGB8 | 2880x1620 | 960x540 | 0.931ms | 0.020ms | 14.302W |
| LINEAR | RGB8 | 3840x2160 | 1920x960 | 2.584ms | 0.025ms | 12.924W |
| LINEAR | RGBA8 | 480x320 | 1920x1080 | 1.238ms | 0.024ms | 11.94W |
| LINEAR | RGBA8 | 1920x960 | 3840x2160 | 5.005ms | 0.022ms | 12.038W |
| LINEAR | RGBA8 | 1920x1080 | 1920x1080 | 1.479ms | 0.023ms | 13.006W |
| LINEAR | RGBA8 | 2880x1620 | 960x540 | 1.483ms | 0.021ms | 13.236W |
| LINEAR | RGBA8 | 3840x2160 | 1920x960 | 3.280ms | 0.024ms | 12.824W |
| LINEAR | NV12 | 480x320 | 1920x1080 | 0.480ms | 0.023ms | 12.324W |
| LINEAR | NV12 | 1920x960 | 3840x2160 | 1.846ms | 0.025ms | 12.239W |
| LINEAR | NV12 | 1920x1080 | 1920x1080 | 0.619ms | 0.024ms | 12.723W |
| LINEAR | NV12 | 2880x1620 | 960x540 | 0.608ms | 0.018ms | 13.236W |
| LINEAR | NV12 | 3840x2160 | 1920x960 | 1.025ms | 0.021ms | 13.721W |