VPU Programming Basics

The Vector Processing Unit (VPU) is a 7-way VLIW and Wide-SIMD vector processor. It supports multi-dimension address generation (up to six dimensions) in hardware to allow for efficient looping and address updates with zero overhead.

In general, programming vector code on the VPU consists of two parts:

Defining the dimensions/iterations of the processing loop via the Address Generators (AGEN).
Selecting an optimal set of vector ops in the loop body to process data.

This tutorial demonstrates some basic VPU programming concepts by converting the scalar convolution code from the RDF Halo Configuration Tutorial into vector code. The first part shows how to represent the convolution filter’s four-dimensional looping structure and pointer address updates as a series of AGEN initializations. The second part shows how to replace the scalar operations in the loop body with vector operations. The third part shows how the AGEN initialization, circular buffer maintenance, and convolution kernel functions are called within the VPU main function.

Overall, this tutorial covers:

Vector data types
Vector arithmetic operations
Vector load/store operations
Multi-dimensional looping via the VPU Address Generator (AGEN) unit

Device Code

When converting a scalar loop to a vector loop on the VPU, you are mapping the declaration of the “for” loop iterations and memory address calculations to the VPU’s AGEN units, and the arithmetic operations to their equivalent vector operations on the VPU.

Defining Your Multi-Dimensional Loop and Address Calculations (via AGENs)

In this example, we initialize three address generators, one for loading the source image, a second for loading filter coefficients, and a third for storing the output image. These address generators are configured separately in an initialization function:
```
void convolution_cb_knl_8b_init(uint8_t *src, int32_t srcLinePitch, int16_t *knl, uint8_t *dst, int32_t dstLinePitch,
                                int32_t tileWidth, int32_t tileHeight, int32_t kernelWidth, int32_t kernelHeight,
                                int32_t qbits, int32_t inCbSize, int32_t outCbSize, convolutionCbAgens &agens)
{
```
Creating an initialization function for your AGEN configurations is considered good practice for VPU programming as many of the settings can be reused across tiles. In many cases, only the tile start address needs to be reconfigured for each kernel call.

Address generators are defined via the AgenWrapper and agen types as follows:

    agens.niter           = kernelWidth * kernelHeight * (tileWidth / pva_elementsof(vchar)) * tileHeight;
    agens.niter_acc_reset = kernelWidth * kernelHeight;

    agen in1, in2, out;
    AgenWrapper wrapper1, wrapper2, wrapper3;

niter is the total iteration count across all dimensions of the loop, and niter_acc_reset is the number of iterations that are run before the MAC accumulator is reset.
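
As a worked example of these two counts, the host-side sketch below evaluates them for concrete sizes. It is plain C++ and does not use the VPU headers. The helper names and the constant kLanes (standing in for pva_elementsof(vchar), assumed to be 32 to match the 32 pixels per load described later in this tutorial) are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: stands in for pva_elementsof(vchar) on the host.
constexpr int32_t kLanes = 32;

// Total iterations of the collapsed loop (hypothetical helper).
int32_t totalIterations(int32_t tileWidth, int32_t tileHeight,
                        int32_t kernelWidth, int32_t kernelHeight)
{
    // The tile width must be a whole number of vectors for the division to be exact.
    assert(tileWidth % kLanes == 0);
    return kernelWidth * kernelHeight * (tileWidth / kLanes) * tileHeight;
}

// Iterations accumulated into one output vector before the accumulator resets.
int32_t accumulatorResetPeriod(int32_t kernelWidth, int32_t kernelHeight)
{
    return kernelWidth * kernelHeight;
}
```

For a 64x32 tile and a 5x5 kernel, totalIterations(64, 32, 5, 5) gives 1600 and accumulatorResetPeriod(5, 5) gives 25, so the loop produces 1600 / 25 = 64 output vectors (2 per row over 32 rows).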

We assign our data pointers to an AGEN unit via the “init” function, as shown below:
```
    in1 = init((vchar *)src);
    in2 = init((dvshort *)knl);
    out = init((vchar *)dst);
```
AgenWrapper is a helper structure for configuring AGENs. It contains fields for the iteration counts of up to six levels of multidimensional looping (.n<X>), the address update stride for each dimension (.s<X>), the data type size for address increments (.size), round bits (.round), and saturation bounds (.sat<lim/val>_<lo/hi>).

The “n<X>” field of the AgenWrapper defines the iteration counts for each dimension of the loop. In this case we have a 4-dimensional loop so we are making declarations for fields n1, n2, n3, and n4. These values are defined in the code as follows:

    wrapper1.n1 = kernelWidth;  /* "for (int32_t i = 0; i < KERNEL_WIDTH; i++)" */
    wrapper1.n2 = kernelHeight; /* "for (int32_t j = 0; j < KERNEL_HEIGHT; j++)" */
    /* "for (int32_t x = 0; x < TILE_WIDTH; x += pva_elementsof(vchar))" */
    wrapper1.n3 = tileWidth / pva_elementsof(vchar);
    wrapper1.n4 = tileHeight; /* "for (int32_t y = 0; y < TILE_HEIGHT; y++)" */

These same values are applied to the src, coeff, and dst AGENs.
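
To make the relationship between the four scalar loops and a single collapsed iteration counter concrete, the host-side sketch below recovers (i, j, x, y) from a flat iteration index. The LoopIndex type and decompose function are hypothetical, and the sketch assumes that n1 is the fastest-changing dimension and n4 the slowest, which is consistent with the accumulator being reset every kernelWidth * kernelHeight iterations.

```cpp
#include <cassert>
#include <cstdint>

// Loop indices recovered from a flat iteration counter (hypothetical type).
struct LoopIndex
{
    int32_t i; // kernel column, counts 0..n1-1
    int32_t j; // kernel row,    counts 0..n2-1
    int32_t x; // vector column, counts 0..n3-1 (in units of whole vectors)
    int32_t y; // tile row,      counts 0..n4-1
};

// Assumption: n1 is the fastest-changing dimension and n4 the slowest.
LoopIndex decompose(int32_t iter, int32_t n1, int32_t n2, int32_t n3, int32_t n4)
{
    assert(iter >= 0 && iter < n1 * n2 * n3 * n4);
    LoopIndex idx;
    idx.i = iter % n1;
    idx.j = (iter / n1) % n2;
    idx.x = (iter / (n1 * n2)) % n3;
    idx.y = iter / (n1 * n2 * n3);
    return idx;
}
```

With n1 = n2 = 5, n3 = 2, and n4 = 32, iteration 24 is the last tap of the first output vector (i = 4, j = 4, x = 0, y = 0), and iteration 25 starts the next output vector (i = 0, j = 0, x = 1, y = 0).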

The “s<X>” field of the AgenWrapper defines the strides, in units of elements, to advance in each dimension. In this case we have a 4-dimensional loop so we are making declarations for fields s1, s2, s3, and s4.
The following code shows the setup of the stride fields for the src pointer address calculation:
```
    wrapper1.size = sizeof(uint8_t);
```

The following code shows the setup of the stride fields for the kernel pointer address calculation:

    /* knl[j * KERNEL_WIDTH + i] */
    wrapper2.s1 = 1;           /* "+i" component of the coeff ptr address calculation */
    wrapper2.s2 = kernelWidth; /* "j*kernelWidth" component of the coeff ptr address calculation */
    wrapper2.s3 = 0;
    wrapper2.s4 = 0;
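
The effect of these stride values can be checked with a small host-side model. The sketch below is not the hardware implementation: agenOffsets and kernelOffsets are hypothetical helpers that treat the element offset at each iteration as the sum of each dimension's counter times its stride, with dimension 1 as the innermost loop (an assumption of this sketch).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a 4-D address generator: returns the element
// offset visited at every iteration, with dimension 1 as the innermost loop.
std::vector<int32_t> agenOffsets(const int32_t n[4], const int32_t s[4])
{
    std::vector<int32_t> offsets;
    for (int32_t d4 = 0; d4 < n[3]; d4++)
        for (int32_t d3 = 0; d3 < n[2]; d3++)
            for (int32_t d2 = 0; d2 < n[1]; d2++)
                for (int32_t d1 = 0; d1 < n[0]; d1++)
                    offsets.push_back(d1 * s[0] + d2 * s[1] + d3 * s[2] + d4 * s[3]);
    return offsets;
}

// Offsets for the coefficient AGEN using the strides shown above:
// s1 = 1, s2 = kernelWidth, s3 = 0, s4 = 0.
std::vector<int32_t> kernelOffsets(int32_t kernelWidth, int32_t kernelHeight,
                                   int32_t vectorsPerRow, int32_t tileHeight)
{
    const int32_t n[4] = {kernelWidth, kernelHeight, vectorsPerRow, tileHeight};
    const int32_t s[4] = {1, kernelWidth, 0, 0};
    return agenOffsets(n, s);
}
```

Because s3 and s4 are zero, the coefficient offsets run through 0..kernelWidth*kernelHeight-1 and then start over for every output vector, which matches knl[j * KERNEL_WIDTH + i] being independent of x and y.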

The following code shows the setup of the stride fields for the dst pointer address calculation:
```
    wrapper3.size = sizeof(uint8_t);
```

The “size” field of the AgenWrapper sets the number of bytes per element.

    wrapper1.size = sizeof(uint8_t);

    wrapper2.size = sizeof(int16_t);

    wrapper3.size = sizeof(uint8_t);

The src and dst pointers use circular buffering. We configure the agen units to be aware of this via update_agen_cb_start() and update_agen_cb_size(), which define the starting address of the buffer, and its size.

    /* sets cbStart for "% srcCircularBufLen" component of address calculation */
    in1 = update_agen_cb_start(in1, (intptr_t)src);
    /* sets srcCircularBufLen for "% srcCircularBufLen" component of address calculation */
    in1 = update_agen_cb_size(in1, inCbSize);

    /* sets cbStart for "% dstCircularBufLen" component of address calculation */
    out = update_agen_cb_start(out, (intptr_t)dst);
    /* sets dstCircularBufLen for "% dstCircularBufLen" component of address calculation */
    out = update_agen_cb_size(out, outCbSize);

The circular buffer size set via update_agen_cb_size() is specified in bytes.
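
The "% CircularBufLen" component mentioned in the comments above can be modeled on the host as a simple modulo on the byte offset from the buffer start. The helpers below are hypothetical and only illustrate why the size must be given in bytes: the element index is scaled by the element size before wrapping. How the hardware performs the wrap internally is not specified here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of circular-buffer addressing: a linear byte offset from
// the buffer start is wrapped back into [0, cbSizeBytes).
int32_t wrapOffset(int32_t byteOffset, int32_t cbSizeBytes)
{
    assert(byteOffset >= 0 && cbSizeBytes > 0);
    return byteOffset % cbSizeBytes;
}

// Converts an element index to a wrapped byte offset. The circular buffer size
// is in bytes, so the element index is scaled by the element size first.
int32_t wrappedByteOffset(int32_t elementIndex, int32_t elementSizeBytes, int32_t cbSizeBytes)
{
    return wrapOffset(elementIndex * elementSizeBytes, cbSizeBytes);
}
```

For example, with an 8-byte buffer, element 10 of a 1-byte type wraps to byte offset 2, while element 5 of a 2-byte type also wraps to byte offset 2.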

Output data is stored as 8-bit, so we must round and saturate the accumulator output down to 8-bit unsigned precision. The destination AGEN is configured to do this via the round, sat_opt, sat_lim_lo, sat_val_lo, sat_lim_hi, and sat_val_hi fields of the “out” AGEN, as shown below:
```
    out.round      = qbits; /* outputPixelAccumulator = ((outputPixelAccumulator >> (quantizationBits - 1)) + 1) >> 1 */
    out.sat_opt    = 3;
    out.sat_lim_lo = 0;
    out.sat_val_lo = 0; /* outputPixelAccumulator = max(0, outputPixelAccumulator) */
    out.sat_lim_hi = 255;
    out.sat_val_hi = 255; /* outputPixelAccumulator = min(outputPixelAccumulator, 255) */
```
The sat_opt field tells the AGEN whether or not to perform saturation. The options are as follows:

* sat_opt = 0: default, no saturation is performed
* sat_opt = 1: no saturation is performed
* sat_opt = 2: signed saturation
* sat_opt = 3: unsigned saturation

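
The rounding and saturation configured on the “out” AGEN can be approximated by the scalar model below. The rounding expression follows the comment on out.round. The function names, the handling of qbits == 0, and the interpretation of the limit/value pairs (a result below the low limit is replaced by the low value, a result above the high limit by the high value) are assumptions of this sketch rather than a statement of the hardware behavior.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the rounding step: shift right by qbits with round-half-up.
// Treating qbits <= 0 as "no rounding" is an assumption of this sketch.
// Relies on arithmetic right shift for negative values.
int32_t roundShift(int32_t acc, int32_t qbits)
{
    if (qbits <= 0)
        return acc;
    return ((acc >> (qbits - 1)) + 1) >> 1;
}

// Scalar model of saturation (assumed interpretation of the lim/val pairs).
int32_t saturate(int32_t v, int32_t limLo, int32_t valLo, int32_t limHi, int32_t valHi)
{
    if (v < limLo)
        return valLo;
    if (v > limHi)
        return valHi;
    return v;
}

// Round, then saturate to the unsigned 8-bit range used in this tutorial.
uint8_t quantizePixel(int32_t acc, int32_t qbits)
{
    return static_cast<uint8_t>(saturate(roundShift(acc, qbits), 0, 0, 255, 255));
}
```

With qbits = 2, an accumulator value of 1002 rounds to 251, while 5000 rounds to 1250 and is then saturated to 255. Negative results saturate to 0.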
Once the AGEN fields are defined, the INIT_AGEN<X>() macro is called to initialize the AGEN unit with the parameters defined in the AgenWrapper. Here we use INIT_AGEN4, the 4D AGEN init macro.
```
    INIT_AGEN4(in1, wrapper1);
    INIT_AGEN4(in2, wrapper2);
    INIT_AGEN4(out, wrapper3);
```

Vector Code

In the previous section, we defined the dimensions of our convolution loop and the pointer arithmetic for the src, dst, and kernel data. In this section we define the vector operations that comprise the body of the loop. The operations used in the convolution loop body consist of vector loads, stores, and multiply-accumulates (MACs). The VPU instruction set architecture provides various flavors of these operations to support different combinations of input/output datatypes. The naming convention for VPU intrinsics incorporates the supported datatypes of the instruction. The VPU word length is 32-bit, so intrinsics that use a 32-bit integer type are often suffixed with “w” for “word.” Similarly, 16-bit intrinsics use “h” for half-word, and 8-bit intrinsics use “b” for byte. Starting with PVA GEN2, floating point vector operations are available and use the suffix “f” for 32-bit float or “hf” for 16-bit float. Operations that widen often have a combination of two suffixes; for example, “hw” means a “half-word” input that widens to a “word” output.

The convolution kernel function convolution_cb_knl_8b() is called with the preconfigured AGEN parameters convAgen as its input. The AGEN configuration is defined in a structure as follows:
```
struct convolutionCbAgens
{
    AgenCFG cfg[3];
    int32_t niter;
    int32_t niter_acc_reset;
};
```
AgenCFG holds the AGEN configuration, niter is the total iteration count across all dimensions of the loop, and niter_acc_reset is the number of iterations that are run before the MAC accumulator is reset. When entering the convolution kernel function, the AGEN configurations are extracted and assigned to their corresponding AGEN units via the init_agen<X>_from_cfg() function as shown below, where “X” refers to the memory bank (A, B, or C) the particular AGEN unit is accessing.
```
    agen_A in1 = init_agen_A_from_cfg(agens.cfg[0]);
    agen_C in2 = init_agen_C_from_cfg(agens.cfg[1]);
    agen_B out = init_agen_B_from_cfg(agens.cfg[2]);
```
The vector code is contained within a single “for” loop whose iteration count is the total of all the iterations from each nested loop level of the original scalar code from the RDF Halo Configuration Tutorial. In other words, we have collapsed the four-level nested loop down to one level, and now rely on the AGEN units set up in the first part of this tutorial to perform the correct pointer indexing at each iteration of the loop. Therefore, the iteration count of the collapsed loop is \(\text{niter} = \text{kernelWidth} \times \text{kernelHeight} \times (\text{tileWidth} / \text{pva\_elementsof(vchar)}) \times \text{tileHeight}\).
```
    for (int32_t i = 0; i < agens.niter; i++) chess_prepare_for_pipelining
    {
```
The source 8-bit data is loaded from memory and promoted to a 16-bit double vector by a load-with-promotion instruction. A 16-bit double vector has 32 SIMD lanes, where the lower 16 lanes are accessed via vreg.lo and the upper 16 lanes are accessed via vreg.hi. The convolution performs MACs with 16-bit coefficients, so the 32 8-bit pixels loaded from input memory are promoted into a 32-way 16-bit double vector. The 16-bit coefficients are loaded from memory via the signed vector load for half-words.
```
        dvdataInH = vuchar_dvshortx_load(in1);
        vcoef     = vshort_load_hs(in2);
```
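
The promotion performed by the load can be pictured with the host-side sketch below. DoubleVectorShort, loadPromote, and rampPixels are hypothetical stand-ins rather than VPU types or intrinsics, and the assignment of pixels 0..15 to .lo and 16..31 to .hi is an assumption of this sketch; the actual lane ordering of the hardware may differ. The point it illustrates is that unsigned 8-bit values are zero-extended, so 255 stays 255 in a signed 16-bit lane.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for a 16-bit double vector: two halves of 16 lanes.
struct DoubleVectorShort
{
    std::array<int16_t, 16> lo;
    std::array<int16_t, 16> hi;
};

// Widens 32 unsigned 8-bit pixels to 16-bit lanes.
// Assumption: pixels 0..15 land in .lo and pixels 16..31 in .hi.
DoubleVectorShort loadPromote(const uint8_t *src)
{
    DoubleVectorShort v{};
    for (int lane = 0; lane < 16; lane++)
    {
        v.lo[lane] = static_cast<int16_t>(src[lane]);      // zero-extended: 0..255
        v.hi[lane] = static_cast<int16_t>(src[lane + 16]); // zero-extended: 0..255
    }
    return v;
}

// Test pattern: 32 consecutive pixel values starting at `start`, wrapping at 256.
std::array<uint8_t, 32> rampPixels(uint8_t start)
{
    std::array<uint8_t, 32> p{};
    for (int k = 0; k < 32; k++)
        p[k] = static_cast<uint8_t>(start + k);
    return p;
}
```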
To multiply each promoted pixel by its coefficient and accumulate the result, we use the half-word (h) version of the “Vector Multiply-Add with Clear Accumulator” instruction (vmaddh).
```
        /* outputPixelAccumulator += (int32_t)sourcePixel * coefficient;
         * outputPixelAccumulator is reset to zero via "pred_madd" every niter_acc_reset iterations */
        dvacc.lo = vmaddh(dvdataInH.lo, vcoef, dvacc.lo, 0, pred_madd);
        dvacc.hi = vmaddh(dvdataInH.hi, vcoef, dvacc.hi, 0, pred_madd);
```
The output of this instruction is a 16-bit vector. It is called twice to compute the lower and upper halves of the 16-bit double vector. The fourth argument to the vmaddh instruction sets a per-MAC rounding. In this example, rounding is applied only at the end of the 5x5 MAC, so this parameter is set to zero. The fifth argument, pred_madd, controls when the accumulator is cleared. The value of pred_madd is incremented by one at each loop iteration and wraps back to zero after \(\text{niter\_in} = \text{kernelWidth} \times \text{kernelHeight}\) iterations, which is the number of iterations required to compute the 5x5 filter output and the value held in agens.niter_acc_reset. When the predicate is off, only multiply-round is performed, which effectively clears the accumulator.
The filter output is stored via the vstore_hb instruction, which is predicated to store only at the end of each group of niter_in iterations.
```
        vstore_hb(dvacc, out, pred_store); /* Note: vstore_hb supported only on Gen2 or above devices */
```
pred_madd is updated via the mod_inc instruction, which allows us to do modulo increments with a single instruction.
```
        pred_madd = mod_inc(pred_madd, agens.niter_acc_reset - 1);
```
pred_store is controlled via a call to the mod_inc_pred_z instruction, which sets the pred_store predicate based on the counter value after the modulo increment.
```
        s_ctrl_store = mod_inc_pred_z(s_ctrl_store, agens.niter_acc_reset - 1, pred_store);
    }
```
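
The control flow of the loop body can be summarized with a one-lane scalar model. The sketch below is a host-side illustration, not VPU code: modInc is a hypothetical stand-in with assumed wrap-around semantics, the accumulator counter is assumed to start at zero, and the store condition models only the intended effect (store on the last iteration of each group) rather than the exact behavior of mod_inc_pred_z.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed semantics for this sketch: increment, wrapping to 0 after `limit`.
int32_t modInc(int32_t value, int32_t limit)
{
    return (value >= limit) ? 0 : value + 1;
}

// One-lane model of the collapsed loop. `pixels` and `coeffs` are already in
// iteration order (one product per iteration); `period` plays the role of
// niter_acc_reset. Returns the value "stored" at the end of each period.
std::vector<int32_t> collapsedMac(const std::vector<int32_t> &pixels,
                                  const std::vector<int32_t> &coeffs, int32_t period)
{
    assert(period > 0 && pixels.size() == coeffs.size());
    std::vector<int32_t> stored;
    int32_t acc = 0;
    int32_t predMadd = 0; // 0 = predicate off: multiply only, which restarts the sum
    for (std::size_t it = 0; it < pixels.size(); it++)
    {
        const int32_t product = pixels[it] * coeffs[it];
        acc = (predMadd != 0) ? acc + product : product;
        // Store only on the last iteration of each period.
        if (predMadd == period - 1)
            stored.push_back(acc);
        predMadd = modInc(predMadd, period - 1);
    }
    return stored;
}
```

With a period of 3, the inputs {1, 2, 3, 4, 5, 6} and unit coefficients produce two stored values, 6 and 15, showing that the accumulator restarts at the first iteration of each group without a separate clear step.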

VPU Main

Below is the code for invoking the vector convolution kernel. These function calls are surrounded by the same DMA triggering code from RDF Halo Configuration Tutorial.

convolution_cb_knl_8b_init() initializes the AGENs into a configuration structure, convAgen. The AGEN configurations for the loop and the data pointers are set up in this function.

    convolution_cb_knl_8b_init(&inputTileBufferVMEM[0], srcLinePitch, kernel, &outputTileBufferVMEM[0], dstLinePitch,
                               TILE_WIDTH, TILE_HEIGHT, KERNEL_WIDTH, KERNEL_HEIGHT, quantizationBits,
                               srcCircularBufLen * sizeof(char), dstCircularBufLen * sizeof(char), convAgen);

    for (int32_t i = 0; i < tileCount; i++)
    {
        uint8_t *inputTile  = (uint8_t *)cupvaRasterDataFlowAcquire(sourceDataFlowHandler);
        uint8_t *outputTile = (uint8_t *)cupvaRasterDataFlowAcquire(destinationDataFlowHandler);

convolution_cb_knl_8b_init() is called only once, before the tile loop, because all the configuration settings remain the same except for the src/dst address. The src/dst address is updated for each tile with a separate function, cupvaModifyAgenCfgBase().

cupvaModifyAgenCfgBase() is called to update the address of the src and dst tiles to reflect the RasterDataFlow offset updates. The new addresses are obtained from calling cupvaRasterDataFlowAcquire().

        /* advance the input agen to the next tile */
        cupvaModifyAgenCfgBase(&convAgen.cfg[0], inputTile);

        /* advance the output agen to the next tile */
        cupvaModifyAgenCfgBase(&convAgen.cfg[2], outputTile);

convolution_cb_knl_8b() runs the convolution vector kernel, with the AGEN configuration structure convAgen passed as its only argument.
```
        convolution_cb_knl_8b(convAgen);
```