ROI Align Layer on VPU

The ROI Align layer is used in Convolutional Neural Networks (CNNs) to more precisely extract ROI candidates from image feature maps. As reviewed in ROI Gather using GSDF Tutorial, the output of the ROI Align layer is a concatenated set of ROI feature maps that are given to subsequent CNNs for detection, segmentation, and/or classification. The ROI Align layer takes as its input a FW x FH x C feature-map and a list of ROI coordinates (where FW is feature-map width, FH is feature-map height, and C is the number of channels). The ROI Align layer operates on the feature-map data regions that correspond to these ROI locations. ROI Gather using GSDF Tutorial demonstrated how to gather ROI data via the cuPVA GatherScatter Dataflow (GSDF) APIs. Sampler APIs Tutorial demonstrated how to bilinearly sample ROI feature-map data with fractional coordinates via the DLUT Unit. This tutorial fully implements an ROI Align layer via vectorized code on the VPU.

Device Code

The ROI Align algorithm parameters are contained within a data structure that is defined in a common header file. The structure is initialized on the host side and its parameters are used in the device code.

typedef struct _RoiAlignParams
{
    void *feature_map;    /* pointer to input feature map */
    void *roi_buffer;     /* pointer to ROI information list */
    void *output_data;    /* pointer to output ROI data */
    int output_size;      /* pooled width * pooled height * num of channels * num of roi's*/
    float spatial_scale;  /* a scaling factor that maps ROI image coordinates to feature map coordinates */
    int channels;         /* number of channels in input feature map*/
    int height;           /* height of input feature map */
    int width;            /* width of input feature map*/
    int pooled_height;    /* ROI height after pooling */
    int pooled_width;     /* ROI width after pooling */
    float sampling_ratio; /* number of sampling points */
    int num_rois;         /* number of ROIs */
    int batch_count;      /* number of batches */

} RoiAlignParams;

These fields are populated from the arguments that the user supplies on the command line.

The following VMEM buffer is used to hold the ROI coordinates transferred from DRAM.

VMEM(A, float, roi_buffer, MAX_ROIS * ROI_INFO_ENTRY_SIZE);

This buffer contains the ROI coordinates interleaved in the following format: [batch idx, top-left x, top-left y, bottom-right x, bottom-right y]

This data is deinterleaved into the following separate buffers for SIMD processing on the VPU.

VMEM(A, float, roi_top_left_x_buf, MAX_ROIS);
VMEM(A, float, roi_top_left_y_buf, MAX_ROIS);
VMEM(A, float, roi_start_global_w_buf, MAX_ROIS);
VMEM(A, float, roi_start_global_h_buf, MAX_ROIS);
VMEM(A, float, roi_bottom_right_x_buf, MAX_ROIS);
VMEM(A, float, roi_bottom_right_y_buf, MAX_ROIS);

The following buffers are used to store the feature-map tile dimensions for SIMD processing on the VPU. These are used to update the SQDF tile dimensions and addresses.

VMEM(A, int32_t, feature_map_tile_w_buf, MAX_ROIS);
VMEM(A, int32_t, feature_map_tile_h_buf, MAX_ROIS);
VMEM(C, int32_t, feature_map_roi_start_idx_buf, MAX_ROIS);
VMEM(C, int32_t, feature_map_roi_start_idx_buf2, ROIS_PER_BATCH *MAX_CHANNELS_PER_BATCH);

The following buffers are used to store ROI Align bin sizes for SIMD processing on the VPU.
```
VMEM(A, float, bin_size_h_buf, MAX_ROIS);
VMEM(A, float, bin_size_w_buf, MAX_ROIS);
```
To create a fixed-size feature-map output for an ROI, its corresponding feature-map data is fetched and its area is subdivided into [pooled_width x pooled_height] bins for sampling by the DLUT unit. The sampled points in each bin are averaged (average pooling) to produce the fixed-size “pooled_width x pooled_height” output feature-map.
The following buffer is declared to store the pooled outputs in VMEM.
```
VMEM(C, float, output_data, OUTPUT_BUF_SIZE);
```
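
As a rough host-side illustration of the binning described above, the sketch below computes the bin size for one ROI in scalar C++. It assumes the conventional ROI Align definition (ROI extent divided by the pooled dimension, with the extent clamped to at least 1); the tutorial's device code may fill `bin_size_w_buf` and `bin_size_h_buf` differently. `BinSize` and `computeBinSize` are hypothetical names that do not appear in the tutorial sources.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: size of one pooling bin in feature-map units.
struct BinSize
{
    float w;
    float h;
};

// Assumption: conventional ROI Align binning. The coordinates are already
// scaled to feature-map space, and the ROI extent is clamped to at least 1.
inline BinSize computeBinSize(float x1, float y1, float x2, float y2, int pooled_width, int pooled_height)
{
    assert(pooled_width > 0 && pooled_height > 0);
    const float roi_w = std::max(x2 - x1, 1.0f);
    const float roi_h = std::max(y2 - y1, 1.0f);
    return BinSize{roi_w / static_cast<float>(pooled_width), roi_h / static_cast<float>(pooled_height)};
}
```

For example, an 8 x 4 ROI pooled to 4 x 4 would yield bins that are 2 wide and 1 high under these assumptions.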

The handles used for ROI coordinates and pooled output transfers are declared here.

VMEM_SEQDF_HANDLER(B, pooledFeatureMapTrig);
VMEM_SEQDF_HANDLER(A, coords_trig);

We allocate VMEM space to hold the ROI Align algorithm parameter variables/pointers that are assigned by the host (via user-defined command-line args).

VMEM(A, int32_t, output_size);
VMEM(A, float, spatial_scale);
VMEM(A, int32_t, channels);
VMEM(A, int32_t, height);
VMEM(A, int32_t, width);
VMEM(A, int32_t, pooled_height);
VMEM(A, int32_t, pooled_width);
VMEM(A, float, sampling_ratio);
VMEM(A, int32_t, batch_count);
VMEM_POINTER(A, feature_map_dram_addr);

Since we do not know the ROI coordinates until runtime, the cuPVA SequenceDataFlow APIs are used to facilitate transfers of the feature map data for each ROI. To use the SQDF APIs, we must first initialize the sequence dataflow. The dataflow holds the sequence data transfer information in a specified location in VMEM that the PVA DMA engine accesses. At the beginning of CUPVA_VPU_MAIN we initialize the SQDF dataflow via a call to cupvaSQDFOpen().
```
CUPVA_VPU_MAIN()
{
```
```
    int32_t feature_map_base_addr = cupvaGetVmemAddress(feature_map_buf);
    cupvaSQDFOpen(vpu_cfg_tbl);
```

In this tutorial, the majority of the processing is done via vectorized code on the VPU. Therefore, we must prepare the input data to be processed in a SIMD fashion. The ROI coordinates are stored in an interleaved format in memory. The top-left and bottom-right x,y coordinates are de-interleaved into separate memories. Also, the ROIs are processed in batches of 16, so for each ROI batch we must determine the feature-map fetch dimensions that are needed to collect all the feature-map data that encompasses the ROI locations. This is accomplished in the following function:

    roiAlignScaleROICoordsAndCalcBatchFetchDims();

The information describing each ROI is in the following format: {batch_idx , roi_top_left_x_buf , roi_top_left_y_buf , roi_bottom_right_x_buf , roi_bottom_right_y_buf}

We want to deinterleave these into separate buffers for SIMD processing. We can do this via the Sampler APIs by configuring the sampler to perform 1D lookups that pick up every 5th entry and store the results consecutively in the output buffer.

        roiAlignSamplerDeinterleaveCoords(&sampler, &roi_buffer[n * ROI_INFO_ENTRY_SIZE], &roi_top_left_x_buf[n], 1,
                                          vecw, coord_deinterleave_tbl);
        cupvaSamplerStart(&sampler);
        cupvaSamplerWait();
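
To make the effect of the strided lookup concrete, here is a scalar host-side sketch of the same de-interleaving step. It is only a reference for the data movement, not the Sampler-based device implementation. `deinterleaveField` is a hypothetical helper, and `kRoiInfoEntrySize` is assumed to equal the tutorial's `ROI_INFO_ENTRY_SIZE` (5 entries per ROI).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed to match ROI_INFO_ENTRY_SIZE: {batch idx, x1, y1, x2, y2}.
constexpr std::size_t kRoiInfoEntrySize = 5;

// Copies one field (0 = batch idx, 1 = top-left x, ..., 4 = bottom-right y)
// of every interleaved ROI record into a contiguous array.
inline std::vector<float> deinterleaveField(const std::vector<float> &rois, std::size_t field)
{
    assert(field < kRoiInfoEntrySize);
    assert(rois.size() % kRoiInfoEntrySize == 0);
    std::vector<float> out;
    out.reserve(rois.size() / kRoiInfoEntrySize);
    for (std::size_t i = field; i < rois.size(); i += kRoiInfoEntrySize)
    {
        out.push_back(rois[i]); // every 5th entry, starting at `field`
    }
    return out;
}
```

Calling it once per field (1 through 4) would produce the four coordinate arrays that the vector code consumes.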

Since the ROI coordinates are set relative to the original input image’s dimensions, they must be scaled down to the dimensions of the feature-map. This is done via the following vector operations before they are stored back out into separate memories.

        // scale coords
        dvfloat *roi_sw;
        roi_sw                    = (dvfloat *)((void *)&roi_top_left_x_buf[n]);
        dvfloatx v_roi_top_left_x = sign_extend(*roi_sw);
        v_roi_top_left_x          = v_roi_top_left_x * v_spatical_scale;

        // store coords
        roi_sw  = (dvfloat *)((void *)&roi_top_left_x_buf[n]);
        *roi_sw = extract(v_roi_top_left_x);

        roi_sw  = (dvfloat *)((void *)&roi_start_global_w_buf[n]);
        *roi_sw = extract(v_roi_top_left_x);

The dimensions of each batch are calculated via a series of min/max operations.

        dvintx roi_top_left_x_int     = (dvintx)(v_roi_top_left_x);
        dvintx roi_top_left_y_int     = (dvintx)(v_roi_top_left_y);
        dvintx roi_bottom_right_x_int = ((dvintx)(v_roi_bottom_right_x)) + 1;
        dvintx roi_bottom_right_y_int = ((dvintx)(v_roi_bottom_right_y)) + 1;

        int32_t min_x = min(vminr_s(roi_top_left_x_int.lo), vminr_s(roi_top_left_x_int.hi));
        int32_t min_y = min(vminr_s(roi_top_left_y_int.lo), vminr_s(roi_top_left_y_int.hi));

        int32_t max_x = max(vmaxr_s(roi_bottom_right_x_int.lo), vmaxr_s(roi_bottom_right_x_int.hi));
        int32_t max_y = max(vmaxr_s(roi_bottom_right_y_int.lo), vmaxr_s(roi_bottom_right_y_int.hi));
        int32_t max_w = max_x - min_x + 3;
        int32_t max_h = max_y - min_y + 3; // the tile height spans the y extent

        if ((min_x + max_w) > width)
        {
            max_w = max_w - (max_w + min_x - width);
        }

        if ((min_y + max_h) > height)
        {
            max_h = max_h - (max_h + min_y - height);
        }

        if (max_w < 0)
            max_w = 1;

        if (max_h < 0)
            max_h = 1;

        dvfloatx roi_start_local_w = v_roi_top_left_x - (float)min_x;
        dvfloatx roi_start_local_h = v_roi_top_left_y - (float)min_y;

        dvfloat *roi_stw, *roi_sth;
        roi_stw  = (dvfloat *)((void *)&roi_top_left_x_buf[n]);
        *roi_stw = extract(roi_start_local_w);
        roi_sth  = (dvfloat *)((void *)&roi_top_left_y_buf[n]);
        *roi_sth = extract(roi_start_local_h);

        dvintx v_tmp_w, v_tmp_h, v_min_y;
        v_tmp_w.lo = replicatew(max_w);
        v_tmp_w.hi = replicatew(max_w);
        v_tmp_h.lo = replicatew(max_h);
        v_tmp_h.hi = replicatew(max_h);
        v_min_y.lo = replicatew(min_y);
        v_min_y.hi = replicatew(min_y);

        *(dvint *)((int32_t *)&feature_map_tile_h_buf[n]) = extract(v_tmp_h);
        *(dvint *)((int32_t *)&feature_map_tile_w_buf[n]) = extract(v_tmp_w);
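
The min/max reduction above can be easier to follow in scalar form. The sketch below is a host-side approximation under a few assumptions: the float-to-int casts truncate, the height is derived from the y extent, and the margin of 3 simply mirrors the device code. `ScaledRoi`, `FetchTile`, and `computeBatchFetchTile` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

struct ScaledRoi { float x1, y1, x2, y2; };   // coordinates in feature-map space
struct FetchTile { int32_t x, y, w, h; };     // top-left corner and size of the fetch

// Computes one tile that encloses every ROI in a batch, clamped to the feature map.
inline FetchTile computeBatchFetchTile(const std::vector<ScaledRoi> &rois, int32_t width, int32_t height)
{
    assert(!rois.empty());
    int32_t min_x = std::numeric_limits<int32_t>::max(), min_y = min_x;
    int32_t max_x = std::numeric_limits<int32_t>::min(), max_y = max_x;
    for (const ScaledRoi &r : rois)
    {
        min_x = std::min(min_x, static_cast<int32_t>(r.x1));
        min_y = std::min(min_y, static_cast<int32_t>(r.y1));
        max_x = std::max(max_x, static_cast<int32_t>(r.x2) + 1);
        max_y = std::max(max_y, static_cast<int32_t>(r.y2) + 1);
    }
    int32_t w = max_x - min_x + 3; // margin copied from the device code
    int32_t h = max_y - min_y + 3;
    if (min_x + w > width)  w = width - min_x;   // same as max_w - (max_w + min_x - width)
    if (min_y + h > height) h = height - min_y;
    if (w < 0) w = 1;
    if (h < 0) h = 1;
    return FetchTile{min_x, min_y, w, h};
}
```

On the device the same tile size is replicated across all lanes of the batch, so every ROI in a batch shares one fetch.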

The starting byte offset of each ROI batch within the feature map, which takes the batch index into account, is then computed and stored.

        dvintx roi_start_idx_vec = dvmulw(width_vec, v_min_y, 0);
        roi_start_idx_vec        = roi_start_idx_vec + min_x;

        dvintx batch_value                                       = dvmulw(batch_idx_vec, channel_vec, 0);
        batch_value                                              = dvmulw(batch_value, feature_map_res_vec, 0);
        batch_value                                              = batch_value + roi_start_idx_vec;
        batch_value                                              = dvmulw(batch_value, type_size_vec, 0);
        *(dvint *)((int32_t *)&feature_map_roi_start_idx_buf[n]) = extract(batch_value);
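
The vector arithmetic above reduces to a single address formula per ROI batch. The scalar sketch below assumes that `feature_map_res_vec` holds `width * height` and that `type_size_vec` holds `sizeof(float)`; neither definition is shown in this section, so treat both as assumptions. `roiBatchStartOffsetBytes` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset from the feature-map base to the top-left element of the
// fetch tile for a given image in the batch.
inline int64_t roiBatchStartOffsetBytes(int32_t batch_idx, int32_t channels, int32_t width, int32_t height,
                                        int32_t min_x, int32_t min_y)
{
    assert(batch_idx >= 0 && min_x >= 0 && min_y >= 0);
    const int64_t plane = static_cast<int64_t>(width) * height;            // assumed feature_map_res
    const int64_t elem  = static_cast<int64_t>(batch_idx) * channels * plane // skip earlier images
                          + static_cast<int64_t>(min_y) * width + min_x;   // row-major tile origin
    return elem * static_cast<int64_t>(sizeof(float));                     // assumed type size
}
```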

The ROI Align processing is done within a three-level loop on the VPU. The outermost loop iterates over ROI batches to be extracted from the input feature-map.

    int32_t roi_offset = 0;

    int32_t offset_precalc = sampling_ratio * sampling_ratio * pooled_width * pooled_height;
    int32_t n_rois         = output_size / channels / pooled_width / pooled_height;

    int32_t total_rois_processed = 0;

    while (roi_offset < n_rois)
    {
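
As a quick check of the loop bookkeeping, the sketch below reproduces the two derived quantities in plain C++ using integer parameters. The function names are hypothetical; the formulas are the ones shown above.

```cpp
#include <cassert>
#include <cstdint>

// Number of bilinear sampling points per ROI (offset_precalc in the device code).
inline int32_t samplesPerRoi(int32_t sampling_ratio, int32_t pooled_width, int32_t pooled_height)
{
    return sampling_ratio * sampling_ratio * pooled_width * pooled_height;
}

// Number of ROIs recovered from the total output size (n_rois in the device code).
inline int32_t roiCount(int32_t output_size, int32_t channels, int32_t pooled_width, int32_t pooled_height)
{
    assert(channels > 0 && pooled_width > 0 && pooled_height > 0);
    return output_size / channels / pooled_width / pooled_height;
}
```

With the tutorial's default command-line values (64 ROIs, 4 channels, 4 x 4 pooling, sampling ratio 4), this gives 256 sampling points per ROI and an ROI count of 64.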

The base address of the feature-map batch is set via a call to:

    uint64_t feature_map_addr = feature_map_src_dram_addr + feature_map_roi_start_idx_buf2[roi_start_offset];
    cupvaSQDFUpdateAddr(vpu_cfg_tbl, 0, feature_map_addr, MEMTYPE_DRAM, width * sizeof(float),
                        feature_map_base_addr, MEMTYPE_VMEM, feature_map_tile_w_buf[roi_idx] * sizeof(float));
    cupvaSQDFUpdateTileSize(vpu_cfg_tbl, 0, feature_map_tile_w_buf[roi_idx] * sizeof(float), feature_map_tile_h_buf[roi_idx]);

We pre-calculate the DLUT indices for bilinear sampling via the following function call:

        int32_t num_roi_processed_in_batch = roiAlignConvertROICoordsToDLUTSamplingCoords(
            roi_top_left_y_buf, roi_top_left_x_buf, bin_size_h_buf, bin_size_w_buf, n_rois, roi_offset);

In this function, the following code iterates along the bin width and height (pw x ph), and writes out the indices computed for each bin to the roi_feature_map_sampler_sampling_indices buffer.

    // Load the roi start and size information across 16 rois
    dvfloatx v_roi_top_left_y = sign_extend(*((dvfloat *)(&roi_top_left_y_buf[n])));
    dvfloatx v_bin_size_h     = sign_extend(*((dvfloat *)(&bin_size_h_buf[n])));
    dvfloatx v_roi_top_left_x = sign_extend(*((dvfloat *)(&roi_top_left_x_buf[n])));
    dvfloatx v_bin_size_w     = sign_extend(*((dvfloat *)(&bin_size_w_buf[n])));

    dvfloat *roi_ptr;
    roi_ptr                       = (dvfloat *)((void *)&roi_start_global_h_buf[n]);
    dvfloatx v_roi_start_global_h = sign_extend(*roi_ptr);
    roi_ptr                       = (dvfloat *)((void *)&roi_start_global_w_buf[n]);
    dvfloatx v_roi_start_global_w = sign_extend(*roi_ptr);

    for (int32_t ph = 0; ph < pooled_height; ph++) // iterate along bin height
    {
        for (int32_t pw = 0; pw < pooled_width; pw++) // iterate along bin width
        {
            for (int32_t iy = 0; iy < sampling_ratio; iy++) // get the sampling indices for each bin
            {

                dvfloatx vtemp =
                    (v_bin_size_h * (float)ph) + ((v_bin_size_h * (iy + .5f)) * (1.0 / (float)sampling_ratio));

                dvfloatx yy = v_roi_top_left_y + vtemp;

                dvfloatx yy_oor = v_roi_start_global_h + vtemp;

                check1 = yy_oor < 0.0;
                yy     = dvmux(check1, 0.0, yy);
                check2 = yy_oor >= (float)(height - 1);
                yy     = dvmux(check2, (dvfloatx)((dvintx)yy), yy);

                check1_oor = yy_oor < -1.0;
                yy         = dvmux(check1_oor, 5000.0, yy);
                check2_oor = yy_oor > (height);
                yy         = dvmux(check2_oor, 5000.0, yy);

                dvintx y_index = (dvintx)(yy * idx_out_qbit);

                for (int32_t ix = 0; ix < sampling_ratio; ix++)
                {
                    vtemp = (v_bin_size_w * (float)pw) + ((v_bin_size_w * (ix + .5f)) * (1.0 / (float)sampling_ratio));

                    dvfloatx xx = v_roi_top_left_x + vtemp;

                    dvfloatx xx_oor = v_roi_start_global_w + vtemp;

                    check1 = xx_oor < 0.0;
                    xx     = dvmux(check1, 0.0, xx);
                    check2 = xx_oor >= (float)(width - 1);
                    xx     = dvmux(check2, (dvfloatx)((dvintx)xx), xx);

                    check1_oor = xx_oor < -1.0;
                    xx         = dvmux(check1_oor, 5000.0, xx);
                    check2_oor = xx_oor > (width);
                    xx         = dvmux(check2_oor, 5000.0, xx);

                    dvintx x_index = (dvintx)(xx * idx_out_qbit);

                    vstore_transp(x_index, out_idx_x);
                    vstore_transp(y_index, out_idx_y);
                }
            }
        }
    }
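
For one lane of the vector code, the coordinate calculation can be sketched in scalar C++ as follows. This is an approximation for reading purposes: it assumes `idx_out_qbit` is the scale factor 2^INDEX_QBIT and that the float-to-int casts truncate. The sentinel value 5000 is copied from the device code. `sampleCoord`, `toFixedIndex`, and `kOutOfRangeCoord` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr float kOutOfRangeCoord = 5000.0f; // sentinel used by the device code

// Coordinate of one sampling point along one axis, relative to the fetched tile.
// `roi_start_local` is relative to the tile, `roi_start_global` to the feature map.
inline float sampleCoord(float roi_start_local, float roi_start_global, float bin_size, int bin_idx,
                         int sample_idx, int sampling_ratio, int extent)
{
    assert(sampling_ratio > 0);
    const float offset = bin_size * static_cast<float>(bin_idx) +
                         bin_size * (static_cast<float>(sample_idx) + 0.5f) / static_cast<float>(sampling_ratio);
    float local        = roi_start_local + offset;
    const float global = roi_start_global + offset;

    if (global < 0.0f) local = 0.0f;                                  // clamp to the first row/column
    if (global >= static_cast<float>(extent - 1))
        local = static_cast<float>(static_cast<int32_t>(local));      // drop the fraction at the far edge
    if (global < -1.0f || global > static_cast<float>(extent))
        local = kOutOfRangeCoord;                                     // fully outside the feature map
    return local;
}

// Converts a coordinate to a fixed-point index with `frac_bits` fractional bits.
inline int32_t toFixedIndex(float coord, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits < 31);
    return static_cast<int32_t>(coord * static_cast<float>(1 << frac_bits));
}
```

Points that land on the sentinel coordinate fall outside the tile, so the sampler's out-of-range mode (`SAMPLER_OUT_OF_RANGE_CONSTANT` in the setup shown later) determines their value.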

The output of this function is used in the inner-most loop when the Sampler task is started for the ROI feature-map sampling.

The mid-level loop iterates over each channel in the batch.
```
        for (int32_t c = 0; c < channels; c++)
        {
```
At this level, we make updates to the ROI fetch locations via the SQDF and trigger their transfers, and we convert the transfer data to fixed-point for bilinear sampling via Sampler APIs.

We update the SQDF config table with the address of the current channel in the batch via the following function call:

            roiAlignUpdateSQDFSrcAndDstAddr(c * ROIS_PER_BATCH, total_rois_processed, feature_map_base_addr,
                                           (uint64_t)(feature_map_dram_addr.base + feature_map_dram_addr.offset));

In this function, the start address of each batch channel is calculated as follows:

    uint64_t feature_map_addr = feature_map_src_dram_addr + feature_map_roi_start_idx_buf2[roi_start_offset];
    cupvaSQDFUpdateAddr(vpu_cfg_tbl, 0, feature_map_addr, MEMTYPE_DRAM, width * sizeof(float),
                        feature_map_base_addr, MEMTYPE_VMEM, feature_map_tile_w_buf[roi_idx] * sizeof(float));
    cupvaSQDFUpdateTileSize(vpu_cfg_tbl, 0, feature_map_tile_w_buf[roi_idx] * sizeof(float), feature_map_tile_h_buf[roi_idx]);

Once the addresses are updated in the config table we trigger the transfer of the feature-map batch for the current channel and wait for its completion.
```
            cupvaSQDFFlushAndTrig(vpu_cfg_tbl);
            cupvaSQDFSync(vpu_cfg_tbl);
```

The following vector loop converts the feature-map data to Q format, since the DLUT unit entries must be fixed-point.

            for (i = 0; (i + vecw - 1) < feature_map_size; i += vecw) chess_loop_range(3, )
            chess_unroll_loop(3)
            chess_prepare_for_pipelining
            {
                in_pixel  = dvfloat_load(in);
                out_pixel = (dvintx)(in_pixel * (map_out_qbit));

                vstore(out_pixel, out);
            }
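
The per-element operation in that loop is a plain float to fixed-point conversion. A scalar sketch, assuming `map_out_qbit` is the scale factor 2^frac_bits and that the cast truncates toward zero (the device rounding behavior is not stated here), could look like this. `floatToQ` and `qToFloat` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Converts a float to a signed fixed-point value with `frac_bits` fractional bits.
inline int32_t floatToQ(float value, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits < 31);
    return static_cast<int32_t>(value * static_cast<float>(1 << frac_bits)); // truncates toward zero
}

// Converts a fixed-point value back to float, e.g. after pooling.
inline float qToFloat(int32_t q, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits < 31);
    return static_cast<float>(q) / static_cast<float>(1 << frac_bits);
}
```

The choice of fractional bits trades range for precision: more bits preserve small feature values but reduce the largest magnitude that fits in 32 bits.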

The inner-most loop iterates over each ROI in the feature-map batch for the current channel.

            for (int32_t roi_idx = 0; roi_idx < num_roi_processed_in_batch; roi_idx++)
            {

We now follow the same DLUT initialization step that was done in the previous tutorial via Sampler APIs.

                CupvaSamplerInput2D const input = {
                    .data           = feature_map_buf_fixed_pt,
                    .type           = SAMPLER_INPUT_TYPE_S32,
                    .width          = (uint32_t)(feature_map_tile_w_buf[total_rois_processed + roi_idx] - 1),
                    .height         = (uint32_t)(feature_map_tile_h_buf[total_rois_processed + roi_idx] - 1),
                    .linePitch      = (uint32_t)feature_map_tile_w_buf[total_rois_processed + roi_idx],
                    .outOfRangeMode = SAMPLER_OUT_OF_RANGE_CONSTANT,
                };

                CupvaSamplerIndices2D const indices = {
                    .data   = &roi_feature_map_sampler_sampling_indices[(MAX_SAMPLING_POINTS * 2 + 1) * roi_idx],
                    .type   = SAMPLER_INDEX_TYPE_U32,
                    .width  = (uint16_t)offset_precalc,
                    .height = 1U,
                    .fractionalBits     = INDEX_QBIT,
                    .fractionalHandling = SAMPLER_FRAC_HANDLING_INTERPOLATE,
                    .interleaving       = SAMPLER_INTERLEAVING_ELEMENTS,
                };

                CupvaSamplerOutput const output = {
                    .data      = &interp_buf[0],
                    .transMode = TRANS_MODE_NONE,
                };

                cupvaSamplerSetup(&samplerInner, &input, &indices, &output);

                cupvaSamplerStart(&samplerInner);
                cupvaSamplerWait();
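
With `SAMPLER_FRAC_HANDLING_INTERPOLATE` and 2D indices, the sampler performs bilinear interpolation at each fractional coordinate. The following scalar float reference shows the arithmetic that this corresponds to. It is a sketch for checking results on the host, not a description of the hardware datapath, and `bilinearSample` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bilinearly samples a row-major tile at the fractional position (x, y).
inline float bilinearSample(const std::vector<float> &tile, int width, int height, float x, float y)
{
    assert(width > 0 && height > 0);
    assert(tile.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    assert(x >= 0.0f && y >= 0.0f && x <= static_cast<float>(width - 1) && y <= static_cast<float>(height - 1));

    const int x0 = static_cast<int>(x);
    const int y0 = static_cast<int>(y);
    const int x1 = std::min(x0 + 1, width - 1);  // stay inside the tile at the right edge
    const int y1 = std::min(y0 + 1, height - 1); // stay inside the tile at the bottom edge
    const float fx = x - static_cast<float>(x0);
    const float fy = y - static_cast<float>(y0);

    const float top    = tile[y0 * width + x0] * (1.0f - fx) + tile[y0 * width + x1] * fx;
    const float bottom = tile[y1 * width + x0] * (1.0f - fx) + tile[y1 * width + x1] * fx;
    return top * (1.0f - fy) + bottom * fy;
}
```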

Average pooling is performed to create the fixed-size output feature-map for each ROI.
```
                roiAlign4x4AvePoolQ();
```
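
The pooling step itself is an average over the sampling points of each bin. The float sketch below assumes that the samples arrive bin by bin, with all points of one bin stored consecutively (the order in which the index loop generates them); the device routine works on Q-format data and its actual memory layout may differ. `averagePoolBins` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Averages each consecutive group of `samples_per_bin` values into one output.
inline std::vector<float> averagePoolBins(const std::vector<float> &samples, std::size_t samples_per_bin)
{
    assert(samples_per_bin > 0);
    assert(samples.size() % samples_per_bin == 0);
    std::vector<float> pooled(samples.size() / samples_per_bin, 0.0f);
    for (std::size_t bin = 0; bin < pooled.size(); ++bin)
    {
        float sum = 0.0f;
        for (std::size_t s = 0; s < samples_per_bin; ++s)
        {
            sum += samples[bin * samples_per_bin + s];
        }
        pooled[bin] = sum / static_cast<float>(samples_per_bin);
    }
    return pooled;
}
```

For the tutorial's default parameters, `samples_per_bin` would be 4 x 4 = 16 and each ROI would produce 16 pooled values per channel.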

We trigger the transfer of the pooled data.

                cupvaSQDFTrig(pooledFeatureMapTrig);
                cupvaSQDFSync(pooledFeatureMapTrig);

Once all of the ROI batches have been processed, the SQDF stream is closed as follows:
```
        }
    }
    cupvaSQDFClose(vpu_cfg_tbl);
```

Host Code

C++

We begin the host code by creating syncpoints and the stream objects for our ROI Align application.

        SyncObj postSync = SyncObj::Create(true);
        Fence postFence{postSync};

        Stream stream = Stream::Create();

Buffers to hold the input and output data are allocated in DRAM using the cuPVA memory alloc API.

        roiAlignParamsDevice.feature_map = (void *)mem::Alloc(
            (roiAlignParamsDevice.channels * roiAlignParamsDevice.height * roiAlignParamsDevice.width) * sizeof(float));
        roiAlignParamsDevice.roi_buffer =
            (void *)mem::Alloc((roiAlignParamsDevice.num_rois * ROI_INFO_ENTRY_SIZE * sizeof(float)));
        roiAlignParamsDevice.output_data = (void *)mem::Alloc(
            (roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels * roiAlignParamsDevice.pooled_width *
             roiAlignParamsDevice.pooled_height * sizeof(float)));

This application’s inputs (feature-map & ROI coordinates) are defined in CSV files located in the tutorial/assets directory. The input data is copied into the DRAM memory we just allocated via the ReadCSVFloatBuffer() function from “ImageIO.h” utils.

        // Get pointer to input arrays so test data can be copied into them
        roiAlignParamsHost.feature_map = (void *)mem::GetHostPointer(roiAlignParamsDevice.feature_map);
        roiAlignParamsHost.roi_buffer  = (void *)mem::GetHostPointer(roiAlignParamsDevice.roi_buffer);

        float *roi_buf_ptr = (float *)roiAlignParamsHost.roi_buffer;
        if (ReadCSVFloatBuffer(roiBufferData.c_str(), assetsDirectory, roi_buf_ptr,
                               roiAlignParamsDevice.num_rois * ROI_INFO_ENTRY_SIZE) < 0)
        {
            return 0;
        }

        float *feature_map_ptr = (float *)roiAlignParamsHost.feature_map;
        if (ReadCSVFloatBuffer(
                featureMapData.c_str(), assetsDirectory, feature_map_ptr,
                roiAlignParamsDevice.width * roiAlignParamsDevice.height * roiAlignParamsDevice.channels) < 0)
        {
            return 0;
        }

The expected output (sampled ROI feature-map data) is defined in a CSV file located in the tutorial/assets directory. The output reference data is loaded into host memory via ReadCSVFloatBuffer() for comparison against the VPU generated output.

        // Get pointer to output array for comparison against reference
        roiAlignParamsHost.output_data = (void *)mem::GetHostPointer(roiAlignParamsDevice.output_data);
        float output_value_c_ref[4096];

        roiAlignParamsDevice.output_size = roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels *
                                           roiAlignParamsDevice.pooled_width * roiAlignParamsDevice.pooled_height;

        if (roiAlignParamsDevice.output_size > 4096)
        {
            std::cout << "Error: output size greater than maximum allowed\n";
            return 0;
        }

        if (ReadCSVFloatBuffer(outputDataRef.c_str(), assetsDirectory, output_value_c_ref,
                               roiAlignParamsDevice.output_size) < 0)
        {
            return 0;
        }

The ROI Align algorithm’s parameters are passed to the VPU by the host. These parameters are defined via the following command line options:
```
"-num_roi 64 -pw 4 -ph 4 -ch 4 -scale 0.25 -sampling_ratio 4 -fw 204 -fh 128 -batch 1"
```
Note

The command-line parameter values shown above produce the expected outputs that are verified at the end of the program run. You can experiment with other values, but the outputs will then no longer match the reference data used in the verification portion of the test.

The parameters are set in the command program object via the following initialization function:

        // Setup Dataflows & assign algo parameters for device code to use
        cupva::Executable exec    = CreateROIAlignExec();
        cupva::CmdProgram cmdProg = CreateROIAlignProg(exec, &roiAlignParamsDevice);

The Executable and CmdProgram objects are created in the same way as in the previous tutorials.

cupva::Executable CreateROIAlignExec()
{
    return Executable::Create(PVA_EXECUTABLE_DATA(roi_align_dev),
                              PVA_EXECUTABLE_SIZE(roi_align_dev));
}

cupva::CmdProgram CreateROIAlignProg(const Executable &exec, RoiAlignParams *params)
{

    cupva::CmdProgram m_cmdProg = CmdProgram::Create(exec);

Since the device-side code needs to know the parameters for processing the feature-map, they are set in the program object by the host as follows:

    m_cmdProg["output_size"]    = (int)params->output_size;
    m_cmdProg["spatial_scale"]  = (float)params->spatial_scale;
    m_cmdProg["channels"]       = (int)params->channels;
    m_cmdProg["height"]         = (int)params->height;
    m_cmdProg["width"]          = (int)params->width;
    m_cmdProg["pooled_height"]  = (int)params->pooled_height;
    m_cmdProg["pooled_width"]   = (int)params->pooled_width;
    m_cmdProg["sampling_ratio"] = (float)params->sampling_ratio;
    m_cmdProg["batch_count"]    = (int)params->batch_count;

The base addresses are set for the input and output data arrays here:

    m_cmdProg["feature_map_dram_addr"] = params->feature_map;

We now configure the DataFlows for transferring the buffer of ROI coordinates and output feature maps. These buffers are transferred via the cuPVA Sequence DataFlow (SQDF) APIs.

    int32_t num_rois = params->num_rois;

    auto coords_trig  = m_cmdProg["coords_trig"];
    float *roi_buffer = m_cmdProg["roi_buffer"].ptr<float>();

    auto pooledFeatureMapTrig = m_cmdProg["pooledFeatureMapTrig"];
    float *output_data        = m_cmdProg["output_data"].ptr<float>();

    SequenceDataFlow &roiBufferDF = m_cmdProg.addDataFlowHead<SequenceDataFlow>().handler(coords_trig);
    roiBufferDF.addTransfer()
        .tile(num_rois * 5 * sizeof(float))
        .src(params->roi_buffer, 1)
        .dst(roi_buffer, 1)
        .mode(TransferModeType::CONTINUOUS);

    SequenceDataFlow &outputDataDF = m_cmdProg.addDataFlowHead<SequenceDataFlow>().handler(pooledFeatureMapTrig);
    outputDataDF.addTransfer()
        .src(output_data, 1)
        .srcDim1(ROIS_PER_BATCH, 0)
        .srcDim2(params->channels, 0)
        .srcDim3(num_rois / ROIS_PER_BATCH, 0)
        .dst(params->output_data, 1)
        .dstDim1(ROIS_PER_BATCH, params->channels * params->pooled_width * params->pooled_height * sizeof(float))
        .dstDim2(params->channels, params->pooled_width * params->pooled_height * sizeof(float))
        .dstDim3(num_rois / ROIS_PER_BATCH,
                ROIS_PER_BATCH * params->channels * params->pooled_width * params->pooled_height * sizeof(float))
        .tile(params->pooled_width * params->pooled_height * sizeof(float))
        .mode(TransferModeType::TILE);

    uint32_t roi_processed = num_rois / ROIS_PER_BATCH * ROIS_PER_BATCH;
    uint32_t roi_rem       = num_rois - roi_processed;
    if (roi_rem != 0)
    {
        outputDataDF.addTransfer()
            .src(output_data, 1)
            .srcDim1(roi_rem, 0)
            .srcDim2(params->channels, 0)
            .dst((float *)(params->output_data) +
                 (roi_processed * params->channels * params->pooled_width * params->pooled_height), 1)
            .dstDim1(roi_rem, params->channels * params->pooled_width * params->pooled_height * sizeof(float))
            .dstDim2(params->channels, params->pooled_width * params->pooled_height * sizeof(float))
            .tile(params->pooled_width * params->pooled_height * sizeof(float))
            .mode(TransferModeType::TILE);
    }
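
To see where each pooled tile lands in the DRAM output buffer, the destination strides above can be folded into one offset formula. The sketch below assumes that dimension 1 advances fastest (ROI within the batch), then dimension 2 (channel), then dimension 3 (batch), which matches the device loop order. `pooledTileDstOffset` is a hypothetical name, and the offset is expressed in float elements rather than bytes.

```cpp
#include <cassert>
#include <cstddef>

// Element offset in the output buffer for the n-th pooled tile that is triggered.
// pooled_elems = pooled_width * pooled_height.
inline std::size_t pooledTileDstOffset(std::size_t n, std::size_t rois_per_batch, std::size_t channels,
                                       std::size_t pooled_elems)
{
    assert(rois_per_batch > 0 && channels > 0);
    const std::size_t roi_in_batch = n % rois_per_batch;              // dim1: fastest
    const std::size_t channel      = (n / rois_per_batch) % channels; // dim2
    const std::size_t batch        = n / (rois_per_batch * channels); // dim3: slowest
    const std::size_t roi          = batch * rois_per_batch + roi_in_batch;
    return (roi * channels + channel) * pooled_elems;                 // [roi][channel][ph][pw] layout
}
```

Under these assumptions the output ends up ordered as [roi][channel][pooled_height][pooled_width], even though the device produces all ROIs of a batch for one channel before moving on to the next channel.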

In a real application, the locations of the ROIs are not known at compile time. Therefore, we must use the SequenceDataFlow (SQDF) APIs, as they allow the src/dst addresses of DMA transfers to be set at runtime. To use the SQDF APIs, we first declare the SQDF head.
```
    SequenceDataFlow &featureMapDF = m_cmdProg.addDataFlowHead<SequenceDataFlow>();
```
We then declare the sequence dataflow handle that the VPU uses to maintain the addresses and trigger the SQDF transfers.
```
    auto vpu_cfg_tbl = m_cmdProg["vpu_cfg_tbl"];
    featureMapDF.handler(vpu_cfg_tbl);
```

We now set the basic parameters of the SQDF transfer. We set the src pointer to be the base of the input feature-map buffer in DRAM, and the destination pointer to be the ping variant of the feature-map tile in VMEM.

    float *feature_map_buf = m_cmdProg["feature_map_buf"].ptr<float>();

    featureMapDF.addTransfer()
        .src(params->feature_map, params->width * sizeof(float))
        .dst(feature_map_buf, params->width * sizeof(float))
        .tile(params->width * sizeof(float), params->height)
        .mode(TransferModeType::CONTINUOUS);

The src, dst, and tile parameters are updated on the device side during runtime via some cuPVA SQDF APIs that update those fields in the SQDF transfer table.

Now that the dataflows have been declared, they are ready to be compiled on the host.
```
    m_cmdProg.compileDataFlows();
```

We now add a post fence for syncing with VPU completion, and submit the cmd stream to the VPU.

        CmdRequestFences f{postFence};
        cmdProg.updateDataFlows();
        stream.submit({&cmdProg, &f});

We wait for the VPU to finish processing using these sync points.
```
        postFence.wait();
```

We verify the output with the following mem comparison.

        int32_t errNum = 0;
        for (int num = 0; num < (roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels *
                                 roiAlignParamsDevice.pooled_width * roiAlignParamsDevice.pooled_height);
             num++)
        {
            float *output_data_ptr = (float *)roiAlignParamsHost.output_data;

            if (std::fabs(output_data_ptr[num] - output_value_c_ref[num]) > 0.01)
            {
                errNum = 1;
                std::cout << "\nMismatch at num " << num << " abs(opt-ref) -- " << output_data_ptr[num] << "\t"
                          << output_value_c_ref[num] << std::endl;
            }
        }
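
The same tolerance check can be wrapped in a small reusable helper, which makes it easier to report a mismatch count rather than a single flag. This is a hypothetical refactoring, not part of the tutorial sources; the 0.01 tolerance in the usage note is the value used above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Counts elements whose absolute difference from the reference exceeds `tol`.
inline std::size_t countMismatches(const std::vector<float> &out, const std::vector<float> &ref, float tol)
{
    assert(out.size() == ref.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        if (std::fabs(out[i] - ref[i]) > tol)
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A result of zero with `tol = 0.01f` would correspond to the passing case of the loop above. An absolute tolerance is needed because the device path goes through fixed-point conversion and interpolation.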

We then delete the allocated resources for cleanup.

        mem::Free(roiAlignParamsDevice.feature_map);
        mem::Free(roiAlignParamsDevice.roi_buffer);
        mem::Free(roiAlignParamsDevice.output_data);

The tutorial code is run on the command line as follows:

./roi_align_cpp -a <Tutorial Assets Directory Path> -num_roi 64 -pw 4 -ph 4 -ch 4 -scale 0.25 -sampling_ratio 4 -fw 204 -fh 128 -batch 1

You should see “Test Pass” reported upon successful execution of the code.

C

We begin the host code by creating syncpoints and the stream objects for our ROI Align application.

    cupvaError_t syncResourceErr = CUPVA_ERROR_NONE;
    cupvaSyncObj_t postSync;
    syncResourceErr = CupvaSyncObjCreate(&postSync, true, CUPVA_SIGNALER_WAITER, CUPVA_SYNC_YIELD);
    if (syncResourceErr != CUPVA_ERROR_NONE)
    {
        printf("ROI Align: Sync object creation failed = %d", syncResourceErr);
        return 1;
    }

    cupvaFence_t postFence;
    syncResourceErr = CupvaFenceInit(&postFence, postSync);
    if (syncResourceErr != CUPVA_ERROR_NONE)
    {
        printf("ROI Align: Fence creation failed = %d", syncResourceErr);
        return 1;
    }

    cupvaStream_t stream;
    syncResourceErr = CupvaStreamCreate(&stream, CUPVA_PVA0, CUPVA_VPU0);
    if (syncResourceErr != CUPVA_ERROR_NONE)
    {
        printf("ROI Align: Stream creation failed = %d", syncResourceErr);
        return 1;
    }
    cupvaCmdStatus_t cmdstatus[2] = {NULL};

Buffers to hold the input and output data are allocated in DRAM using the cuPVA memory alloc API.

    CupvaMemAlloc(
        (void **)&roiAlignParamsDevice.feature_map,
        (roiAlignParamsDevice.channels * roiAlignParamsDevice.height * roiAlignParamsDevice.width) * sizeof(float),
        CUPVA_READ_WRITE, CUPVA_ALLOC_DRAM);

    CupvaMemAlloc((void **)&roiAlignParamsDevice.roi_buffer,
                  (roiAlignParamsDevice.num_rois * ROI_INFO_ENTRY_SIZE * sizeof(float)), CUPVA_READ_WRITE,
                  CUPVA_ALLOC_DRAM);

    CupvaMemAlloc((void **)&roiAlignParamsDevice.output_data,
                  (roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels * roiAlignParamsDevice.pooled_width *
                   roiAlignParamsDevice.pooled_height * sizeof(float)),
                  CUPVA_READ_WRITE, CUPVA_ALLOC_DRAM);

This application’s inputs (feature-map & ROI coordinates) are defined in CSV files located in the tutorial/assets directory. The input data is copied into the DRAM memory we just allocated via the ReadCSVFloatBuffer() function from “ImageIO.h” utils.

    // Get pointer to input arrays so test data can be copied into them
    CupvaMemGetHostPointer((void **)&roiAlignParamsHost.feature_map, roiAlignParamsDevice.feature_map);
    CupvaMemGetHostPointer((void **)&roiAlignParamsHost.roi_buffer, roiAlignParamsDevice.roi_buffer);

    // Copy test data into allocated arrays in DRAM
    float *roi_buf_ptr = (float *)roiAlignParamsHost.roi_buffer;
    if (ReadCSVFloatBuffer(ROI_BUF_DATA_FILE, assetsDirectory, roi_buf_ptr,
                           roiAlignParamsDevice.num_rois * ROI_INFO_ENTRY_SIZE) < 0)
    {
        return 0;
    }

    float *feature_map_ptr = (float *)roiAlignParamsHost.feature_map;
    if (ReadCSVFloatBuffer(FEATURE_MAP_DATA_FILE, assetsDirectory, feature_map_ptr,
                           roiAlignParamsDevice.width * roiAlignParamsDevice.height * roiAlignParamsDevice.channels) <
        0)
    {
        return 0;
    }

The expected output (sampled ROI feature-map data) is defined in a CSV file located in the tutorial/assets directory. The output reference data is loaded into host memory via ReadCSVFloatBuffer() for comparison against the VPU generated output.

    CupvaMemGetHostPointer((void **)&roiAlignParamsHost.output_data, roiAlignParamsDevice.output_data);
    float output_value_c_ref[4096];
    roiAlignParamsDevice.output_size = roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels *
                                       roiAlignParamsDevice.pooled_width * roiAlignParamsDevice.pooled_height;

    if (roiAlignParamsDevice.output_size > 4096)
    {
        printf("Error: output size greater than maximum allowed\n");
        return 0;
    }

    if (ReadCSVFloatBuffer(OUTPUT_DATA, assetsDirectory, output_value_c_ref, 4096) < 0)
    {
        return 0;
    }
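The size check above can be factored into a small helper. The sketch below is an assumption-laden illustration rather than the tutorial's code: `RoiAlignOutputCount` is a hypothetical name, and the limit of 4096 is taken from the reference buffer length used above. With the tutorial's default parameters (64 ROIs, 4 channels, 4 x 4 pooled size) the product is exactly 4096, so the defaults fill the reference buffer completely.

```c
#include <assert.h>

/* Hypothetical helper (not part of cuPVA or the tutorial sources).
 * Returns the number of output floats, or -1 if any dimension is
 * non-positive or the product exceeds max_elements. */
static int RoiAlignOutputCount(int num_rois, int channels, int pooled_width, int pooled_height,
                               int max_elements)
{
    assert(max_elements > 0);

    const int dims[4] = {num_rois, channels, pooled_width, pooled_height};
    long long count   = 1;
    for (int i = 0; i < 4; ++i)
    {
        if (dims[i] <= 0)
        {
            return -1;
        }
        count *= dims[i];
        /* Checking after every multiply keeps the running product small,
         * so it cannot overflow the 64-bit accumulator. */
        if (count > max_elements)
        {
            return -1;
        }
    }
    return (int)count;
}
```

Multiplying the returned count by sizeof(float) gives the byte size of the output buffer (16384 bytes for the default parameters).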

The ROI Align algorithm’s parameters are passed to the VPU by the host. These parameters are defined via the following command line options:
```
"-num_roi 64 -pw 4 -ph 4 -ch 4 -scale 0.25 -sampling_ratio 4 -fw 204 -fh 128 -batch 1"
```
Note

The values of the command-line parameters shown above produce the expected outputs that are verified at the end of the program run. You can experiment with other values, but the outputs will then no longer match the reference data used in the verification portion of the test.
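The tutorial's own option parsing is not shown in this section. As a rough sketch of how such options could be mapped onto the algorithm parameters, assuming every option takes exactly one value, consider the following. `RoiAlignCliOptions` and `ParseRoiAlignOptions` are hypothetical names, and the mapping of each flag to a field (for example, -fw to the feature-map width) is inferred from the flag names rather than taken from the tutorial sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parsed options (not the tutorial's struct). */
typedef struct
{
    int num_rois, pooled_width, pooled_height, channels, width, height, batch_count;
    float spatial_scale, sampling_ratio;
} RoiAlignCliOptions;

/* Parses "-key value" pairs. argv must NOT include the program name.
 * Returns 0 on success, -1 on an unknown option or a missing value. */
static int ParseRoiAlignOptions(int argc, char **argv, RoiAlignCliOptions *opts)
{
    assert(opts != NULL);

    for (int i = 0; i < argc; i += 2)
    {
        if (i + 1 >= argc)
        {
            return -1; /* option without a value */
        }
        const char *key = argv[i];
        const char *val = argv[i + 1];

        if (strcmp(key, "-num_roi") == 0)             opts->num_rois = atoi(val);
        else if (strcmp(key, "-pw") == 0)             opts->pooled_width = atoi(val);
        else if (strcmp(key, "-ph") == 0)             opts->pooled_height = atoi(val);
        else if (strcmp(key, "-ch") == 0)             opts->channels = atoi(val);
        else if (strcmp(key, "-fw") == 0)             opts->width = atoi(val);
        else if (strcmp(key, "-fh") == 0)             opts->height = atoi(val);
        else if (strcmp(key, "-batch") == 0)          opts->batch_count = atoi(val);
        else if (strcmp(key, "-scale") == 0)          opts->spatial_scale = (float)atof(val);
        else if (strcmp(key, "-sampling_ratio") == 0) opts->sampling_ratio = (float)atof(val);
        else return -1; /* unknown option */
    }
    return 0;
}
```

This sketch handles only the numeric options; a path-valued option such as -a would need its own string field.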

These parameters are set in the command program object via the following initialization function:

    // Setup Dataflows & assign algo parameters for device code to use
    cupvaCmd_t ROI_AlignCmdProg;
    cupvaExecutable_t ROI_AlignExec;

    Initialize(&roiAlignParamsDevice, &ROI_AlignCmdProg, &ROI_AlignExec);

The Executable and CmdProgram objects are created as in the previous tutorials.

    CupvaExecutableCreate(exec, PVA_EXECUTABLE_DATA(roi_align_dev),
                          PVA_EXECUTABLE_SIZE(roi_align_dev));
    CupvaCmdProgramCreate(cmdProg, *exec);

Since the device-side code needs to know the parameters for processing the feature-map, they are set in the program object by the host as follows:

    cupvaParameter_t param_output_size, param_spatial_scale, param_channels, param_height, param_width,
        param_pooled_height, param_pooled_width, param_sampling_ratio, param_batch_count, param_feature_map_dram_addr;

    int batch_count = params->batch_count;

    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_output_size, "output_size"));
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_spatial_scale, "spatial_scale"));
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_channels, "channels"));
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_height, "height"));
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_width, "width"));
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_pooled_height, "pooled_height"));
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_pooled_width, "pooled_width"));
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_sampling_ratio, "sampling_ratio"));
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &param_batch_count, "batch_count"));

    CUPVA_CHECK_ERROR_RETURN(
        CupvaParameterSetValueScalar(&param_output_size, &(params->output_size), sizeof(params->output_size)));
    CUPVA_CHECK_ERROR_RETURN(
        CupvaParameterSetValueScalar(&param_spatial_scale, &(params->spatial_scale), sizeof(params->spatial_scale)));
    CUPVA_CHECK_ERROR_RETURN(
        CupvaParameterSetValueScalar(&param_channels, &(params->channels), sizeof(params->channels)));
    CUPVA_CHECK_ERROR_RETURN(CupvaParameterSetValueScalar(&param_height, &(params->height), sizeof(params->height)));
    CUPVA_CHECK_ERROR_RETURN(CupvaParameterSetValueScalar(&param_width, &(params->width), sizeof(params->width)));
    CUPVA_CHECK_ERROR_RETURN(
        CupvaParameterSetValueScalar(&param_pooled_height, &(params->pooled_height), sizeof(params->pooled_height)));
    CUPVA_CHECK_ERROR_RETURN(
        CupvaParameterSetValueScalar(&param_pooled_width, &(params->pooled_width), sizeof(params->pooled_width)));
    CUPVA_CHECK_ERROR_RETURN(
        CupvaParameterSetValueScalar(&param_sampling_ratio, &(params->sampling_ratio), sizeof(params->sampling_ratio)));
    CUPVA_CHECK_ERROR_RETURN(CupvaParameterSetValueScalar(&param_batch_count, &batch_count, sizeof(batch_count)));

The base address in DRAM for the feature-map is set here:

    CUPVA_CHECK_ERROR_RETURN(
        CupvaCmdProgramGetParameter(cmdProg, &param_feature_map_dram_addr, "feature_map_dram_addr"));
    CUPVA_CHECK_ERROR_RETURN(CupvaParameterSetValuePointer(&param_feature_map_dram_addr, params->feature_map));

We now configure the DataFlows for transferring the buffer of ROI coordinates and output feature maps. These buffers are transferred via the cuPVA RasterDataFlow (RDF) APIs.
```
    SetupRoiAndOutputDF(params, cmdProg);
```

In a real application, the location of the ROIs is not known at compile time. Therefore, we must use the SequenceDataFlow (SQDF) APIs as they facilitate setting src/dst addresses of DMA transfers at runtime. To use the SQDF APIs we first declare the SQDF head.

    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramAddDataFlowHead(cmdProg, &featureMapDF, CUPVA_SEQUENCE_DATAFLOW, 0, 1.0F));
    cupvaSequenceDataFlowParams_t featureMapDFParams = {};
    cupvaSequenceDataFlowTransferParams_t featureMapDFTransferParams = {};

We then declare the sequence dataflow handle that the VPU uses to maintain the addresses and trigger the SQDF transfers.

    cupvaParameter_t vpu_cfg_tbl;
    CUPVA_CHECK_ERROR_RETURN(CupvaCmdProgramGetParameter(cmdProg, &vpu_cfg_tbl, "vpu_cfg_tbl"));
    featureMapDFParams.handler = &vpu_cfg_tbl;

We now set the basic parameters of the SQDF transfer. We set the src pointer to be the base of the input feature-map buffer in DRAM, and the destination pointer to be the ping variant of the feature-map tile in VMEM.

    featureMapDFTransferParams.ptrSrc       = params->feature_map;
    featureMapDFTransferParams.linePitchSrc = params->width * sizeof(float);
    featureMapDFTransferParams.ptrDst       = feature_map_buf;
    featureMapDFTransferParams.linePitchDst = params->width * sizeof(float);
    featureMapDFTransferParams.tileWidth    = params->width * sizeof(float);
    featureMapDFTransferParams.tileHeight   = params->height;
    featureMapDFTransferParams.transferMode = CUPVA_TRANSFER_MODE_CONTINUOUS;

    CupvaSequenceDataFlowSetParams(featureMapDF, &featureMapDFParams);
    CupvaSequenceDataFlowAddTransfer(featureMapDF, &featureMapDFTransferParams);

The src, dst, and tile parameters are updated on the device side at runtime via cuPVA SQDF APIs that update those fields in the SQDF transfer table.
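To make the runtime update more concrete, the sketch below shows one way the source offset and tile dimensions for a single ROI could be derived from its image-space coordinates. This is an illustration under stated assumptions, not the tutorial's device code: `RoiTile` and `ComputeRoiTile` are hypothetical names, the feature map is assumed to be a single row-major plane of floats, ROI coordinates are assumed non-negative, and the rounding (floor for the start, ceil for the end, clamped to the map) is one plausible choice that the actual implementation may not share.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of the tile to fetch for one ROI. */
typedef struct
{
    size_t src_offset_bytes; /* offset from the feature-map base address */
    int tile_width_bytes;    /* bytes per tile row */
    int tile_height;         /* number of rows */
} RoiTile;

/* Integer floor/ceil for non-negative values (avoids linking the math library). */
static int FloorNonNeg(float v) { return (int)v; }
static int CeilNonNeg(float v)
{
    int i = (int)v;
    return ((float)i < v) ? i + 1 : i;
}
static int ClampInt(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

static RoiTile ComputeRoiTile(float x1, float y1, float x2, float y2, float spatial_scale,
                              int width, int height)
{
    assert(width > 0 && height > 0 && spatial_scale > 0.0f);
    assert(x1 >= 0.0f && y1 >= 0.0f && x2 >= x1 && y2 >= y1);

    /* Map image coordinates to feature-map coordinates and clamp to the map. */
    int xs = ClampInt(FloorNonNeg(x1 * spatial_scale), 0, width - 1);
    int ys = ClampInt(FloorNonNeg(y1 * spatial_scale), 0, height - 1);
    int xe = ClampInt(CeilNonNeg(x2 * spatial_scale), 0, width - 1);
    int ye = ClampInt(CeilNonNeg(y2 * spatial_scale), 0, height - 1);

    RoiTile tile;
    tile.src_offset_bytes = ((size_t)ys * (size_t)width + (size_t)xs) * sizeof(float);
    tile.tile_width_bytes = (xe - xs + 1) * (int)sizeof(float);
    tile.tile_height      = ye - ys + 1;
    return tile;
}
```

For example, with a 204 x 128 feature map and a spatial scale of 0.25, an ROI spanning (40, 20) to (120, 100) in image coordinates maps to feature-map columns 10 through 30 and rows 5 through 25, which gives a byte offset of 4120, a tile width of 84 bytes, and a tile height of 21 rows.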

Now that the dataflows have been declared, they are ready to be compiled on the host.
```
    CupvaCmdProgramCompileDataFlows(cmdProg);
```

We now add a post fence for syncing with VPU completion, and submit the cmd stream to the VPU.

    cupvaCmd_t cmdFenceReq;
    CupvaCmdRequestFencesInit(&cmdFenceReq, fence, 1);
    cupvaError_t err = CUPVA_ERROR_NONE;

    cupvaCmd_t const *submitCmds[2] = {cmdProg, &cmdFenceReq};
    err = CupvaStreamSubmit(stream, submitCmds, cmdstatus, 2, orderType, execTimeoutUs, submitTimeoutUs);

    if (err != CUPVA_ERROR_NONE)
    {
        printf("\n Roi Align: failed to submit commands with %d", err);
    }

We wait for the VPU to finish processing using these sync points.

    bool fenceWaitStatus;
    CupvaFenceWait(&postFence, -1, &fenceWaitStatus);

    cupvaError_t statusCode = {CUPVA_ERROR_NONE};
    syncResourceErr         = CupvaCheckCommandStatus(cmdstatus[0], &statusCode);
    if (syncResourceErr != CUPVA_ERROR_NONE)
    {
        printf("ROI Align:command status error = %d\n", syncResourceErr);
    }

We verify the output by comparing each element against the reference data with an absolute tolerance of 0.01.

    int32_t errNum = 0;
    for (int num = 0; num < (roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels *
                             roiAlignParamsDevice.pooled_width * roiAlignParamsDevice.pooled_height);
         num++)
    {
        float *output_data_ptr = (float *)roiAlignParamsHost.output_data;

        if (fabs(output_data_ptr[num] - output_value_c_ref[num]) > 0.01)
        {
            errNum = 1;
            printf("\nMismatch at num %d (output, reference) -- %f %f", num, output_data_ptr[num], output_value_c_ref[num]);
        }
    }
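The same comparison can be written as a reusable helper that reports how many elements differ. The following is a minimal sketch, not the tutorial's code: `CountMismatches` is a hypothetical name. One deliberate difference from the loop above is that the test is written as `!(diff <= tolerance)`, so a NaN in either buffer is counted as a mismatch instead of silently passing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts elements whose absolute difference from the
 * reference exceeds tolerance. NaN differences are counted as mismatches. */
static int CountMismatches(const float *output, const float *reference, int count, float tolerance)
{
    assert(output != NULL && reference != NULL && count >= 0);

    int mismatches = 0;
    for (int i = 0; i < count; ++i)
    {
        float diff = output[i] - reference[i];
        if (diff < 0.0f)
        {
            diff = -diff; /* absolute value without the math library */
        }
        if (!(diff <= tolerance))
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A return value of zero corresponds to the passing case; any other value gives the number of elements to inspect.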

We then free the allocated memory and destroy the remaining objects to clean up.

    CupvaMemFree(roiAlignParamsDevice.feature_map);
    CupvaMemFree(roiAlignParamsDevice.roi_buffer);
    CupvaMemFree(roiAlignParamsDevice.output_data);

    CupvaStreamDestroy(stream);
    CupvaSyncObjDestroy(postSync);
    CupvaCmdDestroy(&ROI_AlignCmdProg);
    CupvaExecutableDestroy(ROI_AlignExec);

The tutorial code is run on the command line as follows:

./roi_align_c -a <Tutorial Assets Directory Path> -num_roi 64 -pw 4 -ph 4 -ch 4 -scale 0.25 -sampling_ratio 4 -fw 204 -fh 128 -batch 1

You should see “Test Pass” reported upon successful execution of the code.