ROI Align Optimization on PVA

This tutorial builds on the functional implementation of the ROI Align Layer from the ROI Align Layer on VPU Tutorial by adding the following optimizations:

PingPong Buffering for DMA
DMA Phases
Running Sampler tasks on DLUT concurrently with VPU

Device Code

Double Buffering Feature-Map ROI Batches

In the Sampler APIs Tutorial, the transfers of batched ROI input and output feature-map data were not overlapped with any VPU processing, so the latency of moving data between DRAM and VMEM was fully exposed. In this tutorial, we modify the code so that these transfers are triggered ahead of time and run while the VPU is processing other data.

We start by adding the following ping/pong buffers to store the feature-map ROI batch in VMEM.
```
VMEM(C, float, feature_map_buf_ping, (MAX_FEATURE_MAP_HEIGHT * MAX_FEATURE_MAP_WIDTH));
VMEM(B, float, feature_map_buf_pong, (MAX_FEATURE_MAP_HEIGHT * MAX_FEATURE_MAP_WIDTH));
int32_t *feature_map_buf_ping_fixed_pt = (int32_t *)(&feature_map_buf_ping[0]);
int32_t *feature_map_buf_pong_fixed_pt = (int32_t *)(&feature_map_buf_pong[0]);
```
These two buffers allow for processing of a feature-map batch for the current iteration in one of the memories while the transfer for the next iteration is occurring in the other memory.
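
The buffer alternation described above can be illustrated with a small host-side sketch. It is not cuPVA code: `fetchChunk` is a hypothetical stand-in for a DMA transfer and runs synchronously here, whereas on the PVA the transfer would proceed in the background while the VPU works. The sketch only shows how the next chunk is written into one buffer while the current chunk is consumed from the other.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a DMA transfer: copies chunk `chunk` from "DRAM"
// into a "VMEM" buffer. On real hardware this would run asynchronously.
static void fetchChunk(const std::vector<float> &dram, std::size_t chunk, std::size_t chunkLen, float *vmemBuf)
{
    std::memcpy(vmemBuf, dram.data() + chunk * chunkLen, chunkLen * sizeof(float));
}

// Sums all chunks of `dram`, alternating between two buffers in the same way
// the ping/pong feature-map buffers alternate between iterations.
float sumWithPingPong(const std::vector<float> &dram, std::size_t chunkLen)
{
    assert(chunkLen > 0 && dram.size() % chunkLen == 0);
    const std::size_t numChunks = dram.size() / chunkLen;
    std::vector<float> buf[2] = {std::vector<float>(chunkLen), std::vector<float>(chunkLen)};

    float total = 0.0f;
    if (numChunks == 0)
    {
        return total;
    }

    int current = 0; // index of the buffer that holds the chunk being processed
    fetchChunk(dram, 0, chunkLen, buf[current].data()); // prime the pipeline

    for (std::size_t c = 0; c < numChunks; ++c)
    {
        // Start filling the other buffer with the next chunk, if there is one.
        if (c + 1 != numChunks)
        {
            fetchChunk(dram, c + 1, chunkLen, buf[1 - current].data());
        }

        // Process the chunk that is already resident.
        for (float v : buf[current])
        {
            total += v;
        }

        // The buffer that was just filled becomes the current one.
        current = 1 - current;
    }
    return total;
}
```

Because the buffer being filled is never the buffer being read, the fetch for chunk `c + 1` can safely overlap the processing of chunk `c`.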

Updating the ping/pong buffer usage in between iterations is done via the pingpong_channel variable.

    while (roi_offset < n_rois)
    {
        pingpong_channel = !pingpong_channel;

We provide the SQDF with the destination address of its first feature-map ROI transfer of the current batch.

        roiAlignCalculateFeatureMapSQDFSrcAddr(roi_offset);
        int32_t feature_map_dst_pong_addr  = cupvaGetVmemAddress(feature_map_buf_pong);
        uint64_t feature_map_src_dram_addr = feature_map_dram_addr.base + feature_map_dram_addr.offset;
        roiAlignUpdateDFSrcAndDstAddr(0, total_rois_processed, pingpong_channel, feature_map_base_addr,
                                       feature_map_dst_pong_addr, feature_map_src_dram_addr);

Note that, in contrast with the Sampler APIs Tutorial, the address update function now takes into account which ping/pong buffer in VMEM is used as the transfer destination.

We now trigger the first SQDF transfer. Note that we use cupvaSQDFFlushAndTrig() to first update the transfer information, then trigger the dataflow.
```
        cupvaSQDFFlushAndTrig(vpu_cfg_tbl);
```

While the first feature-map transfer of the batch is occurring, we have the VPU calculate the DLUT sampling indices for the ROIs.

        int32_t num_roi_processed_in_batch = roiAlignConvertROICoordsToDLUTSamplingCoords(
            roi_top_left_y, roi_top_left_x, bin_size_h_buf, bin_size_w_buf, n_rois, roi_offset);

After calculating the indices, we check for completion of the first batch transfer.

        cupvaSQDFSync(vpu_cfg_tbl);

In the middle loop, we start by triggering the feature-map batch transfer for the next channel.

        for (int32_t c = 0; c < channels; c++)
        {
            if ((c + 1) != channels)
            {
                roiAlignUpdateDFSrcAndDstAddr((c + 1) * ROIS_PER_BATCH, total_rois_processed, (!pingpong_channel),
                                               feature_map_base_addr, feature_map_dst_pong_addr,
                                               feature_map_src_dram_addr);
                cupvaSQDFFlushAndTrig(vpu_cfg_tbl);
            }

            pingpong_input  = !pingpong_input;
            pingpong_output = pingpong_input;

            const float *offset_bottom_data;
            int32_t *offset_bottom_data_fixed_point;

            if (pingpong_channel == 0)
            {
                offset_bottom_data             = feature_map_buf_ping;
                offset_bottom_data_fixed_point = feature_map_buf_ping_fixed_pt;
            }
            else
            {
                offset_bottom_data             = feature_map_buf_pong;
                offset_bottom_data_fixed_point = feature_map_buf_pong_fixed_pt;
            }

While this transfer is occurring, the DLUT is sampling feature-map ROI data, and the VPU is performing average pooling.

            for (int32_t roi_idx = 0; roi_idx < num_roi_processed_in_batch; roi_idx++)
            {

                if ((roi_idx + 1) != num_roi_processed_in_batch)
                {
                    pingpong_input = !pingpong_input;

                    roiAlignBilinearInterpSamplerStart(&samplerInterp, offset_bottom_data,
                                                       offset_bottom_data_fixed_point, offset_precalc,
                                                       total_rois_processed, roi_idx + 1, pingpong_input, false);
                }

                roiAlign4x4AvePoolQ(pingpong_output * MAX_SAMPLING_POINTS);

                if ((roi_idx + 1) != num_roi_processed_in_batch)
                {
                    cupvaSamplerWait();
                }

                pingpong_output = !pingpong_output;

Once the Sampler task and the pooling operations have completed, we check that the transfer of the next iteration’s feature-map ROI data is completed.

            if ((c + 1) != channels)
            {
                cupvaSQDFSync(vpu_cfg_tbl);
            }

            pingpong_channel = !pingpong_channel;

Concurrent DLUT + VPU

The Sampler APIs are used in the ROI Align Layer implementation for two purposes:

Deinterleave the ROI location data into separate memory locations
Sample ROI feature-map data from fractional coordinates via bilinear interpolation

While the DLUT unit accelerates these operations, the fact that it can run concurrently with VPU processing provides additional opportunities for speedup at the application level.
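
As a point of reference for the second use, the following scalar sketch shows what bilinear interpolation at a fractional coordinate computes. It is a plain C++ illustration, not the DLUT implementation; the function name `bilinearSample`, the row-major layout, and the edge clamping are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Samples a row-major (height x width) map at the fractional position (y, x).
// The position must lie inside the map; neighbors past the last row or column
// are clamped to the edge (an assumption of this sketch).
float bilinearSample(const std::vector<float> &map, int height, int width, float y, float x)
{
    assert(height > 0 && width > 0);
    assert(map.size() == static_cast<std::size_t>(height) * static_cast<std::size_t>(width));
    assert(y >= 0.0f && y <= static_cast<float>(height - 1));
    assert(x >= 0.0f && x <= static_cast<float>(width - 1));

    const int y0 = static_cast<int>(std::floor(y));
    const int x0 = static_cast<int>(std::floor(x));
    const int y1 = (y0 + 1 < height) ? (y0 + 1) : y0;
    const int x1 = (x0 + 1 < width) ? (x0 + 1) : x0;

    // Fractional parts act as the interpolation weights.
    const float wy = y - static_cast<float>(y0);
    const float wx = x - static_cast<float>(x0);

    const float top    = (1.0f - wx) * map[y0 * width + x0] + wx * map[y0 * width + x1];
    const float bottom = (1.0f - wx) * map[y1 * width + x0] + wx * map[y1 * width + x1];
    return (1.0f - wy) * top + wy * bottom;
}
```

For a 2x2 map holding {0, 1, 2, 3}, sampling at (0.5, 0.5) blends all four neighbors equally.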

The ROI coordinate information is stored in an interleaved format as follows: { batch_idx , roi_top_left_x , roi_top_left_y , roi_bottom_right_x , roi_bottom_right_y }

In the roiAlignScaleROICoordsAndCalcBatchFetchDims() function, coordinate entries are deinterleaved (via the DLUT) into the following separate memories:
```
VMEM(A, float, roi_top_left_x, MAX_ROIS);
VMEM(A, float, roi_top_left_y, MAX_ROIS);
VMEM(A, float, roi_start_global_w_buf, MAX_ROIS);
VMEM(A, float, roi_start_global_h_buf, MAX_ROIS);
VMEM(A, float, roi_bottom_right_x_buf, MAX_ROIS);
VMEM(A, float, roi_bottom_right_y_buf, MAX_ROIS);
```
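
To make the data movement concrete, here is a scalar sketch of deinterleaving one field out of the five-entry ROI records and then scaling it. It runs on the host and does not use the Sampler APIs; `deinterleaveField`, `scaleCoords`, and `kRoiEntrySize` are hypothetical names introduced for this illustration. Field indices follow the record layout given above, so index 1 selects roi_top_left_x and index 2 selects roi_top_left_y.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each ROI record holds { batch_idx, top_left_x, top_left_y, bottom_right_x, bottom_right_y }.
constexpr std::size_t kRoiEntrySize = 5;

// Copies field `field` (0..4) of every ROI record into a contiguous array.
std::vector<float> deinterleaveField(const std::vector<float> &roiBuffer, std::size_t field)
{
    assert(field < kRoiEntrySize);
    assert(roiBuffer.size() % kRoiEntrySize == 0);

    const std::size_t numRois = roiBuffer.size() / kRoiEntrySize;
    std::vector<float> out(numRois);
    for (std::size_t i = 0; i < numRois; ++i)
    {
        out[i] = roiBuffer[i * kRoiEntrySize + field];
    }
    return out;
}

// Scales every coordinate in place, as the VPU does with the spatial scale.
void scaleCoords(std::vector<float> &coords, float spatialScale)
{
    for (float &c : coords)
    {
        c *= spatialScale;
    }
}
```

In the optimized device code these two steps are assigned to different units, which is what allows the deinterleaving of one field to overlap with the scaling of another.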

We optimize the performance of this function by overlapping the scaling of the coordinate list with its deinterleaving. As the first step, we set the parameters of the Sampler task for deinterleaving the top-left x coordinates, start the Sampler task, and then wait for its completion.

        roiAlignSamplerDeinterleaveCoords(&sampler, &roi_buffer[n * ROI_INFO_ENTRY_SIZE], &roi_top_left_x[n], 1, vecw,
                                          coord_deinterleave_tbl);
        cupvaSamplerStart(&sampler);
        cupvaSamplerWait();

We then set the Sampler task parameters for deinterleaving the top-left y coordinates and start that task.

        roiAlignSamplerDeinterleaveCoords(&sampler, &roi_buffer[n * ROI_INFO_ENTRY_SIZE], &roi_top_left_y[n], 2, vecw,
                                          coord_deinterleave_tbl);
        cupvaSamplerStart(&sampler);

While the Sampler task for top-left y deinterleaving is running, we scale the top-left x coordinates on the VPU.

        dvfloat *roi_sw;
        roi_sw                    = (dvfloat *)((void *)&roi_top_left_x[n]);
        dvfloatx v_roi_top_left_x = sign_extend(*roi_sw);
        v_roi_top_left_x          = v_roi_top_left_x * v_spatical_scale;

        roi_sw  = (dvfloat *)((void *)&roi_top_left_x[n]);
        *roi_sw = extract(v_roi_top_left_x);

        roi_sw  = (dvfloat *)((void *)&roi_start_global_w_buf[n]);
        *roi_sw = extract(v_roi_top_left_x);

After the scaling operations on the VPU have finished, we check to see if the Sampler task has completed.
```
        cupvaSamplerWait();
```
At this point we have deinterleaved the top-left x/y coordinates and scaled the top-left x coordinates. The same DLUT+VPU overlap pattern is followed to deinterleave and scale the remaining entries in the coordinate list.

Since ROI data from the feature-map could be located at indices with a fractional component, bilinear interpolation (via DLUT) is used to obtain their values. We overlap the DLUT bilinear interpolation task with VPU average pooling by doing the interpolation for the points in the next iteration while performing average pooling on the interpolated data for the current iteration. Here we initiate the DLUT bilinear interpolation task for the next iteration’s feature-map ROI data:

            for (int32_t roi_idx = 0; roi_idx < num_roi_processed_in_batch; roi_idx++)
            {

                if ((roi_idx + 1) != num_roi_processed_in_batch)
                {
                    pingpong_input = !pingpong_input;

                    roiAlignBilinearInterpSamplerStart(&samplerInterp, offset_bottom_data,
                                                       offset_bottom_data_fixed_point, offset_precalc,
                                                       total_rois_processed, roi_idx + 1, pingpong_input, false);
                }

While this Sampler task is running, we perform average pooling on the current iteration’s feature-map ROI data.
```
                roiAlign4x4AvePoolQ(pingpong_output * MAX_SAMPLING_POINTS);
```

After pooling has completed, we trigger the transfer of its output and wait for the completion of the Sampler task.

                cupvaSQDFTrig(pooledFeatureMapTrig);

                if ((roi_idx + 1) != num_roi_processed_in_batch)
                {
                    cupvaSamplerWait();
                }

                pingpong_output = !pingpong_output;

We now wait for the completion of the output transfer.

                cupvaSQDFSync(pooledFeatureMapTrig);
            }
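
For reference, the pooling step in the loop above reduces the interpolated sampling points of each output bin to their mean. The sketch below shows that reduction in scalar form. It is not the roiAlign4x4AvePoolQ() implementation: the function name `averagePoolBins` and the assumption that the samples of one bin are stored contiguously are introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Averages the sampling points of each bin. `samples` is assumed to store the
// points of bin 0 first, then bin 1, and so on, with the same count per bin.
std::vector<float> averagePoolBins(const std::vector<float> &samples, std::size_t numBins)
{
    assert(numBins > 0 && !samples.empty());
    assert(samples.size() % numBins == 0);

    const std::size_t perBin = samples.size() / numBins;
    std::vector<float> pooled(numBins, 0.0f);
    for (std::size_t b = 0; b < numBins; ++b)
    {
        float sum = 0.0f;
        for (std::size_t s = 0; s < perBin; ++s)
        {
            sum += samples[b * perBin + s];
        }
        pooled[b] = sum / static_cast<float>(perBin);
    }
    return pooled;
}
```

With a 4x4 pooled output, `numBins` would be 16, and each call corresponds to one ROI of one channel.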

Once all of the ROI batches have been processed, the SQDF stream is closed as follows:
```
        }
    }
    cupvaSQDFClose(vpu_cfg_tbl);
```

Host Code

C++

We begin the host code by creating syncpoints and the stream objects for our ROI Align application.

        SyncObj postSync = SyncObj::Create(true);
        Fence postFence{postSync};

        Stream stream = Stream::Create();

Input and output reference data are loaded into host memory in the same manner as ROI Align Layer on VPU Tutorial. Buffers to hold the input and output data are allocated in DRAM using the cuPVA memory alloc API.

        roiAlignParamsDevice.feature_map = (void *)mem::Alloc(
            (roiAlignParamsDevice.channels * roiAlignParamsDevice.height * roiAlignParamsDevice.width) * sizeof(float));
        roiAlignParamsDevice.roi_buffer =
            (void *)mem::Alloc((roiAlignParamsDevice.num_rois * ROI_INFO_ENTRY_SIZE * sizeof(float)));
        roiAlignParamsDevice.output_data = (void *)mem::Alloc(
            (roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels * roiAlignParamsDevice.pooled_width *
             roiAlignParamsDevice.pooled_height * sizeof(float)));
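
As a quick sanity check on these allocations, the sketch below computes the three buffer sizes from the layer parameters. It is a standalone illustration: the struct and function names are hypothetical, and the ROI entry size of 5 floats is taken from the five-field ROI record described in the device code section.

```cpp
#include <cassert>
#include <cstddef>

// Sizes, in bytes, of the three DRAM buffers used by the tutorial.
struct RoiAlignBufferSizes
{
    std::size_t featureMapBytes;
    std::size_t roiBufferBytes;
    std::size_t outputBytes;
};

// Five floats per ROI record: batch index plus two corner coordinates.
constexpr std::size_t kRoiInfoEntrySize = 5;

RoiAlignBufferSizes computeBufferSizes(std::size_t numRois, std::size_t channels, std::size_t height,
                                       std::size_t width, std::size_t pooledHeight, std::size_t pooledWidth)
{
    assert(numRois > 0 && channels > 0 && height > 0 && width > 0);
    assert(pooledHeight > 0 && pooledWidth > 0);

    RoiAlignBufferSizes sizes;
    sizes.featureMapBytes = channels * height * width * sizeof(float);
    sizes.roiBufferBytes  = numRois * kRoiInfoEntrySize * sizeof(float);
    sizes.outputBytes     = numRois * channels * pooledHeight * pooledWidth * sizeof(float);
    return sizes;
}
```

For the parameters this tutorial runs with (64 ROIs, 4 channels, a 204x128 feature map, and 4x4 pooling), and assuming a 4-byte float, this gives 417792 bytes for the feature map, 1280 bytes for the ROI buffer, and 16384 bytes for the output.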

The ROI Align algorithm’s parameters are passed to the VPU by the host. These parameters are defined via the following command line options:
```
"-num_roi 64 -pw 4 -ph 4 -ch 4 -scale 0.25 -sampling_ratio 4 -fw 204 -fh 128 -batch 1"
```
Note

The values of the command-line parameters shown above produce the expected outputs that are verified at the end of the program run. You can experiment with other values, but the outputs will then no longer match the reference data used in the verification portion of the test.

The parameters are set in the command program object via the following initialization function:
```
        cupva::Executable exec    = CreateROIAlignExec();
        cupva::CmdProgram cmdProg = CreateROIAlignProg(exec, &roiAlignParamsDevice);
```
The Executable and CmdProgram objects are created in the same way as in the previous tutorials.
```
    cupva::CmdProgram m_cmdProg = CmdProgram::Create(exec);
```

Since the device-side code needs to know the parameters for processing the feature-map, they are set in the program object by the host as follows:

    m_cmdProg["output_size"]    = (int)params->output_size;
    m_cmdProg["spatial_scale"]  = (float)params->spatial_scale;
    m_cmdProg["channels"]       = (int)params->channels;
    m_cmdProg["height"]         = (int)params->height;
    m_cmdProg["width"]          = (int)params->width;
    m_cmdProg["pooled_height"]  = (int)params->pooled_height;
    m_cmdProg["pooled_width"]   = (int)params->pooled_width;
    m_cmdProg["sampling_ratio"] = (float)params->sampling_ratio;
    m_cmdProg["batch_count"]    = (int)params->batch_count;

We now configure the DataFlows for transferring the buffer of ROI coordinates and output feature maps.

    int32_t num_rois = params->num_rois;

    auto coords_trig  = m_cmdProg["coords_trig"];
    float *roi_buffer = m_cmdProg["roi_buffer"].ptr<float>();

    auto pooledFeatureMapTrig = m_cmdProg["pooledFeatureMapTrig"];
    float *output_data        = m_cmdProg["output_data"].ptr<float>();

    /** [start_SQDF_Phases] */
    constexpr int32_t p0{1};
    constexpr int32_t p1{2};
    SequenceDataFlow &roiBufferDF  = m_cmdProg.addDataFlowHead<SequenceDataFlow>(p0).handler(coords_trig);
    SequenceDataFlow &outputDataDF = m_cmdProg.addDataFlowHead<SequenceDataFlow>(p1).handler(pooledFeatureMapTrig);
    /** [end_SQDF_Phases] */

    roiBufferDF.addTransfer()
        .tile(num_rois * 5 * sizeof(float))
        .src(params->roi_buffer, 1)
        .dst(roi_buffer, 1)
        .mode(TransferModeType::CONTINUOUS);

    outputDataDF.addTransfer()
        .tile(params->pooled_width * params->pooled_height * sizeof(float))
        .src(output_data, 1)
        .srcDim1(ROIS_PER_BATCH, 0)
        .srcDim2(params->channels, 0)
        .srcDim3(num_rois / ROIS_PER_BATCH, 0)
        .dst(params->output_data, 1)
        .dstDim1(ROIS_PER_BATCH, params->channels * params->pooled_width * params->pooled_height * sizeof(float))
        .dstDim2(params->channels, params->pooled_width * params->pooled_height * sizeof(float))
        .dstDim3(num_rois / ROIS_PER_BATCH,
                 ROIS_PER_BATCH * params->channels * params->pooled_width * params->pooled_height * sizeof(float))
        .mode(TransferModeType::TILE);

    uint32_t roi_processed = num_rois / ROIS_PER_BATCH * ROIS_PER_BATCH;
    uint32_t roi_rem       = num_rois - roi_processed;
    if (roi_rem != 0)
    {
        outputDataDF.addTransfer()
            .tile(params->pooled_width * params->pooled_height * sizeof(float))
            .src(output_data, 1)
            .srcDim1(roi_rem, 0)
            .srcDim2(params->channels, 0)
            .dst((float *)(params->output_data) +
                 (roi_processed * params->channels * params->pooled_width * params->pooled_height), 1)
            .dstDim1(roi_rem, params->channels * params->pooled_width * params->pooled_height * sizeof(float))
            .dstDim2(params->channels, params->pooled_width * params->pooled_height * sizeof(float))
            .mode(TransferModeType::TILE);
    }

Note

We have added a declaration of phases to the ROI info list and pooled feature-map dataflows. Since these two dataflows do not run simultaneously, separating them into distinct phases allows for additional optimization of the DMA channel resources.
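
The destination strides of the pooled-output dataflow configured above determine where each pooled tile lands in DRAM. The sketch below enumerates the destination byte offsets in the order the device code triggers the tiles, assuming that dimension 1 (ROI within a batch) advances fastest, then dimension 2 (channel), then dimension 3 (batch), which matches the device loop nesting. It does not call cuPVA; `outputTileOffsets` and its parameters are hypothetical, and the ROIS_PER_BATCH value is passed in because its actual value is not given here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the destination byte offset of every pooled tile, in trigger order.
// tileBytes corresponds to pooled_width * pooled_height * sizeof(float).
std::vector<std::size_t> outputTileOffsets(std::size_t numRois, std::size_t channels, std::size_t tileBytes,
                                           std::size_t roisPerBatch)
{
    assert(roisPerBatch > 0);

    const std::size_t roiStride     = channels * tileBytes;     // dstDim1 advance
    const std::size_t channelStride = tileBytes;                // dstDim2 advance
    const std::size_t batchStride   = roisPerBatch * roiStride; // dstDim3 advance

    const std::size_t fullBatches  = numRois / roisPerBatch;
    const std::size_t roiProcessed = fullBatches * roisPerBatch;
    const std::size_t roiRem       = numRois - roiProcessed;

    std::vector<std::size_t> offsets;
    for (std::size_t b = 0; b < fullBatches; ++b)
    {
        for (std::size_t c = 0; c < channels; ++c)
        {
            for (std::size_t r = 0; r < roisPerBatch; ++r)
            {
                offsets.push_back(b * batchStride + c * channelStride + r * roiStride);
            }
        }
    }

    // Trailing partial batch, mirroring the second addTransfer() call.
    const std::size_t remBase = roiProcessed * roiStride;
    for (std::size_t c = 0; c < channels; ++c)
    {
        for (std::size_t r = 0; r < roiRem; ++r)
        {
            offsets.push_back(remBase + c * channelStride + r * roiStride);
        }
    }
    return offsets;
}
```

Under these assumptions every tile receives a distinct offset and the output buffer ends up laid out as [roi][channel][pooled_height][pooled_width], even though the tiles are produced channel by channel within each batch.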

We now configure the Sequence DataFlows for transferring feature-map ROI data.

    SequenceDataFlow &featureMapDF = m_cmdProg.addDataFlowHead<SequenceDataFlow>();

    auto vpu_cfg_tbl = m_cmdProg["vpu_cfg_tbl"];
    featureMapDF.handler(vpu_cfg_tbl);

    float *feature_map_ping = m_cmdProg["feature_map_buf_ping"].ptr<float>();

    featureMapDF.addTransfer()
        .src(params->feature_map, params->width * sizeof(float))
        .dst(feature_map_ping, params->width * sizeof(float))
        .tile(params->width * sizeof(float), params->height)
        .mode(TransferModeType::CONTINUOUS);

Now that the dataflows have been declared, they are ready to be compiled on the host, a required step as mentioned in previous tutorials.
```
    m_cmdProg.compileDataFlows();
```

The base addresses are set for the input and output data arrays here:

    m_cmdProg["feature_map_dram_addr"] = params->feature_map;

We now add a post fence for syncing with VPU completion, and submit the cmd stream to the VPU.

        CmdRequestFences f{postFence};
        cmdProg.updateDataFlows();
        stream.submit({&cmdProg, &f});

We wait for the VPU to finish processing using these sync points.
```
        postFence.wait();
```

We verify the output with the following mem comparison.

        int32_t errNum = 0;
        for (int num = 0; num < (roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels *
                                 roiAlignParamsDevice.pooled_width * roiAlignParamsDevice.pooled_height);
             num++)
        {
            float *output_data_ptr = (float *)roiAlignParamsHost.output_data;

            if (std::fabs(output_data_ptr[num] - output_value_c_ref[num]) > 0.01)
            {
                errNum = 1;
                std::cout << "\nMismatch at num " << num << " abs(opt-ref) -- " << output_data_ptr[num] << "\t"
                          << output_value_c_ref[num] << std::endl;
            }
        }

We then delete the allocated resources for cleanup.

        mem::Free(roiAlignParamsDevice.feature_map);
        mem::Free(roiAlignParamsDevice.roi_buffer);
        mem::Free(roiAlignParamsDevice.output_data);

The tutorial code is run on the command line as follows:

    ./optimized_roi_align_c -a <Tutorial Assets Directory Path> -num_roi 64 -pw 4 -ph 4 -ch 4 -scale 0.25 -sampling_ratio 4 -fw 204 -fh 128 -batch 1

You should see “Test Pass” reported upon successful execution of the code.

C

We begin the host code by creating syncpoints and the stream objects for our ROI Align application.

    cupvaError_t syncResourceErr = CUPVA_ERROR_NONE;
    cupvaSyncObj_t postSync;
    syncResourceErr = CupvaSyncObjCreate(&postSync, true, CUPVA_SIGNALER_WAITER, CUPVA_SYNC_YIELD);
    if (syncResourceErr != CUPVA_ERROR_NONE)
    {
        printf("ROI Align: Sync object creation failed = %d", syncResourceErr);
        return 1;
    }

    cupvaFence_t postFence;
    syncResourceErr = CupvaFenceInit(&postFence, postSync);
    if (syncResourceErr != CUPVA_ERROR_NONE)
    {
        printf("ROI Align: Fence creation failed = %d", syncResourceErr);
        return 1;
    }

    cupvaStream_t stream;
    syncResourceErr = CupvaStreamCreate(&stream, CUPVA_PVA0, CUPVA_VPU0);
    if (syncResourceErr != CUPVA_ERROR_NONE)
    {
        printf("ROI Align: Stream creation failed = %d", syncResourceErr);
        return 1;
    }
    cupvaCmdStatus_t cmdstatus[2] = {NULL};

Input and output reference data are loaded into host memory in the same manner as ROI Align Layer on VPU Tutorial. Buffers to hold the input and output data are allocated in DRAM using the cuPVA memory alloc API.

    CupvaMemAlloc(
        (void **)&roiAlignParamsDevice.feature_map,
        (roiAlignParamsDevice.channels * roiAlignParamsDevice.height * roiAlignParamsDevice.width) * sizeof(float),
        CUPVA_READ_WRITE, CUPVA_ALLOC_DRAM);

    CupvaMemAlloc((void **)&roiAlignParamsDevice.roi_buffer,
                  (roiAlignParamsDevice.num_rois * ROI_INFO_ENTRY_SIZE * sizeof(float)), CUPVA_READ_WRITE,
                  CUPVA_ALLOC_DRAM);

    CupvaMemAlloc((void **)&roiAlignParamsDevice.output_data,
                  (roiAlignParamsDevice.num_rois * roiAlignParamsDevice.channels * roiAlignParamsDevice.pooled_width *
                   roiAlignParamsDevice.pooled_height * sizeof(float)),
                  CUPVA_READ_WRITE, CUPVA_ALLOC_DRAM);

The ROI Align algorithm’s parameters are passed to the VPU by the host. These parameters are defined via the following command line options:

    "-num_roi 64 -pw 4 -ph 4 -ch 4 -scale 0.25 -sampling_ratio 4 -fw 204 -fh 128 -batch 1"

They are set in the command program object via the following initialization function:

    cupvaCmd_t ROI_AlignCmdProg;
    cupvaExecutable_t ROI_AlignExec;

    Initialize(&roiAlignParamsDevice, &ROI_AlignCmdProg, &ROI_AlignExec);

The Executable and CmdProgram objects are created in the same way as in the previous tutorials.

    CupvaExecutableCreate(exec, PVA_EXECUTABLE_DATA(optimized_roi_align_dev),
                          PVA_EXECUTABLE_SIZE(optimized_roi_align_dev));
    CupvaCmdProgramCreate(cmdProg, *exec);