PVA Accelerated Primitives Library (APL) - Harris Corners

The PVA SDK contains a device-side library of common primitive operations on tiles. On Thor, this library uses specialized hardware that can run in parallel with the VPU. On Orin, the same functionality is implemented on the VPU, so code remains source-compatible across PVA hardware generations.

This tutorial demonstrates how to use the PVA accelerated primitives library to perform Harris Corner detection followed by non-maximum suppression (NMS). The input is a 2D tile resident in VMEM. We use the pva_device_testlib CMake function to execute device-side code without calling cuPVA host APIs directly.

For a complete API reference for the PVA-APL, refer to PVA Accelerated Primitives Library.

Device Code

  1. To make PVA APL available, we explicitly include the public header, pva_apl.h.

    #include "HarrisDetect.h"
    
    #include <cupva_device.h>
    #include <pva_apl.h>
    
  2. We now declare some temporary buffers. First, we declare the intermediate buffer which holds the Harris output prior to NMS.

    #define QBITS 5
    #define LAMBDA 0.05f
    
    // Some APL functions will output fixed blocks of rows. The Harris Corner API is one such function
    // and requires us to round the rows up to a multiple of 16
    #define ROUND_16(_x_) (((_x_) + 15) / 16 * 16)
    VMEM(A, int32_t, harrisOutput, ROUND_16(NMS_IN_TILE_H) * NMS_IN_TILE_W);
    
  3. Additionally, some PVA APL primitives require user-allocated scratch buffers. We use helper macros to determine the correct sizes for these allocations. In some circumstances, scratch buffers may be omitted. For example, code that targets only Thor may pass NULL for the scratch buffers of the Harris Corner detector. Refer to each primitive’s API documentation for details.

    // Harris Corners requires some scratch buffers for compatibility with Orin
    // For best performance, each should be located in a separate superbank
    VMEM(A, uint8_t, harrisScratch0, PVAAPL_HARRIS_SCRATCH_SIZE(IN_TILE_W, IN_TILE_H));
    VMEM(B, uint8_t, harrisScratch1, PVAAPL_HARRIS_SCRATCH_SIZE(IN_TILE_W, IN_TILE_H));
    VMEM(C, uint8_t, harrisScratch2, PVAAPL_HARRIS_SCRATCH_SIZE(IN_TILE_W, IN_TILE_H));
    
    // NMS requires a scratch buffer for compatibility with Orin
    VMEM(C, uint8_t, nmsScratch, PVAAPL_NMS_SCRATCH_SIZE(NMS_IN_TILE_W, NMS_IN_TILE_H));
    
  4. Next, we initialize the PVA APL handles. All PVA APL primitives provide the same set of APIs:

    • Init: should be done once at the start of the kernel

    • Update: optional, may be performed each tile iteration to update base pointers

    • Execute: used to start the primitive running

    Here we initialize the Harris and NMS handles using their respective Init APIs:

    void HarrisDetect(HarrisDetectData *data)
    {
        // Declare APL handles
        PvaAplHarrisCornerS16 harrisHdl;
        PvaAplNms5x5S32 nmsHdl;
    
        // The Harris Corner APL function expects lambda to be provided as an integer, quantized between 0 and 65536
        int32_t lambdaQ = (int32_t)(LAMBDA * (float)(1 << 16));
        // Coefficients to the Harris Corner 3x3 separable smoothing filter
        int16_t coeff[] = {3, 10, 3};
    
        // We init both Harris corner and NMS APL handles
        pvaAplInitHarrisCornerS16(&harrisHdl, NULL, &coeff[0], QBITS, lambdaQ, IN_TILE_W, IN_TILE_H, IN_TILE_W,
                                  NMS_IN_TILE_W, &harrisOutput[0], &data->inTile[0], sizeof(data->inTile), harrisScratch0,
                                  harrisScratch1, harrisScratch2);
        pvaAplInitNms5x5S32(&nmsHdl, NULL, NMS_IN_TILE_W, NMS_IN_TILE_H, NMS_IN_TILE_W, OUT_TILE_W, NULL, NULL, 0,
                            nmsScratch);
    
  5. We now update and execute the Harris stage. In a real application, this snippet would sit inside a tile loop, so the same handle can be reused for all tiles; here we process only a single tile. After calling pvaAplExecHarrisCornerS16(), we must synchronize with pvaAplWait(). Note that only a single PVA APL primitive may be executing at any time.

        // Handles may be updated after initialization. This helps in cases where the I/O is double buffered
        // and needs to be relocated between tiles
        pvaAplUpdateHarrisCornerS16(&harrisHdl, &data->inTile[0], &harrisOutput[0]);
        // Exec is used to start the primitive running
        pvaAplExecHarrisCornerS16(&harrisHdl);
        // Wait is used to stall the VPU until the primitive has completed
        pvaAplWait();
    
  6. Now we repeat the sequence for NMS, after which output data is written.

        // Repeat for NMS
        pvaAplUpdateNms5x5S32(&nmsHdl, &harrisOutput[0], &data->outTile[0]);
        pvaAplExecNms5x5S32(&nmsHdl);
        pvaAplWait();
    }
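
For readers unfamiliar with the second stage, the following is a minimal host-side C++ sketch of what a 5x5 non-maximum suppression typically computes. It is a scalar reference for illustration only, not the PVA APL implementation. The function name `referenceNms5x5`, the choice to keep the score at a local maximum and write 0 elsewhere, the tie handling (a pixel survives if no neighbor is strictly greater), and the assumption that the input tile is 4 pixels wider and taller than the output tile are all assumptions made for this sketch. Refer to the API documentation for the primitive's actual semantics.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for 5x5 non-maximum suppression (illustrative only).
// Assumption: the input is (outW + 4) x (outH + 4), row-major, so every
// output pixel has a full 5x5 neighborhood. A pixel keeps its score if no
// value in its window is strictly greater; otherwise 0 is written.
std::vector<int32_t> referenceNms5x5(const std::vector<int32_t> &in, int32_t outW, int32_t outH)
{
    const int32_t inW = outW + 4;
    std::vector<int32_t> out(static_cast<std::size_t>(outW) * outH, 0);
    for (int32_t y = 0; y < outH; y++)
    {
        for (int32_t x = 0; x < outW; x++)
        {
            // Center of the 5x5 window, expressed in input coordinates
            const int32_t center = in[(y + 2) * inW + (x + 2)];
            bool isMax = true;
            for (int32_t dy = 0; dy < 5 && isMax; dy++)
            {
                for (int32_t dx = 0; dx < 5; dx++)
                {
                    if (in[(y + dy) * inW + (x + dx)] > center)
                    {
                        isMax = false;
                        break;
                    }
                }
            }
            out[y * outW + x] = isMax ? center : 0;
        }
    }
    return out;
}
```

A reference like this can be handy for spot-checking device output on small tiles, provided the assumptions above are first confirmed against the API documentation.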
    

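The two pieces of integer arithmetic in the device code, rounding a row count up to a multiple of 16 (step 2) and quantizing lambda to 16 fractional bits (step 4), can be checked on the host. The sketch below mirrors the `ROUND_16` macro and the `lambdaQ` expression from the tutorial. The helper names `roundUp16` and `quantizeLambda` and the example row counts are illustrative assumptions, not part of the SDK.

```cpp
#include <cstdint>

// Mirrors ROUND_16 from the device code: round up to the next multiple of 16.
constexpr int32_t roundUp16(int32_t x)
{
    return (x + 15) / 16 * 16;
}

// Mirrors the lambdaQ expression: scale by 2^16 and truncate toward zero.
inline int32_t quantizeLambda(float lambda)
{
    return static_cast<int32_t>(lambda * static_cast<float>(1 << 16));
}

// Example row counts (hypothetical, chosen only to show the rounding)
static_assert(roundUp16(88) == 96, "88 rows occupy 6 blocks of 16");
static_assert(roundUp16(96) == 96, "multiples of 16 are unchanged");
static_assert(roundUp16(97) == 112, "one extra row costs a full block");
```

With `LAMBDA` set to 0.05f, `quantizeLambda` returns 3276, which is 0.05 * 65536 = 3276.8 truncated toward zero.
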
Host Code

To simplify this tutorial, we utilize the pva_device_testlib feature provided by PVA SDK CMake scripts. This allows us to call certain device functions synchronously from host code without needing to use any cuPVA Host API functions.

  1. We declare a host-side entrypoint (main function) and load some test data.

    int main(int argc, char **argv)
    {
        // Find the input asset
        constexpr int32_t MAX_IMAGE_PATH_LENGTH{320};
        char assetsDirectory[MAX_IMAGE_PATH_LENGTH];
        if (GetAssetsDirectory(argc, argv, assetsDirectory, MAX_IMAGE_PATH_LENGTH) != 0)
        {
            std::cout << "Skipped running this application - assets dir not available!" << std::endl;
            return 0;
        }
    
        // Fill input data
        HarrisDetectData data{};
        uint8_t inTileU8[IN_TILE_H * IN_TILE_W];
        {
            if (ReadImageBuffer(inputImageName.c_str(), assetsDirectory, &inTileU8[0], sizeof(inTileU8)) != 0)
            {
                std::cout << "Error reading " << inputImageName << std::endl;
                return -1;
            }
            // Promote input to 16bit
            for (int32_t i = 0; i < IN_TILE_H * IN_TILE_W; i++)
            {
                data.inTile[i] = inTileU8[i];
            }
        }
    
  2. The device function has been made available to us as a host-side API call, which we can now invoke directly. Internally, code has been generated to handle launching the device and synchronizing.

        // Run the device code
        HarrisDetect(&data);
    
  3. Finally, we write the output to a file.

        // Determine a threshold which yields around 20 corners
        int32_t threshold;
        {
            std::vector<int32_t> sortedScores(OUT_TILE_H * OUT_TILE_W);
            std::memcpy(sortedScores.data(), &data.outTile[0], sizeof(data.outTile));
            std::sort(sortedScores.begin(), sortedScores.end());
            threshold = sortedScores[(OUT_TILE_H * OUT_TILE_W) - 20];
        }
    
        // Annotate the input image with detected corners
        uint8_t imgOut[OUT_TILE_H * OUT_TILE_W];
        for (int32_t i = 0; i < OUT_TILE_H; i++)
        {
            for (int32_t j = 0; j < OUT_TILE_W; j++)
            {
                int32_t const idxOut      = i * OUT_TILE_W + j;
                constexpr int32_t xOffset = (IN_TILE_W - OUT_TILE_W) / 2;
                constexpr int32_t yOffset = (IN_TILE_H - OUT_TILE_H) / 2;
                int32_t const idxIn       = (i + yOffset) * IN_TILE_W + (j + xOffset);
                int32_t const score       = data.outTile[idxOut];
                if (score >= threshold)
                {
                    imgOut[idxOut] = std::numeric_limits<uint8_t>::max();
                }
                else
                {
                    imgOut[idxOut] = inTileU8[idxIn] / 2;
                }
            }
        }
    
        // Save the output
        if (WriteImageBuffer(outputImageName.c_str(), ".", &imgOut[0], sizeof(imgOut)) != 0)
        {
            std::cout << "Error writing " << outputImageName << std::endl;
            return -1;
        }
    }
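
The threshold selection in step 3 sorts every score, although only the 20th largest value is needed. As a possible alternative, `std::nth_element` finds that value in linear time on average. The sketch below is an illustration rather than part of the tutorial source; the helper name `kthLargest` is hypothetical, and it assumes `1 <= k <= scores.size()`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Returns the k-th largest score (k = 1 gives the maximum).
// The vector is taken by value so the caller's data is left untouched.
int32_t kthLargest(std::vector<int32_t> scores, std::size_t k)
{
    auto kth = scores.begin() + static_cast<std::ptrdiff_t>(k - 1);
    // Partition in descending order so position k - 1 holds the k-th largest
    std::nth_element(scores.begin(), kth, scores.end(), std::greater<int32_t>());
    return *kth;
}
```

Calling `kthLargest(scores, 20)` selects the same value as indexing the ascending sorted array at `size - 20`. In either version, tied scores can cause more than 20 pixels to pass the `score >= threshold` test, which is why the tutorial describes the result as around 20 corners.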
    

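The index arithmetic in the annotation loop maps each output pixel back to the input pixel it was computed from, on the basis that the output tile is centered within the input tile. The sketch below works through that mapping with tile sizes inferred from the sample file names (128x96 input, 120x88 output). Those sizes, and the constant and function names, are assumptions for illustration and are not taken from HarrisDetect.h.

```cpp
#include <cstdint>

// Assumed tile sizes, inferred from the sample file names (not from HarrisDetect.h)
constexpr int32_t kInW = 128, kInH = 96;
constexpr int32_t kOutW = 120, kOutH = 88;

// Border lost on each side when the output is centered in the input
constexpr int32_t kXOffset = (kInW - kOutW) / 2;
constexpr int32_t kYOffset = (kInH - kOutH) / 2;

// Linear input index corresponding to output pixel (row i, column j)
constexpr int32_t inputIndex(int32_t i, int32_t j)
{
    return (i + kYOffset) * kInW + (j + kXOffset);
}

static_assert(kXOffset == 4 && kYOffset == 4, "4-pixel border on every side");
static_assert(inputIndex(0, 0) == 4 * 128 + 4, "first output pixel maps to input (4, 4)");
static_assert(inputIndex(kOutH - 1, kOutW - 1) == 91 * 128 + 123, "last output pixel maps to input (91, 123)");
```
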
Build Script

To build the application, the following CMakeLists.txt may be used. We utilize pva_device_testlib to simplify the host code for this tutorial. To link with PVA APL, we need to explicitly pass pva_apl as a parameter to LIBS. Note that this also applies when defining a device workload with pva_device.

cmake_minimum_required(VERSION ${CMAKE_MINIMUM_REQUIRED_VERSION})
find_package(pva-sdk REQUIRED)
project(apl_harris)

pva_device_testlib(harris_testlib
    # Any args recognized by pva_device may be passed first, including source files
        vpu/HarrisDetect.c
    LIBS
        pva_apl
    # Arguments specific to pva_device_testlib are passed next
    PUBLIC_HEADERS
        vpu/HarrisDetect.h
    ENTRYPOINTS
        HarrisDetect
    DATA_SIZE
        65536
)

add_executable(apl_harris apl_harris.cpp)
target_link_libraries(apl_harris harris_testlib tutorial_utils)
install(TARGETS apl_harris
        RUNTIME
        COMPONENT pva-sdk-samples)

Running the Code

The complete source code is installed along with PVA SDK samples. After building, it may be invoked as follows:

$ ./apl_harris -a <Tutorial Assets Directory Path>
Read 12288 bytes from <Tutorial Assets Directory Path>/kodim8_128x96_grayscale.raw
Wrote 10560 bytes to ./kodim8-120x88-corners.raw

The inputs and outputs are raw 8-bit grayscale data. Users may want to convert these to PNG for viewing. For example, using ImageMagick’s convert tool:

convert -size 120x88 -depth 8 gray:kodim8-120x88-corners.raw kodim8-120x88-corners.png
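
If ImageMagick is not available, another option is to wrap the raw data in a binary PGM (P5) header, a format many image viewers can open. The following is a minimal C++ sketch of that conversion; the function name `rawToPgm` and its arguments are illustrative and not part of the SDK samples.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Converts 8-bit grayscale raw data to a binary PGM (P5) file.
// Returns false if a file cannot be opened or the input size does not match.
bool rawToPgm(const std::string &rawPath, const std::string &pgmPath, int width, int height)
{
    std::ifstream in(rawPath, std::ios::binary);
    if (!in)
    {
        return false;
    }
    std::vector<char> pixels((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
    if (pixels.size() != static_cast<std::size_t>(width) * height)
    {
        return false;
    }

    std::ofstream out(pgmPath, std::ios::binary);
    if (!out)
    {
        return false;
    }
    // PGM header: magic number, width and height, maximum gray value
    out << "P5\n" << width << " " << height << "\n255\n";
    out.write(pixels.data(), static_cast<std::streamsize>(pixels.size()));
    return static_cast<bool>(out);
}
```

For the output of this tutorial, the call would be `rawToPgm("kodim8-120x88-corners.raw", "kodim8-120x88-corners.pgm", 120, 88)`.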

The input and output tiles should look like the images below. The output shows a darkened copy of the input with the top 20 scoring corners overlaid in white.

Input Tile
Output Tile