VMEM Buffers, MemAlloc, and DataFlows#
In previous tutorials, we have covered mechanisms to run simple tasks on the PVA. However, one piece of the puzzle is still missing to let the PVA do the job it is very good at: processing dense data such as image pixels. We have to move the data that the VPUs process into the PVA's internal memory, VMEM. Likewise, processing results should be transferred back to DRAM to feed the subsequent stages of the pipeline, which may run on different engines.
The PVA uses the Direct Memory Access (DMA) engine to transfer data in and out of VMEM. DMA tasks can be partitioned into tiles and can run in parallel with the VPU tasks, maximizing processing throughput by hiding memory access latency. The cuPVA runtime abstracts the DMA tasks as DataFlows that we attach to CmdPrograms.
In this tutorial, we will see:
How to allocate a buffer in VMEM.
How to allocate a buffer in DRAM that is accessible by both the host CPU and the DMA.
How to configure simple DataFlows.
How to trigger a tile transfer from the VPU.
This tutorial is our last iteration of the “Hello World!” application. This time, the string buffer that the VPU prints is filled by triggering a DataFlow.
Device Code#
The VPU prints a message buffer filled by a DataFlow transfer.
Include the required device-side (PVA-side) header files first.
```c
#include <cupva_device.h>       /* Main device-side header file */
#include <cupva_device_debug.h> /* Header file containing the printf function */
```
In the VMEM Scalars Tutorial, we declared a scalar value using the VMEM macro. The same macro can be used to declare a buffer. messageBufferVMEM is filled by the DMA transfer.

```c
#define MAX_MESSAGE_BUFFER_SIZE 128

VMEM(A, char, messageBufferVMEM, MAX_MESSAGE_BUFFER_SIZE);
```
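For reference, the macro arguments are the VMEM bank, the element type, the variable name, and the element count. The extra declarations below are hypothetical and only illustrate the form; they are not used in this tutorial.

```c
/* Hypothetical declarations, following the same VMEM(<bank>, <type>, <name>, <count>) form.
 * These are not part of the tutorial code. */
VMEM(B, int, coefficients, 32);  /* 32 integers in VMEM bank B */
VMEM(C, char, scratchBytes, 64); /* 64 bytes in VMEM bank C */
```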
dataFlowTrigHandle is used as a handle to trigger tile transfers.

```c
VMEM_RDF_UNIFIED(A, dataFlowTrigHandle);
```
The VPU initiates the DataFlow transfer through the declared handle: cupvaRasterDataFlowOpen() binds the handle to the destination tile buffer in VMEM, and cupvaRasterDataFlowAcquire() triggers the tile transfer and blocks until the tile has been written to VMEM.

```c
CUPVA_VPU_MAIN()
{
    cupvaRasterDataFlowOpen(dataFlowTrigHandle, &messageBufferVMEM[0]);
    cupvaRasterDataFlowAcquire(dataFlowTrigHandle);
```

The string buffer we would like to print is therefore ready when the cupvaRasterDataFlowAcquire() call returns. We release the tile, close the DataFlow, and then simply print the buffer contents and return from the main function.

```c
    cupvaRasterDataFlowRelease(dataFlowTrigHandle);
    cupvaRasterDataFlowClose(dataFlowTrigHandle);

    printf("%s\n", messageBufferVMEM);

    return 0;
}
```
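For a single tile, this open/acquire/release/close sequence is all that is needed. When the host configures the DataFlow to cover a larger buffer with several tiles, the same calls are typically repeated per tile. The loop below is only a hypothetical outline: numTiles and processTile() are placeholders, and the exact tile-buffer handling for multi-tile flows is covered in the later RasterDataFlow tutorials.

```c
/* Hypothetical multi-tile outline (not part of this tutorial's code). */
cupvaRasterDataFlowOpen(dataFlowTrigHandle, &messageBufferVMEM[0]);
for (int i = 0; i < numTiles; ++i)                  /* numTiles is a placeholder */
{
    cupvaRasterDataFlowAcquire(dataFlowTrigHandle); /* wait until the next tile is in VMEM */
    processTile(messageBufferVMEM);                 /* placeholder for VPU processing */
    cupvaRasterDataFlowRelease(dataFlowTrigHandle); /* hand the tile buffer back to the DMA */
}
cupvaRasterDataFlowClose(dataFlowTrigHandle);
```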
Host Code#
In the host-side main function (shown here using the C++ API), we first create an Executable object and a CmdProgram, similar to the previous tutorials.
```cpp
#include <cupva_host.hpp>           // Main host-side C++-API header file
#include <cupva_host_nonsafety.hpp> // Header file for VPU printf functionality.
#include <cupva_platform.h>         // Header that includes macros for specifying PVA executables

#include <iostream>

using namespace cupva;

PVA_DECLARE_EXECUTABLE(dataflows_and_memalloc_dev)

constexpr size_t MAX_MSG_LEN{128};

int main()
{
    try
    {
        Executable exec = Executable::Create(PVA_EXECUTABLE_DATA(dataflows_and_memalloc_dev),
                                             PVA_EXECUTABLE_SIZE(dataflows_and_memalloc_dev));

        CmdProgram prog = CmdProgram::Create(exec);
```
We allocate a char buffer in DRAM using the mem::Alloc() API. Memory allocated with mem::Alloc() is accessible by both the host and the PVA's DMA engine. The memory size in bytes should be provided as an argument.

```cpp
        char *messageBufferDRAM_d = (char *)mem::Alloc(MAX_MSG_LEN);
```
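Because mem::Alloc() takes a size in bytes, allocations for typed data multiply the element count by the element size. The snippet below is a hypothetical sizing example (NUM_SAMPLES and samplesDRAM_d are illustrative names, not part of this tutorial).

```cpp
// Hypothetical sizing example: mem::Alloc() takes a size in bytes,
// so typed buffers multiply element count by element size.
constexpr size_t NUM_SAMPLES = 1024;
int32_t *samplesDRAM_d = (int32_t *)mem::Alloc(NUM_SAMPLES * sizeof(int32_t));
// ... use the buffer as a DataFlow source or destination ...
mem::Free(samplesDRAM_d);
```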
The host CPU and the PVA use different addressing schemes for memory. The pointer returned by the mem::Alloc() call is only accessible by the PVA DMA engine; therefore, we call it a “device-side pointer.” The corresponding CPU-mapped buffer pointer is called a “host-side pointer.” The mem::GetHostPointer() API is called to translate a device-side pointer to a host-side pointer. We put the “Hello World!” message that we ask the VPU to print in the allocated buffer.

```cpp
        char *messageBufferDRAM_h = (char *)mem::GetHostPointer(messageBufferDRAM_d);
        snprintf(messageBufferDRAM_h, MAX_MSG_LEN, "Hello World!");
```
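The translation also matters for output buffers: the DMA writes through the device-side pointer while the CPU reads the results back through the host-side pointer. The snippet below is a hypothetical illustration of that pattern (resultDRAM_d and resultDRAM_h are placeholder names, not part of this tutorial).

```cpp
// Hypothetical pattern: the same device/host pointer pair also works in the
// other direction, e.g. reading back a result buffer that a DataFlow wrote to DRAM.
char *resultDRAM_d = (char *)mem::Alloc(MAX_MSG_LEN);
char *resultDRAM_h = (char *)mem::GetHostPointer(resultDRAM_d);
// ... submit work whose DataFlow writes to resultDRAM_d, then wait for the fence ...
std::cout << "Result: " << resultDRAM_h << std::endl;
mem::Free(resultDRAM_d);
```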
In previous tutorials, we mentioned that the VPU can only load from or store to VMEM. A VMEM buffer is declared in the device code to hold the data to be printed. We can get a pointer to the declared buffer by using its name as the key. Remember that prog now wraps the device-side binary contained in the Executable object.

```cpp
        char *messageBufferVMEM = prog["messageBufferVMEM"].ptr<char>();
```
It is now time to set up the mechanism to move the DRAM source buffer to the destination in VMEM. The data transfer tasks are handled by the DMA engine and are called DataFlows. We create a DataFlow instance and attach it to the CmdProgram using the addDataFlowHead API. cuPVA provides various DataFlow abstractions tailored for different tasks, namely RasterDataFlow, SequenceDataFlow, GatherScatterDataFlow, and TensorDataFlow. In this example, we use RasterDataFlow, which enables general-purpose use of the DMA. Other DataFlow types and configurations are introduced in the upcoming tutorials.

```cpp
        RasterDataFlow &dataFlow = prog.addDataFlowHead<RasterDataFlow>();
```
DataFlow transfers are triggered using handles. Handles are basically VMEM variables. The handle pointer can be acquired in the same way as scalars. The VPU uses the handle to trigger the transfer when it needs the data.

```cpp
        auto dataFlowTrigHandle = prog["dataFlowTrigHandle"];
```
Now we have to configure the DataFlow to tell it what the source and destination for the data are, how much data is transferred, and how the transfer is triggered. The unit of data transferred by the DMA engine is called a tile. A large data buffer can be moved in multiple tiles when needed. cuPVA enables users to specify tiles as 2-D blocks of image pixels or data. In this example, we just transfer a small one-dimensional string buffer that easily fits in a single tile.

```cpp
        dataFlow.handler(dataFlowTrigHandle)
            .tile(strnlen(messageBufferDRAM_h, MAX_MSG_LEN), 1)
            .src(messageBufferDRAM_d, strnlen(messageBufferDRAM_h, MAX_MSG_LEN), 1)
            .tileBuffer(messageBufferVMEM);
```
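The single-tile configuration above is the simplest case. For a larger 2-D buffer, the tile would be smaller than the source so the DMA moves the data tile by tile. The sketch below is hypothetical: the image dimensions, the pointers imageDRAM_d and imageTileVMEM, and the assumption that the VMEM tile buffer simply holds one tile are illustrative only; buffering details are covered in the later RasterDataFlow tutorials.

```cpp
// Hypothetical sketch: tiling a 640 x 480 8-bit image into 64 x 64 tiles.
char *imageDRAM_d   = (char *)mem::Alloc(640 * 480);     // placeholder 640 x 480 source image in DRAM
char *imageTileVMEM = prog["imageTileVMEM"].ptr<char>(); // placeholder VMEM tile buffer declared in device code
RasterDataFlow &imageFlow = prog.addDataFlowHead<RasterDataFlow>();
imageFlow.handler(dataFlowTrigHandle)
    .tile(64, 64)               // tile width and height in pixels
    .src(imageDRAM_d, 640, 480) // source buffer and its dimensions in pixels
    .tileBuffer(imageTileVMEM); // VMEM buffer that receives one tile at a time
```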
The DataFlows in the Command Program cannot be used without compilation. We compile the DataFlows into an internal format which can be submitted to the PVA.

```cpp
        prog.compileDataFlows();
```
The rest of the code is the same as in earlier tutorials. We submit the CmdProgram to the Stream and wait for the Fence to expire. The “Hello World!” message should be printed when the Fence expires.

```cpp
        SetVPUPrintBufferSize(64 * 1024);

        SyncObj sync = SyncObj::Create();
        Fence fence{sync};
        CmdRequestFences rf{fence};
        Stream stream = Stream::Create();

        CmdStatus status[2];
        stream.submit({&prog, &rf}, status);
        fence.wait();

        cupva::Error statusCode = CheckCommandStatus(status[0]);
        if (statusCode != Error::None)
        {
            std::cout << "VPU Program returned an Error Code: " << (int32_t)statusCode << std::endl;
        }

        mem::Free(messageBufferDRAM_d);
    }
    catch (cupva::Exception const &e)
    {
        std::cout << "Caught a cuPVA exception with message: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}
```
The same tutorial implemented with the C API follows. In the host-side main function, we again first create an Executable object and a CmdProgram, similar to the previous tutorials.
```c
#include <cupva_host.h>           /* Main host-side C-API header file */
#include <cupva_host_nonsafety.h> /* Header file for VPU printf functionality. */
#include <cupva_platform.h>       /* Header that includes macros for specifying PVA executables */

#include <stdio.h>
#include <string.h>

#define CHECK_ERROR_GOTO(__v, __e, __l)                    \
    __e = __v;                                             \
    if (__e != CUPVA_ERROR_NONE)                           \
    {                                                      \
        printf("cuPVA C-API return error: %d\n", (__v));   \
        goto __l;                                          \
    }

PVA_DECLARE_EXECUTABLE(dataflows_and_memalloc_dev)

#define MAX_MSG_LEN 128

int main(int argc, char **argv)
{
    int32_t err = 0;

    cupvaExecutable_t exec;
    CHECK_ERROR_GOTO(CupvaExecutableCreate(&exec, PVA_EXECUTABLE_DATA(dataflows_and_memalloc_dev),
                                           PVA_EXECUTABLE_SIZE(dataflows_and_memalloc_dev)),
                     err, ExecutableCreateFailed);

    cupvaCmd_t prog;
    CHECK_ERROR_GOTO(CupvaCmdProgramCreate(&prog, exec), err, CmdProgramCreateFailed);
```
We allocate a char buffer in DRAM using the CupvaMemAlloc() API. Memory allocated with CupvaMemAlloc() is accessible by both the host and the PVA's DMA engine. The memory size in bytes should be provided as an argument.

```c
    char *messageBufferDRAM_d = NULL;
    CHECK_ERROR_GOTO(CupvaMemAlloc((void **)&messageBufferDRAM_d, MAX_MSG_LEN, CUPVA_READ_WRITE, CUPVA_ALLOC_DRAM),
                     err, MemAllocFailed);
```
The host CPU and the PVA use different addressing schemes for memory. The pointer returned by the CupvaMemAlloc() call is only accessible by the PVA DMA engine; therefore, we call it a “device-side pointer.” The corresponding CPU-mapped buffer pointer is called a “host-side pointer.” The CupvaMemGetHostPointer() API is called to translate a device-side pointer to a host-side pointer. We put the “Hello World!” message that we ask the VPU to print in the allocated buffer.

```c
    char *messageBufferDRAM_h = NULL;
    CHECK_ERROR_GOTO(CupvaMemGetHostPointer((void **)&messageBufferDRAM_h, (void *)messageBufferDRAM_d),
                     err, SyncObjCreateFailed);

    snprintf(messageBufferDRAM_h, MAX_MSG_LEN, "Hello World!");
```
In previous tutorials, we mentioned that the VPU can only load from or store to VMEM. A VMEM buffer is declared in the device code to hold the data to be printed. We can get a pointer to the declared buffer by calling the CupvaCmdProgramGetParameter() and CupvaParameterGetDevicePointer() APIs consecutively. Remember that prog now wraps the device-side binary contained in the Executable object.

```c
    cupvaParameter_t paramMessageBufferVMEM;
    CHECK_ERROR_GOTO(CupvaCmdProgramGetParameter(&prog, &paramMessageBufferVMEM, "messageBufferVMEM"),
                     err, SyncObjCreateFailed);

    char *messageBufferVMEM;
    CHECK_ERROR_GOTO(CupvaParameterGetDevicePointer(&paramMessageBufferVMEM, (void const **)&messageBufferVMEM),
                     err, SyncObjCreateFailed);
```
It is now time to set up the mechanism to move the DRAM source buffer to the destination in VMEM. The data transfer tasks are handled by the DMA engine and are called DataFlows. We create a DataFlow instance and attach it to the CmdProgram using the CupvaCmdProgramAddDataFlowHead() API. cuPVA provides various DataFlow abstractions tailored for different tasks, namely RasterDataFlow, SequenceDataFlow, GatherScatterDataFlow, and TensorDataFlow. In this example, we use RasterDataFlow, which enables general-purpose use of the DMA. Other DataFlow types and configurations are introduced in the upcoming tutorials.

```c
    cupvaDataFlow_t dataFlow;
    CHECK_ERROR_GOTO(CupvaCmdProgramAddDataFlowHead(&prog, &dataFlow, CUPVA_RASTER_DATAFLOW, 0, 1.0F),
                     err, SyncObjCreateFailed);
```
DataFlow transfers are triggered using handles. Handles are basically VMEM variables. The handle pointer can be acquired in the same way as scalars. The VPU uses the handle to trigger the transfer when it needs the data.

```c
    cupvaParameter_t dataFlowTrigHandle;
    CHECK_ERROR_GOTO(CupvaCmdProgramGetParameter(&prog, &dataFlowTrigHandle, "dataFlowTrigHandle"),
                     err, SyncObjCreateFailed);
```
Now we have to configure the DataFlow to tell it what the source and destination for the data are, how much data is transferred, and how the transfer is triggered. An instance of the cupvaRasterDataFlowParams_t structure is initialized to hold the required parameters, and the CupvaRasterDataFlowSetParams() API is called to set them. The unit of data transferred by the DMA engine is called a tile. A large data buffer can be moved in multiple tiles when needed. cuPVA enables users to specify tiles as 2-D blocks of image pixels or data. In this example, we just transfer a small one-dimensional string buffer that easily fits in a single tile.

```c
    cupvaRasterDataFlowParams_t dataFlowParams = {};
    dataFlowParams.handler       = &dataFlowTrigHandle;
    dataFlowParams.tileWidth     = strnlen(messageBufferDRAM_h, MAX_MSG_LEN); /* width of a 2-D tile in number of pixels */
    dataFlowParams.bpp           = sizeof(char);                              /* bytes per pixel. We are transferring a char array, so our pixel size is 1 byte. */
    dataFlowParams.ptrSrc        = messageBufferDRAM_d;                       /* device pointer of the source buffer */
    dataFlowParams.srcWidth      = strnlen(messageBufferDRAM_h, MAX_MSG_LEN); /* source buffer length */
    dataFlowParams.srcHeight     = 1;                                         /* source buffer height */
    dataFlowParams.ptrTileBuffer = messageBufferVMEM;                         /* device pointer of the destination buffer */

    CHECK_ERROR_GOTO(CupvaRasterDataFlowSetParams(dataFlow, &dataFlowParams), err, SyncObjCreateFailed);
```
The DataFlows in the Command Program cannot be used without compilation. We compile the DataFlows into an internal format which can be submitted to the PVA.

```c
    CHECK_ERROR_GOTO(CupvaCmdProgramCompileDataFlows(&prog), err, SyncObjCreateFailed);
```
The rest of the code is the same as in earlier tutorials. We submit the CmdProgram to the Stream and wait for the Fence to expire. The “Hello World!” message should be printed when the Fence expires.

```c
    CHECK_ERROR_GOTO(CupvaSetVPUPrintBufferSize(64 * 1024), err, SyncObjCreateFailed);

    cupvaSyncObj_t sync;
    CHECK_ERROR_GOTO(CupvaSyncObjCreate(&sync, false, CUPVA_SIGNALER_WAITER, CUPVA_SYNC_YIELD), err, SyncObjCreateFailed);

    cupvaFence_t fence;
    CHECK_ERROR_GOTO(CupvaFenceInit(&fence, sync), err, StreamCreateFailed);

    cupvaCmd_t rf;
    CHECK_ERROR_GOTO(CupvaCmdRequestFencesInit(&rf, &fence, 1), err, StreamCreateFailed);

    cupvaStream_t stream;
    CHECK_ERROR_GOTO(CupvaStreamCreate(&stream, CUPVA_PVA0, CUPVA_VPU_ANY), err, StreamCreateFailed);

    cupvaCmd_t const *cmds[2]  = {&prog, &rf};
    cupvaCmdStatus_t status[2] = {NULL, NULL};
    CHECK_ERROR_GOTO(CupvaStreamSubmit(stream, cmds, status, 2, CUPVA_IN_ORDER, -1, -1), err, DeAllocateAllResources);

    bool waitSuccess;
    CHECK_ERROR_GOTO(CupvaFenceWait(&fence, -1, &waitSuccess), err, DeAllocateAllResources);

    cupvaError_t statusCode = CUPVA_ERROR_NONE;
    CupvaCheckCommandStatus(status[0], &statusCode);
    if (statusCode != CUPVA_ERROR_NONE)
    {
        printf("VPU Program returned an Error Code: %d\n", (int32_t)statusCode);
    }

DeAllocateAllResources: /* clean up all allocated resources */
    CupvaStreamDestroy(stream);
StreamCreateFailed: /* clean up resources allocated prior to StreamCreate */
    CupvaSyncObjDestroy(sync);
SyncObjCreateFailed: /* clean up resources allocated prior to SyncObjCreate */
    CupvaMemFree(messageBufferDRAM_d);
MemAllocFailed: /* clean up resources allocated prior to MemAlloc */
    CupvaCmdDestroy(&prog);
CmdProgramCreateFailed: /* clean up resources allocated prior to CmdProgramCreate */
    CupvaExecutableDestroy(exec);
ExecutableCreateFailed: /* clean up resources allocated prior to ExecutableCreate */
    return err;
}
```
Output#
Running the application on the target should print the “Hello World!” message.

```console
$ ./dataflows_and_memalloc_cpp
Hello World!
```
Likewise, running the C-API application on the target should print the “Hello World!” message.

```console
$ ./dataflows_and_memalloc_c
Hello World!
```
Note on Project Structure#
It should be clear from the tutorial above that the triggering pattern in the VPU code is tightly coupled to the DataFlows set up by the host code. This dependency is hard to express in a C header and to export between different units or compilation environments. Therefore, we recommend not using this coupling as the basis for a stable API or ABI.
We recommend that users couple their VPU code with the corresponding DataFlow setup code by grouping the code in shared subdirectories, sharing header files, and compiling the code into a single build artifact to minimize the risk of version mismatch.