Sample Application: Erode Operation
The following example demonstrates the basic usage of PVA Operators to perform an erode operation. This step-by-step tutorial guides you through creating the operator and submitting it for execution on the PVA. PVA operators support two submission paths: CUDA streams for heterogeneous GPU-PVA pipelines, and cuPVA streams for PVA-centric workloads. Note that this example is a simplified version of the actual implementation in the repository; for clarity, error handling code has been omitted.
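The code snippets below use a few variables (image dimensions, kernel size, border value) without declaring them. The declarations below are a hypothetical setup, shown only so the snippets read as a complete program; the names match the snippets, but the values and exact types are assumptions that you should adapt to your input image and structuring element.
// Hypothetical example values, for illustration only.
int64_t imgHeight = 1080, imgWidth = 1920, channelCount = 1; // grayscale frame
int32_t tensorRank = 3;                                      // HWC layout has three dimensions
int32_t knlWidth = 3, knlHeight = 3;                         // 3x3 erosion mask
float   borderValue = 0.0f;                                  // constant border value (type assumed)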
Step 1: Create an Allocator Handle
The setup phase of the application involves creating PVA-accessible input and output image buffers. PVA Operators utilize NVCV Tensor data structures to store the image attributes and data. First, you need to construct an allocator instance. In the next steps, the allocator handle will be passed to the tensor construction APIs to allocate tensors with PVA-accessible memory.
NVCVAllocatorHandle allocatorHandle = NULL;
nvcvAllocatorConstructPva(&allocatorHandle);
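Error handling is omitted throughout this tutorial, but each call reports failures through its return code. A minimal checking pattern, assuming nvcvAllocatorConstructPva returns an NVCVStatus like the other NVCV C APIs, might look like this:
NVCVStatus status = nvcvAllocatorConstructPva(&allocatorHandle);
if (status != NVCV_SUCCESS)
{
    // Report the failure and stop before using the handle.
}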
Step 2: Calculate Tensor Requirements
Next, calculate the requirements for the input and output tensors.
The NVCVTensorRequirements structure holds all the information about the tensor, including the shape, stride, data type, and layout.
The nvcvTensorCalcRequirementsPva API initializes this structure, which will be passed to the tensor construction APIs and operator creation APIs.
NVCVTensorLayout tensorLayout;
nvcvTensorLayoutMake("HWC", &tensorLayout);
int64_t tensorShape[] = {imgHeight, imgWidth, channelCount};
NVCVTensorRequirements tensorRequirements;
nvcvTensorCalcRequirementsPva(tensorRank, tensorShape, NVCV_DATA_TYPE_U8, tensorLayout,
0, 0, &tensorRequirements);
Step 3: Construct Input/Output Tensors
With the tensor requirements calculated, you can now construct the input (inTensorHandle) and output (outTensorHandle) tensors.
The allocator handle created in Step 1 is passed to the tensor construction APIs to allocate the tensors with PVA-accessible memory.
Please note that if the tensors were constructed with the default NVCV allocator, the tensor data would not be accessible to the PVA.
NVCVTensorHandle inTensorHandle;
NVCVTensorHandle outTensorHandle;
nvcvTensorConstruct(&tensorRequirements, allocatorHandle, &inTensorHandle);
nvcvTensorConstruct(&tensorRequirements, allocatorHandle, &outTensorHandle);
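Before the operator runs, the input tensor must be filled with the source image. The tutorial does not show this step; one possible approach is sketched below, assuming the standard NVCV nvcvTensorExportData API, a strided buffer layout, and that the base pointer of a PVA-allocated tensor is writable from the host. srcImage is a hypothetical pointer to tightly packed 8-bit source pixels.
// Sketch only: query the input tensor's buffer and copy the image into it
// row by row, honoring the row stride reported by NVCV.
NVCVTensorData tensorData;
nvcvTensorExportData(inTensorHandle, &tensorData);
uint8_t *basePtr = (uint8_t *)tensorData.buffer.strided.basePtr;
for (int64_t row = 0; row < imgHeight; ++row)
{
    memcpy(basePtr + row * tensorData.buffer.strided.strides[0],
           srcImage + row * imgWidth * channelCount,
           imgWidth * channelCount);
}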
Step 4: Create the Morphology Operator
Create the morphology operator, which will perform the erode operation.
This involves setting up the morphology mask parameters and initializing the operator handle with the pvaMorphologyCreate API.
Specify the tensor requirements, border type, and border value for the operator creation.
The Create API initializes the cuPVA Executables and CmdPrograms required to schedule the operator task on the PVA.
It also sets up cuPVA DataFlows that will be used to transfer image data in and out of the internal VPU memory (VMEM) using the DMA engine.
// Describe a rectangular structuring element of knlWidth x knlHeight.
PvaMorphologyMaskParams maskParams;
maskParams.maskWidth = knlWidth;
maskParams.maskHeight = knlHeight;
maskParams.maskShape = RECTANGLE_MASK;
// Create the erode operator with a constant border.
NVCVOperatorHandle operatorHandle;
pvaMorphologyCreate(&operatorHandle, &tensorRequirements, PVA_ERODE, &maskParams,
                    NVCV_BORDER_CONSTANT, borderValue);
Step 5: Submit the Operator
PVA operators can be submitted to either CUDA streams or cuPVA streams, depending on your application requirements. The CUDA stream submission path is particularly beneficial for creating heterogeneous processing pipelines that combine PVA and CUDA operators, enabling seamless interoperability within a unified execution context. However, mixing CUDA stream and cuPVA stream submissions within the same application is not recommended, as this may result in cache coherency issues between the PVA and GPU memory subsystems. Choose one submission path consistently throughout your application.
Option A: Submit to CUDA Stream
CUDA stream submission allows PVA operators to integrate naturally into CUDA-based pipelines.
This approach leverages the CUDA runtime’s implicit synchronization mechanisms and enables direct composition with CUDA operators.
The operator can be submitted using the pvaMorphologySubmit API with a CUDA stream handle, and execution can be synchronized using standard CUDA stream synchronization primitives.
Requirements:
PVA SDK version 2.7.0 or later
JetPack 7.0 or later (PVA driver version 2007+)
DriveOS 7.0 or later (PVA driver version 2007+)
Note
CUDA stream submission is not supported on the x86 emulator.
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
pvaMorphologySubmit(operatorHandle, stream, inTensorHandle, outTensorHandle);
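Because work enqueued on a CUDA stream executes in order, GPU work submitted to the same stream after the operator can consume its output without extra synchronization. The sketch below launches a hypothetical post-processing kernel; postProcessKernel, gridDim, blockDim, and outputDevicePtr are placeholders, not part of the sample.
// Hypothetical CUDA kernel consuming the erode output. Stream ordering ensures
// it starts only after the PVA operator ahead of it on the stream has completed.
postProcessKernel<<<gridDim, blockDim, 0, stream>>>(outputDevicePtr);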
Option B: Submit to cuPVA Stream
cuPVA stream submission provides control over PVA execution and is designed for PVA-centric workloads.
Create a cuPVA stream to manage the scheduling of the operator task on the PVA hardware.
The operator can be submitted to the cuPVA stream using the pvaMorphologySubmit API.
Ensure that the tensor parameters are consistent with the tensor requirements specified during the operator creation.
The operator submit API calls are non-blocking and return immediately after the operator is submitted to the stream.
You can submit the same operator instance to multiple streams, which is especially useful for processing multiple video streams concurrently using both VPU cores.
cupvaStream_t stream;
CupvaStreamCreate(&stream, CUPVA_PVA0, CUPVA_VPU_ANY);
pvaMorphologySubmit(operatorHandle, stream, inTensorHandle, outTensorHandle);
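As noted above, the same operator instance can be submitted to more than one cuPVA stream, for example to process two video feeds in parallel on both VPU cores. A minimal sketch is shown below; it assumes a second pair of tensors (inTensorHandle2 and outTensorHandle2) constructed as in Step 3, and leaves VPU selection to the driver via CUPVA_VPU_ANY.
cupvaStream_t streamA, streamB;
CupvaStreamCreate(&streamA, CUPVA_PVA0, CUPVA_VPU_ANY);
CupvaStreamCreate(&streamB, CUPVA_PVA0, CUPVA_VPU_ANY);
// One operator instance, two independent submissions, each with its own tensors.
pvaMorphologySubmit(operatorHandle, streamA, inTensorHandle, outTensorHandle);
pvaMorphologySubmit(operatorHandle, streamB, inTensorHandle2, outTensorHandle2);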
Step 6: Synchronize and Wait for Completion
The synchronization mechanism depends on which stream type was used for operator submission.
Option A: Synchronize CUDA Stream
For CUDA stream submissions, use standard CUDA synchronization primitives to wait for operator completion. This approach integrates naturally with existing CUDA workflows and device-host synchronization patterns.
cudaStreamSynchronize(stream);
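If blocking the host thread is undesirable, completion can instead be polled with the standard CUDA stream query API:
// Non-blocking alternative: poll the stream until all queued work has finished.
while (cudaStreamQuery(stream) == cudaErrorNotReady)
{
    // Do other host work, or yield briefly, before polling again.
}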
Option B: Synchronize cuPVA Stream
For cuPVA stream submissions, use cuPVA synchronization APIs (sync objects, fences, and RequestFence commands) to manage synchronization between the host and the PVA. A fence is created and submitted to the stream using a RequestFence command to signal when the operator has completed its execution. After submission, the host waits for the fence to be signaled before reading the output tensor data. You can submit multiple operators to a stream and wait for all of them to complete using a single fence.
// Create a sync object that the PVA signals and the host waits on.
cupvaSyncObj_t sync;
CupvaSyncObjCreate(&sync, false, CUPVA_SIGNALER_WAITER, CUPVA_SYNC_YIELD);
// Initialize a fence backed by the sync object.
cupvaFence_t fence;
CupvaFenceInit(&fence, sync);
// Build a RequestFences command and submit it to the stream after the operator.
cupvaCmd_t requestFences;
CupvaCmdRequestFencesInit(&requestFences, &fence, 1);
cupvaCmd_t const *cmds[1] = {&requestFences};
CupvaStreamSubmit(stream, cmds, NULL, 1, CUPVA_IN_ORDER, -1, -1);
// Block until the fence is signaled, indicating the operator has finished.
bool waitSuccess;
CupvaFenceWait(&fence, -1, &waitSuccess);
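As mentioned above, a single fence can cover several operator submissions to the same stream; the operator submissions are simply enqueued before the RequestFences command, and the fence signals once all of them have completed. The sketch below assumes a hypothetical second operator handle (operatorHandle2) created as in Step 4 and a second tensor pair constructed as in Step 3.
// Enqueue both operators, then request the fence; the fence signals only after
// every command submitted ahead of it on the stream has completed.
pvaMorphologySubmit(operatorHandle, stream, inTensorHandle, outTensorHandle);
pvaMorphologySubmit(operatorHandle2, stream, inTensorHandle2, outTensorHandle2);
CupvaStreamSubmit(stream, cmds, NULL, 1, CUPVA_IN_ORDER, -1, -1);
CupvaFenceWait(&fence, -1, &waitSuccess);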
Step 7: Free Resources
The destruction phase of the application involves freeing the resources used by the operator and tensors. Ensure that all streams and synchronization objects are properly destroyed in this phase to prevent resource leaks and ensure clean application termination.
Destroy the stream handle using cudaStreamDestroy() for CUDA streams or CupvaStreamDestroy() for cuPVA streams.
For cuPVA streams, also destroy the sync object using CupvaSyncObjDestroy().
Finally, release all NVCV resources by decrementing their reference counts.
// Destroy stream (CUDA or cuPVA)
cudaStreamDestroy(stream); // For CUDA stream
// OR
CupvaStreamDestroy(stream); // For cuPVA stream
CupvaSyncObjDestroy(sync); // For cuPVA stream only
// Release NVCV resources
nvcvTensorDecRef(inTensorHandle, NULL);
nvcvTensorDecRef(outTensorHandle, NULL);
nvcvOperatorDestroy(operatorHandle);
nvcvAllocatorDecRef(allocatorHandle, NULL);