# CUDA Interoperability
Complex image/data processing pipelines consist of multiple computation stages. Some pipeline stages may be mapped to the GPU along with the PVA processing stages to fully utilize device resources. The cuPVA runtime provides CUDA Interoperability APIs to conveniently realize such applications.
CUDA Interoperability APIs provide functionality to:
- Create cuPVA device pointers from CUDA device pointers. Using this API, PVA programs can pass data to subsequent CUDA processing stages or ingest data from previous ones.
- Create cuPVA Stream objects from CUDA stream handles. Using this API, PVA programs can be scheduled to run back-to-back with CUDA kernels.
In this tutorial, we extend the contrast stretching example from the previous tutorials to demonstrate the CUDA Interoperability APIs. The first stage of the algorithm, computing an image histogram to determine the input dynamic range of the pixels, is mapped to the PVA. The subsequent pixel-processing stage, which stretches the contrast, is performed on the GPU using a CUDA kernel.
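To make the pixel math concrete before diving into the pipeline code, here is a small self-contained sketch of the per-pixel mapping that the CUDA kernel applies later in this tutorial. The `StretchPixel` helper and the sample range [64, 191] are ours, purely for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Illustrative only: the same mapping stretch_kernel applies per pixel later in this tutorial.
// inputLow/inputHigh stand in for the dynamic range computed by the PVA histogram stage.
static uint8_t StretchPixel(uint8_t in, int32_t inputLow, int32_t inputHigh)
{
    int32_t const inputDynamicRange  = inputHigh - inputLow + 1;
    int32_t const outputDynamicRange = 256;
    int32_t out = outputDynamicRange * (in - inputLow) / inputDynamicRange;
    return (uint8_t)std::max(0, std::min(out, 255)); // clamp to the valid 8-bit range
}

int main()
{
    // With a measured range of [64, 191]: 64 -> 0, 128 -> 128, 191 -> 254.
    printf("%d %d %d\n", StretchPixel(64, 64, 191), StretchPixel(128, 64, 191), StretchPixel(191, 64, 191));
    return 0;
}
```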
## Host Code
### C++ API

The main function starts by initializing the image buffer and parameter pointers to nullptr.
```cpp
int main(int argc, char **argv)
{
    int err = 0;
    if (GetAssetsDirectory(argc, argv, assetsDirectory, MAX_IMAGE_PATH_LENGTH) != 0)
    {
        return 1;
    }
    uint8_t *image_cpu                        = nullptr;
    uint8_t *image_cuda                       = nullptr;
    ContrastStretchParams *stretchParams_cpu  = nullptr;
    ContrastStretchParams *stretchParams_cuda = nullptr;
    cudaStream_t cudaStream                   = nullptr;
    try
    {
        // This tutorial assumes deployment on a platform which allows cuPVA to implicitly access
        // CUDA device pointers. If targeting an older driver version, use cupva::mem::Register.
        if (cupva::GetDriverVersion() < 2006U)
        {
            printf("Unsupported platform, skipping execution...\n");
            return 0;
        }
```
The second stage of the algorithm, contrast stretching, is executed on the GPU. It uses the algorithm parameters computed on the PVA in the first stage.
Memory for the algorithm parameter structure and the image buffer is allocated using the `cudaHostAlloc()` CUDA API call. `cudaHostAlloc()` allocates page-locked memory on the host and maps the allocation into device space when the `cudaHostAllocMapped` flag is passed. For memory allocated using CUDA APIs, cuPVA and CUDA share address spaces, so the CUDA device pointer can be passed directly to cuPVA APIs. The PVA program that executes in the first stage loads the image from, and stores the computed image dynamic range to, these imported memory locations.
```cpp
if (cudaSuccess != cudaHostAlloc(&stretchParams_cpu, sizeof(ContrastStretchParams), cudaHostAllocMapped))
{
    throw std::runtime_error("cudaHostAlloc failed");
}
if (cudaSuccess != cudaHostGetDevicePointer(&stretchParams_cuda, stretchParams_cpu, 0))
{
    throw std::runtime_error("cudaHostGetDevicePointer failed");
}
if (cudaSuccess != cudaHostAlloc(&image_cpu, IMAGE_SIZE * sizeof(uint8_t), cudaHostAllocMapped))
{
    throw std::runtime_error("cudaHostAlloc failed");
}
if (cudaSuccess != cudaHostGetDevicePointer(&image_cuda, image_cpu, 0))
{
    throw std::runtime_error("cudaHostGetDevicePointer failed");
}
```
A CUDA stream is a sequence of operations that execute in issue order on the GPU. The `cudaStreamCreateWithFlags()` CUDA API creates an asynchronous stream. cuPVA's CUDA Interoperability does not support the CUDA default stream, so the CUDA stream handle must be created with the `cudaStreamNonBlocking` flag. Once a CUDA stream is created, all buffers allocated using CUDA APIs are directly available in cuPVA space.

```cpp
if (cudaSuccess != cudaStreamCreateWithFlags(&cudaStream, cudaStreamNonBlocking))
{
    throw std::runtime_error("cudaStreamCreateWithFlags failed");
}
```
The cuPVA `cuda::CreateStream()` API creates a PVA Stream object from a CUDA stream handle. The resulting cuPVA Stream and the cudaStream can be used to submit PVA and CUDA commands, respectively. To create a Stream using this API, the current cuPVA context must have been created after a CUDA context was loaded. The `cuda::ContextSupportsCuda()` API checks whether the current context handle supports CUDA interoperability. Any CUDA runtime API call causes a CUDA context to be loaded, as would a CUDA driver API call to `cuCtxCreate()`. This is why the CUDA memory allocation API calls are placed before the cuPVA API calls in this code. Note that the `cupva::GetHardwareInfo()` call does not create a cuPVA context.
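A minimal sketch of what this step looks like in the C++ API. The argument lists shown here are assumptions, mirroring the C calls used later in this tutorial (`CupvaCudaContextSupportsCuda()` / `CupvaCudaCreateStream()`); consult the cuPVA C++ API reference for the authoritative declarations.

```cpp
// Sketch only: assumed signatures, mirroring the C API variant shown below.
if (!cupva::cuda::ContextSupportsCuda(cupva::ContextGetCurrent())) // assumed context accessor
{
    throw std::runtime_error("A CUDA context must be loaded before creating a PVA Stream from a CUDA stream");
}
// Wrap the CUDA stream; engine/affinity arguments assumed from CUPVA_PVA0 / CUPVA_VPU_ANY in the C API.
cupva::Stream cupvaStream = cupva::cuda::CreateStream(cudaStream, cupva::PVA0, cupva::VPU_ANY);
```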
Next, the PVA program that computes the image dynamic range is created. The steps involving the CmdProgram initialization are very similar to the previous contrast-stretching tutorials. The CUDA-allocated image and parameter pointers serve as the program's input and output, respectively.
```cpp
if (ReadImageBuffer(inputImageName.c_str(), assetsDirectory, image_cpu, IMAGE_SIZE) != 0)
{
    cudaFreeHost(image_cpu);
    cudaFreeHost(stretchParams_cpu);
    err = -1;
    return err;
}
ContrastStretchParams algParams = {
    .inputLowPixelValue           = 0,
    .outputLowPixelValue          = 0,
    .inputHighPixelValue          = 0,
    .outputHighPixelValue         = 255,
    .saturationHistogramCountLow  = IMAGE_SIZE * SATURATED_PIXEL_PERCENTAGE_LOW_INTENSITY / 100,
    .saturationHistogramCountHigh = IMAGE_SIZE * SATURATED_PIXEL_PERCENTAGE_HIGH_INTENSITY / 100};
memcpy(stretchParams_cpu, &algParams, sizeof(ContrastStretchParams));

Executable execComputeDynamicRange =
    Executable::Create(PVA_EXECUTABLE_DATA(cuda_interoperability_compute_dynamic_range_dev),
                       PVA_EXECUTABLE_SIZE(cuda_interoperability_compute_dynamic_range_dev));

CmdProgram progComputeDynamicRange =
    CreateComputeDynamicRangeProg(execComputeDynamicRange, image_cuda, IMAGE_WIDTH, IMAGE_HEIGHT, TILE_WIDTH,
                                  TILE_HEIGHT, stretchParams_cuda);
```
`progComputeDynamicRange` is submitted to the imported stream. cuPVA commands submitted to the imported cuPVA stream and CUDA commands submitted to the cudaStream begin execution in the order they are submitted, as if they shared the same stream. More specifically, PVA commands execute according to the OrderType specified in the `cupva::Stream::submit()` call, while CUDA commands execute per the configuration of the cudaStream. Notice that this time we did not submit a `CmdRequestFences` at the end and did not call `cupva::Fence::wait()` after submission; synchronization is achieved with CUDA APIs, as we will see in the next steps.

```cpp
cupvaStream.submit({&progComputeDynamicRange});
```
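For contrast, a minimal sketch of the fence-based pattern used in the earlier tutorials; the exact constructor shapes here are our assumption, so refer to those tutorials for the authoritative form:

```cpp
// Sketch only: blocks the CPU on PVA completion instead of relying on CUDA stream sync.
cupva::Fence fence{cupvaStream};              // fence tied to the stream (assumed ctor)
cupva::CmdRequestFences requestFences{fence}; // command that signals the fence (assumed ctor)
cupvaStream.submit({&progComputeDynamicRange, &requestFences});
fence.wait();                                 // host blocks until the PVA program finishes
```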
The `StretchContrast()` call is submitted to the CUDA stream we imported in the previous steps. Therefore, it executes after the `progComputeDynamicRange` PVA program completes.

```cpp
StretchContrast(image_cuda, stretchParams_cuda, IMAGE_SIZE, cudaStream);
```
The `stretch_kernel` CUDA function performs image contrast stretching on the GPU. `stretch_kernel` is submitted to the cudaStream and runs asynchronously. The kernel uses a grid-stride loop, so a fixed `<<<BLOCK_COUNT, BLOCK_SIZE>>>` launch covers the whole image regardless of its size.

```cpp
__global__ void stretch_kernel(uint8_t *image, ContrastStretchParams *stretchParams_cuda, int32_t imageSize)
{
    int32_t inputDynamicRange =
        stretchParams_cuda->inputHighPixelValue - stretchParams_cuda->inputLowPixelValue + 1;
    int32_t outputDynamicRange = 256;
    int index                  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride                 = blockDim.x * gridDim.x;
    int32_t outputPixelValue;
    for (int i = index; i < imageSize; i += stride)
    {
        outputPixelValue = outputDynamicRange * (image[i] - stretchParams_cuda->inputLowPixelValue);
        outputPixelValue /= inputDynamicRange;
        outputPixelValue = max(0, min(outputPixelValue, 255));
        image[i]         = (uint8_t)outputPixelValue;
    }
}

void StretchContrast(uint8_t *image, ContrastStretchParams *stretchParams_cuda, int32_t imageSize,
                     cudaStream_t cudaStream)
{
    int blockSize  = BLOCK_SIZE;
    int blockCount = BLOCK_COUNT;
    stretch_kernel<<<blockCount, blockSize, 0, cudaStream>>>(image, stretchParams_cuda, imageSize);
}
```
The `cudaStreamSynchronize()` call blocks the host until all CUDA calls in the cudaStream and all PVA commands submitted to the cuPVA stream run to completion.

```cpp
if (cudaSuccess != cudaStreamSynchronize(cudaStream))
{
    throw std::runtime_error("cudaStreamSynchronize failed");
}
```
The enhanced image is written to the output file and allocated resources are freed in the last step.
```cpp
        if (WriteImageBuffer(outputImageName.c_str(), ".", image_cpu, IMAGE_SIZE) != 0)
        {
            err = -1;
        }
    }
    catch (cupva::Exception const &e)
    {
        std::cout << "Caught a cuPVA exception with message: " << e.what() << std::endl;
        err = 1;
    }
    catch (const std::runtime_error &e)
    {
        std::cout << "Caught a CUDA exception with message: " << e.what() << std::endl;
        err = 1; // report CUDA failures through the exit code as well
    }
    cudaFreeHost(image_cpu);
    cudaFreeHost(stretchParams_cpu);
    cudaStreamDestroy(cudaStream);
    return err;
}
```
### C API

The main function starts by initializing the image buffer and parameter pointers to NULL.
```c
int main(int argc, char **argv)
{
    int32_t err = 0;

    /* This tutorial assumes deployment on a platform which allows cuPVA to implicitly access
     * CUDA device pointers. If targeting an older driver version, use CupvaMemRegister. */
    uint32_t driverVersion;
    if (CupvaGetDriverVersion(&driverVersion) != CUPVA_ERROR_NONE)
    {
        return 1;
    }
    if (driverVersion < 2006U)
    {
        printf("Unsupported platform, skipping execution...\n");
        return 0;
    }

    if (GetAssetsDirectory(argc, argv, assetsDirectory, MAX_IMAGE_PATH_LENGTH) != 0)
    {
        return 1;
    }

    uint8_t *image_cpu                        = NULL;
    uint8_t *image_cuda                       = NULL;
    ContrastStretchParams *stretchParams_cpu  = NULL;
    ContrastStretchParams *stretchParams_cuda = NULL;
    cudaStream_t cudaStream                   = NULL;
```
The second stage of the algorithm, contrast stretching, is executed on the GPU. It uses the algorithm parameters computed on the PVA in the first stage.
Memory for the algorithm parameter structure and the image buffer is allocated using the `cudaHostAlloc()` CUDA API call. `cudaHostAlloc()` allocates page-locked memory on the host and maps the allocation into device space when the `cudaHostAllocMapped` flag is passed. For memory allocated using CUDA APIs, cuPVA and CUDA share address spaces, so the CUDA device pointer can be passed directly to cuPVA APIs. The PVA program that executes in the first stage loads the image from, and stores the computed image dynamic range to, these imported memory locations.
```c
CHECK_CUDA_ERROR_GOTO(
    cudaHostAlloc((void **)&stretchParams_cpu, sizeof(ContrastStretchParams), cudaHostAllocMapped), err,
    MemAllocFailed);
CHECK_CUDA_ERROR_GOTO(cudaHostGetDevicePointer((void **)&stretchParams_cuda, (void *)stretchParams_cpu, 0), err,
                      MemAllocFailed);
CHECK_CUDA_ERROR_GOTO(cudaHostAlloc((void **)&image_cpu, IMAGE_SIZE * sizeof(uint8_t), cudaHostAllocMapped), err,
                      MemAllocFailed);
CHECK_CUDA_ERROR_GOTO(cudaHostGetDevicePointer((void **)&image_cuda, (void *)image_cpu, 0), err, MemAllocFailed);
```
A CUDA stream is a sequence of operations that execute in issue order on the GPU. The `cudaStreamCreateWithFlags()` CUDA API creates an asynchronous stream. cuPVA's CUDA Interoperability does not support the CUDA default stream, so the CUDA stream handle must be created with the `cudaStreamNonBlocking` flag. Once a CUDA stream is created, all buffers allocated using CUDA APIs are directly available in cuPVA space.

```c
CHECK_CUDA_ERROR_GOTO(cudaStreamCreateWithFlags(&cudaStream, cudaStreamNonBlocking), err, MemAllocFailed);
```
The `CupvaCudaCreateStream()` API creates a PVA Stream object from a CUDA stream handle. The resulting cuPVA stream and the cudaStream can be used to submit PVA and CUDA commands, respectively. To create a Stream using this API, the current cuPVA context must have been created after a CUDA context was loaded. The `CupvaCudaContextSupportsCuda()` API checks whether the current context handle supports CUDA interoperability. Any CUDA runtime API call causes a CUDA context to be loaded, as would a CUDA driver API call to `cuCtxCreate()`. This is why the CUDA memory allocation API calls are placed before the cuPVA API calls in this code. Note that the `cupva::GetHardwareInfo()` call does not create a cuPVA context.

```c
bool cudaSupportFlag = false;
cupvaContext_t ctx;
CHECK_ERROR_GOTO(CupvaContextGetCurrent(&ctx), err, MemAllocFailed);
CHECK_ERROR_GOTO(CupvaCudaContextSupportsCuda(ctx, &cudaSupportFlag), err, MemAllocFailed);
if (cudaSupportFlag == false)
{
    printf("CUDA context should be loaded before creating a PVA Stream object from a CUDA stream handle\n");
    return 1;
}
cupvaStream_t stream;
CHECK_ERROR_GOTO(CupvaCudaCreateStream(&stream, cudaStream, CUPVA_PVA0, CUPVA_VPU_ANY), err, MemAllocFailed);
```
Next, the PVA program that computes the image dynamic range is created. The steps involving the CmdProgram initialization are very similar to the previous contrast-stretching tutorials. The CUDA-allocated image and parameter pointers serve as the program's input and output, respectively.
```c
if (ReadImageBuffer(INPUT_IMAGE_NAME, assetsDirectory, image_cpu, IMAGE_SIZE) != 0)
{
    err = -1;
    goto MemAllocFailed;
}

ContrastStretchParams algParams = {
    .inputLowPixelValue           = 0,
    .outputLowPixelValue          = 0,
    .inputHighPixelValue          = 0,
    .outputHighPixelValue         = 255,
    .saturationHistogramCountLow  = IMAGE_SIZE * SATURATED_PIXEL_PERCENTAGE_LOW_INTENSITY / 100,
    .saturationHistogramCountHigh = IMAGE_SIZE * SATURATED_PIXEL_PERCENTAGE_HIGH_INTENSITY / 100};
memcpy(stretchParams_cpu, &algParams, sizeof(ContrastStretchParams));

cupvaExecutable_t execComputeDynamicRange;
CHECK_ERROR_GOTO(
    CupvaExecutableCreate(&execComputeDynamicRange,
                          PVA_EXECUTABLE_DATA(cuda_interoperability_compute_dynamic_range_dev),
                          PVA_EXECUTABLE_SIZE(cuda_interoperability_compute_dynamic_range_dev)),
    err, ExecutableCreateFailed);

int32_t createdCmdProgramCount = 0;
cupvaCmd_t progComputeDynamicRange;
CHECK_ERROR_GOTO(CreateComputeDynamicRangeProg(&progComputeDynamicRange, &execComputeDynamicRange, image_cuda,
                                               IMAGE_WIDTH, IMAGE_HEIGHT, TILE_WIDTH, TILE_HEIGHT,
                                               stretchParams_cuda, &createdCmdProgramCount),
                 err, CmdProgramCreateFailed);
```
`progComputeDynamicRange` is submitted to the imported stream. cuPVA commands submitted to the imported cuPVA stream and CUDA commands submitted to the cudaStream begin execution in the order they are submitted, as if they shared the same stream. More specifically, PVA commands execute according to the OrderType specified in the `CupvaStreamSubmit()` call, while CUDA commands execute per the configuration of the cudaStream. Notice that this time we did not append a `CmdRequestFences` command and did not set up a fence with `CupvaFenceInit()` to wait on after submission; synchronization is achieved with CUDA APIs, as we will see in the next steps.

```c
cupvaCmd_t const *cmd[1] = {&progComputeDynamicRange};
CHECK_ERROR_GOTO(CupvaStreamSubmit(stream, cmd, NULL, 1, CUPVA_IN_ORDER, -1, -1), err, DeallocateAllResources);
```
The `StretchContrast()` call is submitted to the CUDA stream we imported in the previous steps. Therefore, it executes after the `progComputeDynamicRange` PVA program completes.

```c
StretchContrast(image_cuda, stretchParams_cuda, IMAGE_SIZE, cudaStream);
```
The `stretch_kernel` CUDA function performs image contrast stretching on the GPU. `stretch_kernel` is submitted to the cudaStream and runs asynchronously.

```cpp
__global__ void stretch_kernel(uint8_t *image, ContrastStretchParams *stretchParams_cuda, int32_t imageSize)
{
    int32_t inputDynamicRange =
        stretchParams_cuda->inputHighPixelValue - stretchParams_cuda->inputLowPixelValue + 1;
    int32_t outputDynamicRange = 256;
    int index                  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride                 = blockDim.x * gridDim.x;
    int32_t outputPixelValue;
    for (int i = index; i < imageSize; i += stride)
    {
        outputPixelValue = outputDynamicRange * (image[i] - stretchParams_cuda->inputLowPixelValue);
        outputPixelValue /= inputDynamicRange;
        outputPixelValue = max(0, min(outputPixelValue, 255));
        image[i]         = (uint8_t)outputPixelValue;
    }
}

void StretchContrast(uint8_t *image, ContrastStretchParams *stretchParams_cuda, int32_t imageSize,
                     cudaStream_t cudaStream)
{
    int blockSize  = BLOCK_SIZE;
    int blockCount = BLOCK_COUNT;
    stretch_kernel<<<blockCount, blockSize, 0, cudaStream>>>(image, stretchParams_cuda, imageSize);
}
```
The `cudaStreamSynchronize()` call blocks the host until all CUDA calls in the cudaStream and all PVA commands submitted to the cuPVA stream run to completion.

```c
CHECK_CUDA_ERROR_GOTO(cudaStreamSynchronize(cudaStream), err, DeallocateAllResources);
```
The enhanced image is written to the output file and allocated resources are freed in the last step.
```c
if (WriteImageBuffer(OUTPUT_IMAGE_NAME, ".", image_cpu, IMAGE_SIZE) != 0)
{
    err = -1;
    goto DeallocateAllResources;
}

DeallocateAllResources:
CmdProgramCreateFailed:
    if (createdCmdProgramCount > 0)
    {
        CupvaCmdDestroy(&progComputeDynamicRange);
    }
    CupvaExecutableDestroy(execComputeDynamicRange);
ExecutableCreateFailed:
MemAllocFailed:
    cudaFreeHost(image_cpu);
    cudaFreeHost(stretchParams_cpu);
    cudaStreamDestroy(cudaStream);
    return err;
}
```
## Output
Make sure to install the CUDA toolkit and the CUDA cross-compile packages for your target before building this tutorial. `CUDAToolkit_ROOT` should be set manually in the `CMakeLists.txt` file for QNX targets.
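For example, a hypothetical `CMakeLists.txt` line for a QNX cross-build might look like the following; the path is a placeholder for your QNX-targeted CUDA toolkit installation:

```cmake
# Placeholder path: point this at the CUDA toolkit built for your QNX target.
set(CUDAToolkit_ROOT "/path/to/qnx/cuda-toolkit" CACHE PATH "CUDA toolkit root for the QNX target")
```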
The path to the Tutorial assets directory containing the input image file "low-contrast-kodim08-768x512-grayscale.data" should be provided as an argument. The enhanced image output file "contrast-stretched-kodim08-768x512-grayscale.data" is written to the current working directory.
```
$ ./cuda_interoperability_cpp -a <Tutorial Assets Directory Path>
Read 393216 bytes from <Tutorial Assets Directory Path>/low-contrast-kodim08-768x512-grayscale.data
Wrote 393216 bytes to ./contrast-stretched-kodim08-768x512-grayscale.data
```

```
$ ./cuda_interoperability_c -a <Tutorial Assets Directory Path>
Read 393216 bytes from <Tutorial Assets Directory Path>/low-contrast-kodim08-768x512-grayscale.data
Wrote 393216 bytes to ./contrast-stretched-kodim08-768x512-grayscale.data
```