Overview

This sample demonstrates how to use vpiSubmitCUDAHostFunction to schedule custom CUDA work (NPP calls and user-written kernels) onto a VPI stream alongside VPI algorithm submissions. By using a CUDA host function callback, all GPU work is ordered on a single stream without manual synchronization between VPI and non-VPI stages.

The sample implements a multi-stage image processing pipeline:

VIC: Convert the input image (BGR, expanded to BGRA on the host) to grayscale
NPP + custom CUDA kernel: Gaussian blur followed by a top-border overlay
VIC: Vertical flip of the result

Two execution modes are compared:

CUDA host function mode: A single VPI stream handles the entire pipeline. The NPP blur and custom kernel are submitted through vpiSubmitCUDAHostFunction, so VPI guarantees correct ordering automatically.
Sync mode: A VPI stream wraps a CUDA stream, and manual vpiStreamSync calls separate the VPI and non-VPI stages to enforce the correct execution order.
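
The difference between the two modes can be illustrated with a toy model. The sketch below is not the VPI API: ToyStream, runHostFunctionMode, and runSyncMode are hypothetical names, and the "stream" is just an ordered list of stage names that runs when the host waits on it. It only shows that both modes execute the stages in the same order while the sync mode needs more blocking host waits.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a stream (hypothetical, NOT the VPI API):
// submitted stages run in submission order when sync() is called.
struct ToyStream
{
    std::vector<std::string> pending; // submitted but not yet executed
    std::vector<std::string> log;     // order in which stages actually ran
    int syncCount = 0;                // number of blocking host waits

    void submit(const std::string &name)
    {
        pending.push_back(name);
    }

    void sync()
    {
        ++syncCount;
        for (const std::string &name : pending)
        {
            log.push_back(name);
        }
        pending.clear();
    }
};

// Host function mode: all three stages are queued, then the host waits once.
inline ToyStream runHostFunctionMode()
{
    ToyStream s;
    s.submit("convert");
    s.submit("blur+border"); // stands in for the host function callback
    s.submit("flip");
    s.sync();
    return s;
}

// Sync mode: the host waits after each stage before issuing the next one.
inline ToyStream runSyncMode()
{
    ToyStream s;
    s.submit("convert");
    s.sync();
    assert(s.pending.empty());      // the previous stage must be finished
    s.log.push_back("blur+border"); // issued directly by the host, not via submit()
    s.sync();
    s.submit("flip");
    s.sync();
    return s;
}
```

In this model both functions produce the same stage order, with one wait in host function mode and three in sync mode. That matches the number of vpiStreamSync calls per iteration in the sample's two code paths.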

When an iteration count greater than one is supplied, the sample benchmarks both modes and prints timing statistics so you can compare the overhead of manual synchronization against the host-function approach.

Instructions

The command line parameters are:

<input image> [iteration_count]

where

input image: input image file name; accepts png, jpeg, and other common formats.
iteration_count (optional): number of benchmark iterations to run; defaults to 1. When greater than 1, timing statistics are printed for both modes.

Here are some examples:

Single run (correctness check only):
./vpi_sample_22_submit_cuda_host ../assets/kodim08.png
Benchmark with 100 iterations:
./vpi_sample_22_submit_cuda_host ../assets/kodim08.png 100
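
The sample reads iteration_count with std::atoi, which silently accepts input such as "10abc". As a minimal sketch of a stricter alternative, the hypothetical helper parseIterationCount below (not part of the sample) uses std::strtol and rejects trailing characters, overflow, and values below 1.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of the sample: strict parsing of iteration_count.
inline int parseIterationCount(const char *text)
{
    assert(text != nullptr);

    errno            = 0;
    char *end        = nullptr;
    const long value = std::strtol(text, &end, 10);

    // Reject empty input, trailing characters ("10abc"), and out-of-range values.
    if (end == text || *end != '\0' || errno == ERANGE || value > INT_MAX)
    {
        throw std::invalid_argument(std::string("iteration_count is not a valid integer: ") + text);
    }
    if (value < 1)
    {
        throw std::invalid_argument("iteration_count must be >= 1.");
    }
    return static_cast<int>(value);
}
```

With this helper, parseIterationCount("100") returns 100, while "10abc" and "0" both raise std::invalid_argument.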

Note: This sample requires a platform with VIC (Video Image Compositor) support and a CUDA-capable GPU with the NPP library. The output images are grayscale because the pipeline converts the input to single-channel Y8 format.

Features

vpiSubmitCUDAHostFunction: Schedule arbitrary CUDA work (NPP, custom kernels) on a VPI stream
Multi-Backend Pipeline: Combines VIC and CUDA processing in a single ordered stream
Two Execution Modes: Compares host-function-based ordering with manual synchronization
Benchmarking: Optional iteration count for performance comparison between modes
Output Verification: Validates that both modes produce identical results

Workflow

Load input image using OpenCV and convert to BGRA
Create VPI stream (native or wrapping a CUDA stream depending on mode)
Submit VIC color conversion (BGRA to grayscale) to the stream
Submit NPP Gaussian blur and custom border kernel:
- Host function mode: via vpiSubmitCUDAHostFunction callback
- Sync mode: via manual stream synchronization and direct CUDA calls
Submit VIC vertical flip to the stream
Synchronize and optionally copy result to host
Verify both modes produce identical output
Save output images to disk
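
When checking the flipped output by hand, a CPU reference for the last pipeline stage can help. The sketch below is a hypothetical helper (flipVerticalY8 is not part of the sample or of VPI) and assumes that a vertical flip reverses the row order of a pitch-linear, single-channel 8-bit image. It also shows why the row pitch has to be kept separate from the image width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical CPU reference, not part of the sample: flips a pitch-linear
// 8-bit single-channel image vertically (row 0 becomes the last row).
// pitchBytes is the distance in bytes between row starts; it may exceed width.
inline void flipVerticalY8(const std::uint8_t *src, std::uint8_t *dst, int width, int height, int pitchBytes)
{
    assert(src != nullptr && dst != nullptr && src != dst);
    assert(width > 0 && height > 0 && pitchBytes >= width);

    for (int y = 0; y < height; ++y)
    {
        const std::uint8_t *srcRow = src + static_cast<std::size_t>(y) * pitchBytes;
        std::uint8_t *dstRow       = dst + static_cast<std::size_t>(height - 1 - y) * pitchBytes;
        // Copy only the visible pixels; padding bytes in dst are left untouched.
        std::memcpy(dstRow, srcRow, static_cast<std::size_t>(width));
    }
}
```

For a 3x2 image stored with a 4-byte pitch, the two rows swap places and the padding byte at the end of each destination row keeps its previous value.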

Results

The sample produces two output images that should be identical:

multi_backend_cuda_hostfn_pipelined.png — result from the CUDA host function mode
multi_backend_sync_pipelined.png — result from the sync mode

When run with an iteration count greater than 1, timing statistics in microseconds (min, max, sum, mean, median, standard deviation) are printed for each mode.
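
The sample prints the statistics for each mode separately and leaves the comparison to the reader. One possible way to reduce the two timing vectors to a single number is to subtract the medians, as sketched below. The helpers medianUs and syncOverheadUs are hypothetical and not part of the sample. The median is used here on the assumption that it is less sensitive to occasional slow iterations than the mean.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: median of per-iteration timings, in microseconds.
inline double medianUs(std::vector<std::int64_t> timings)
{
    assert(!timings.empty());
    std::sort(timings.begin(), timings.end());

    const std::size_t n = timings.size();
    if (n % 2 != 0)
    {
        return static_cast<double>(timings[n / 2]);
    }
    return static_cast<double>(timings[n / 2 - 1] + timings[n / 2]) / 2.0;
}

// Hypothetical helper: a positive result means the sync mode was slower
// than the host function mode by that many microseconds per iteration.
inline double syncOverheadUs(const std::vector<std::int64_t> &hostFnTimings,
                             const std::vector<std::int64_t> &syncTimings)
{
    return medianUs(syncTimings) - medianUs(hostFnTimings);
}
```

With made-up timings of {100, 120, 110} for host function mode and {150, 130, 140, 160} for sync mode, the medians are 110 and 145, so syncOverheadUs returns 35. These numbers are illustrative only and are not measurements from the sample.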

Source Code

For convenience, here's the code that is also installed in the samples directory.

 #include <opencv2/core/version.hpp>
 #include <opencv2/opencv.hpp>
  
 #include <vpi/CUDAInterop.h>
 #include <vpi/Context.h>
 #include <vpi/Stream.h>
 #include <vpi/algo/ConvertImageFormat.h>
 #include <vpi/algo/ImageFlip.h>
 #if CV_MAJOR_VERSION >= 3
 #    include <opencv2/imgcodecs.hpp>
 #else
 #    include <opencv2/highgui/highgui.hpp>
 #endif
  
 #include "custom.cuh"
  
 #include <vpi/OpenCVInterop.hpp>
  
 #include <cuda_runtime.h>
 #include <npp.h>
 #include <nppcore.h>
 #include <nppi.h>
  
 #include <algorithm>
 #include <cassert>
 #include <chrono>
 #include <cmath>
 #include <cstdint>
 #include <cstdlib>
 #include <cstring>
 #include <iomanip>
 #include <iostream>
 #include <numeric>
 #include <sstream>
 #include <stdexcept>
 #include <string>
 #include <vector>
  
 #define CHECK_VPI_STATUS(STMT)                                \
     do                                                        \
     {                                                         \
         VPIStatus status = (STMT);                            \
         if (status != VPI_SUCCESS)                            \
         {                                                     \
             char buffer[VPI_MAX_STATUS_MESSAGE_LENGTH];       \
             vpiGetLastStatusMessage(buffer, sizeof(buffer));  \
             std::ostringstream ss;                            \
             ss << vpiStatusGetName(status) << ": " << buffer; \
             throw std::runtime_error(ss.str());               \
         }                                                     \
     } while (0)
  
 #define CHECK_CUDA_STATUS(STMT)                 \
     do                                          \
     {                                           \
         cudaError_t status = (STMT);            \
         if (status != cudaSuccess)              \
         {                                       \
             std::ostringstream ss;              \
             ss << cudaGetErrorString(status);   \
             throw std::runtime_error(ss.str()); \
         }                                       \
     } while (0)
  
 #define CHECK_NPP_STATUS(STMT)                  \
     do                                          \
     {                                           \
         NppStatus status = (STMT);              \
         if (status != NPP_SUCCESS)              \
         {                                       \
             std::ostringstream ss;              \
             ss << "NPP error " << status;       \
             throw std::runtime_error(ss.str()); \
         }                                       \
     } while (0)
  
 enum class PipelineMode
 {
     CUDA_HOST_FUNCTION_MODE, // One stream, vpiSubmitCUDAHostFunction for NPP + custom kernel
     SYNC_MODE                // Wrapped stream, manual sync between stages
 };
  
 static void printBenchmarkStats(std::vector<int64_t> data)
 {
     size_t n = data.size();
     std::cout << std::fixed << std::setprecision(2);
     std::cout << "Input Data Size: " << n << std::endl;
     std::cout << "\nStatistics (in microseconds):" << std::endl;
  
     // =============================
     // Calculate Min and Max
     int64_t min_val = *std::min_element(data.begin(), data.end());
     int64_t max_val = *std::max_element(data.begin(), data.end());
     std::cout << "  Min Value: " << min_val << std::endl;
     std::cout << "  Max Value: " << max_val << std::endl;
  
     // =============================
     // Calculate sum and mean
     int64_t sum = std::accumulate(data.begin(), data.end(), int64_t{0});
     double mean = ((double)sum / n);
     std::cout << "  Sum: " << (double)sum << std::endl;
     std::cout << "  Mean: " << mean << std::endl;
  
     // =============================
     // Calculate median
     std::vector<int64_t> sorted_data = data;
     std::sort(sorted_data.begin(), sorted_data.end());
     double median;
     if (n % 2 != 0)
     {
         // Odd number of elements
         median = (double)sorted_data[n / 2];
     }
     else
     {
         // Even number of elements
         int64_t mid1 = sorted_data[n / 2 - 1];
         int64_t mid2 = sorted_data[n / 2];
         median       = (double)(mid1 + mid2) / 2.0;
     }
     std::cout << "  Median: " << median << std::endl;
  
     // =============================
     // Calculate standard deviation
     long double variance_sum = 0.0;
     for (int64_t val : data)
     {
         long double diff = (long double)val - mean;
         variance_sum += diff * diff;
     }
     double std_dev = (double)std::sqrt(variance_sum / n);
     std::cout << "  Standard Deviation: " << std_dev << std::endl;
 }
  
 static VPIImageData getImgData(int width, int height, VPIByte *pBase)
 {
     VPIImageBufferPitchLinear imgData = {};
     imgData.format                    = VPI_IMAGE_FORMAT_Y8;
     imgData.numPlanes                 = 1;
     imgData.planes[0].width           = width;
     imgData.planes[0].height          = height;
     imgData.planes[0].pitchBytes      = sizeof(uint8_t) * width;
     imgData.planes[0].pixelType       = VPI_PIXEL_TYPE_INVALID;
     imgData.planes[0].offsetBytes     = 0;
     imgData.planes[0].pBase           = pBase;
     VPIImageData vpiImgData           = {};
     vpiImgData.bufferType             = VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR;
     vpiImgData.buffer.pitch           = imgData;
     return vpiImgData;
 }
  
 struct CudaHostFnData
 {
     uint8_t *cudaGrayImg;
     uint8_t *cudaBlurredImg;
     uint8_t *cudaBorderedImg;
     int width;
     int height;
     int pitchBytes;
     NppStreamContext nppCtx;
     int borderPixels;
 };
  
 static void nppBlurAndBorderCallback(cudaStream_t cudaStream, void *userData)
 {
     CudaHostFnData *data = static_cast<CudaHostFnData *>(userData);
     data->nppCtx.hStream = cudaStream;
     CHECK_CUDA_STATUS(cudaStreamGetFlags(cudaStream, &data->nppCtx.nStreamFlags));
  
     CHECK_NPP_STATUS(nppiFilterGaussBorder_8u_C1R_Ctx(
         data->cudaGrayImg, data->pitchBytes, {data->width, data->height}, {0, 0}, data->cudaBlurredImg,
         data->pitchBytes, {data->width, data->height}, NPP_MASK_SIZE_3_X_3, NPP_BORDER_REPLICATE, data->nppCtx));
  
     submitBorderChange(data->cudaBlurredImg, data->cudaBorderedImg, static_cast<uint32_t>(data->width),
                        static_cast<uint32_t>(data->height), data->borderPixels, cudaStream);
 }
  
 static std::vector<int64_t> runPipeline(const std::string &filename, cv::Mat &output, PipelineMode mode, int numIters,
                                         bool writeOutputOnFinalIteration)
 {
     /*
      * This sample runs Tegra (VIC) and custom CUDA work in order on one stream.
      *
      * CUDA_HOST_FUNCTION_MODE: Single VPI stream. VIC convert -> vpiSubmitCUDAHostFunction
      * (NPP blur + custom border kernel) -> VIC flip. Ordering is guaranteed by VPI.
      *
      * SYNC_MODE: Single wrapped CUDA stream. Same pipeline with manual vpiStreamSync
      * between stages so NPP and custom kernel run after VIC, and VIC flip after them.
      *
      * Pipeline: BGR input -> VIC convert to grayscale -> NPP Gaussian blur ->
      * custom top border -> VIC vertical flip -> output.
      *
      * Reuses stream and images for numIters iterations. Returns per-iteration times (us).
      * Copies output to cv::Mat only on the final iteration when writeOutputOnFinalIteration.
      */
  
     VPIStream stream;
     cudaStream_t cudaStream = nullptr;
  
     if (mode == PipelineMode::CUDA_HOST_FUNCTION_MODE)
     {
         CHECK_VPI_STATUS(vpiStreamCreate(0, &stream));
     }
     else
     {
         CHECK_CUDA_STATUS(cudaStreamCreate(&cudaStream));
         CHECK_VPI_STATUS(vpiStreamCreateWrapperCUDA(cudaStream, 0, &stream));
     }
  
     NppStreamContext nppCtx = {}; // zero-initialize so fields not set below are well defined
     nppCtx.hStream = cudaStream;
     CHECK_CUDA_STATUS(cudaGetDevice(&nppCtx.nCudaDeviceId));
     cudaDeviceProp gpuProps;
     CHECK_CUDA_STATUS(cudaGetDeviceProperties(&gpuProps, nppCtx.nCudaDeviceId));
     nppCtx.nMultiProcessorCount               = gpuProps.multiProcessorCount;
     nppCtx.nMaxThreadsPerMultiProcessor       = gpuProps.maxThreadsPerMultiProcessor;
     nppCtx.nMaxThreadsPerBlock                = gpuProps.maxThreadsPerBlock;
     nppCtx.nSharedMemPerBlock                 = gpuProps.sharedMemPerBlock;
     nppCtx.nCudaDevAttrComputeCapabilityMajor = gpuProps.major;
     nppCtx.nCudaDevAttrComputeCapabilityMinor = gpuProps.minor;
     if (cudaStream != nullptr)
     {
         CHECK_CUDA_STATUS(cudaStreamGetFlags(cudaStream, &nppCtx.nStreamFlags));
     }
     else
     {
         nppCtx.nStreamFlags = 0;
     }
  
     cv::Mat bgrImg, cvImage;
     bgrImg = cv::imread(filename);
     if (bgrImg.empty())
     {
         throw std::runtime_error("Can't open '" + filename + "'");
     }
     if (bgrImg.channels() == 3)
     {
         cv::cvtColor(bgrImg, cvImage, cv::COLOR_BGR2BGRA);
     }
     else
     {
         cvImage = bgrImg;
     }
     int width  = cvImage.cols;
     int height = cvImage.rows;
  
     VPIImage input, gray, bordered, vpiOutput;
     CHECK_VPI_STATUS(vpiImageCreateWrapperOpenCVMat(cvImage, 0, &input));
     CHECK_VPI_STATUS(vpiImageCreate(width, height, VPI_IMAGE_FORMAT_Y8, 0, &vpiOutput));
  
     int pitchBytes = sizeof(uint8_t) * width;
     uint8_t *cudaGrayImg, *cudaBlurredImg, *cudaBorderedImg;
     CHECK_CUDA_STATUS(cudaMalloc(&cudaGrayImg, pitchBytes * height));
     CHECK_CUDA_STATUS(cudaMalloc(&cudaBlurredImg, pitchBytes * height));
     CHECK_CUDA_STATUS(cudaMalloc(&cudaBorderedImg, pitchBytes * height));
  
     VPIImageData grayData     = getImgData(width, height, static_cast<VPIByte *>(cudaGrayImg));
     VPIImageData borderedData = getImgData(width, height, static_cast<VPIByte *>(cudaBorderedImg));
     CHECK_VPI_STATUS(vpiImageCreateWrapper(&grayData, NULL, 0, &gray));
     CHECK_VPI_STATUS(vpiImageCreateWrapper(&borderedData, NULL, 0, &bordered));
  
     std::vector<int64_t> timings;
     timings.reserve(static_cast<size_t>(numIters));
  
     for (int iter = 0; iter < numIters; ++iter)
     {
         auto start = std::chrono::high_resolution_clock::now();
  
         if (mode == PipelineMode::CUDA_HOST_FUNCTION_MODE)
         {
             VPIConvertImageFormatParams convertFormatParams = {VPI_CONVERSION_CLAMP, 1, 0, 0, VPI_INTERP_NEAREST,
                                                                VPI_INTERP_NEAREST};
             CHECK_VPI_STATUS(vpiSubmitConvertImageFormat(stream, VPI_BACKEND_VIC, input, gray, &convertFormatParams));
  
             CudaHostFnData hostFnData = {cudaGrayImg, cudaBlurredImg, cudaBorderedImg, width,
                                          height,      pitchBytes,     nppCtx,          50};
             CHECK_VPI_STATUS(vpiSubmitCUDAHostFunction(stream, nppBlurAndBorderCallback, &hostFnData));
  
             CHECK_VPI_STATUS(vpiSubmitImageFlip(stream, VPI_BACKEND_VIC, bordered, vpiOutput, VPI_FLIP_VERT));
  
             CHECK_VPI_STATUS(vpiStreamSync(stream));
         }
         else
         {
             VPIConvertImageFormatParams convertFormatParams = {VPI_CONVERSION_CLAMP, 1, 0, 0, VPI_INTERP_NEAREST,
                                                                VPI_INTERP_NEAREST};
             CHECK_VPI_STATUS(vpiSubmitConvertImageFormat(stream, VPI_BACKEND_VIC, input, gray, &convertFormatParams));
             CHECK_VPI_STATUS(vpiStreamSync(stream));
  
             CHECK_NPP_STATUS(nppiFilterGaussBorder_8u_C1R_Ctx(cudaGrayImg, pitchBytes, {width, height}, {0, 0},
                                                               cudaBlurredImg, pitchBytes, {width, height},
                                                               NPP_MASK_SIZE_3_X_3, NPP_BORDER_REPLICATE, nppCtx));
  
             submitBorderChange(cudaBlurredImg, cudaBorderedImg, width, height, 50, cudaStream);
             CHECK_VPI_STATUS(vpiStreamSync(stream));
  
             CHECK_VPI_STATUS(vpiSubmitImageFlip(stream, VPI_BACKEND_VIC, bordered, vpiOutput, VPI_FLIP_VERT));
  
             CHECK_VPI_STATUS(vpiStreamSync(stream));
         }
  
         auto end = std::chrono::high_resolution_clock::now();
         timings.push_back(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
  
         const bool isFinalIteration = (iter == numIters - 1);
         if (isFinalIteration && writeOutputOnFinalIteration)
         {
             VPIImageData outData;
             CHECK_VPI_STATUS(vpiImageLockData(vpiOutput, VPI_LOCK_READ, VPI_IMAGE_BUFFER_HOST_PITCH_LINEAR, &outData));
             VPIImageBufferPitchLinear &outDataPitch = outData.buffer.pitch;
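             // Pass the row pitch as the step: the locked buffer may be padded beyond width bytes per row.
             cv::Mat cvOut(height, width, CV_8UC1, outDataPitch.planes[0].pBase, outDataPitch.planes[0].pitchBytes);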
             output = cvOut.clone();
             CHECK_VPI_STATUS(vpiImageUnlock(vpiOutput));
         }
     }
  
     vpiImageDestroy(input);
     vpiImageDestroy(gray);
     vpiImageDestroy(bordered);
     vpiImageDestroy(vpiOutput);
     cudaFree(cudaGrayImg);
     cudaFree(cudaBlurredImg);
     cudaFree(cudaBorderedImg);
     vpiStreamDestroy(stream);
     if (cudaStream != nullptr)
     {
         cudaStreamDestroy(cudaStream);
     }
     return timings;
 }
  
 int main(int argc, char *argv[])
 {
     int retval         = 0;
     VPIContext context = nullptr;
  
     try
     {
         if (argc < 2 || argc > 3)
         {
             throw std::runtime_error(std::string("Usage: ") + argv[0] + " <input image> [iteration_count]");
         }
  
         const std::string filename = argv[1];
         int iterationCount         = 1;
         if (argc == 3)
         {
             iterationCount = std::atoi(argv[2]);
             if (iterationCount < 1)
             {
                 throw std::runtime_error("iteration_count must be >= 1.");
             }
         }
  
         CHECK_VPI_STATUS(vpiContextCreate(VPI_BACKEND_CPU | VPI_BACKEND_CUDA | VPI_BACKEND_VIC, &context));
         CHECK_VPI_STATUS(vpiContextSetCurrent(context));
  
         cv::Mat hostFnOutput;
         cv::Mat syncOutput;
  
         if (iterationCount > 1)
         {
             std::cout << "Benchmark: CUDA host function mode (" << iterationCount << " iterations)" << std::endl;
             std::vector<int64_t> hostFnTimings =
                 runPipeline(filename, hostFnOutput, PipelineMode::CUDA_HOST_FUNCTION_MODE, iterationCount, false);
             std::cout << "--- CUDA host function mode ---" << std::endl;
             printBenchmarkStats(hostFnTimings);
             std::cout << std::endl;
  
             std::cout << "Benchmark: Sync mode (" << iterationCount << " iterations)" << std::endl;
             std::vector<int64_t> syncTimings =
                 runPipeline(filename, syncOutput, PipelineMode::SYNC_MODE, iterationCount, false);
             std::cout << "--- Sync mode ---" << std::endl;
             printBenchmarkStats(syncTimings);
             std::cout << std::endl;
         }
  
         runPipeline(filename, hostFnOutput, PipelineMode::CUDA_HOST_FUNCTION_MODE, 1, true);
         runPipeline(filename, syncOutput, PipelineMode::SYNC_MODE, 1, true);
  
         if (hostFnOutput.size() != syncOutput.size() || hostFnOutput.type() != syncOutput.type())
         {
             throw std::runtime_error("FAIL: CUDA host function and sync outputs differ in size or type.");
         }
  
         cv::Mat diff;
         cv::absdiff(hostFnOutput, syncOutput, diff);
         int numDiffPixels = cv::countNonZero(diff);
         if (numDiffPixels != 0)
         {
             std::ostringstream ss;
             ss << "FAIL: Outputs differ (" << numDiffPixels << " pixels).";
             throw std::runtime_error(ss.str());
         }
  
         std::cout << "PASS: CUDA host function and sync mode outputs are identical." << std::endl;
  
         cv::imwrite("multi_backend_cuda_hostfn_pipelined.png", hostFnOutput);
         cv::imwrite("multi_backend_sync_pipelined.png", syncOutput);
     }
     catch (const std::exception &e)
     {
         std::cerr << e.what() << std::endl;
         retval = 1;
     }
  
     vpiContextDestroy(context);
  
     if (retval == 0)
     {
         // The Jetson CUDA/EGL driver stack can abort in process-global finalizers
         // after this CUDA-runtime sample has already released its resources.
         std::cout.flush();
         std::cerr.flush();
         std::_Exit(EXIT_SUCCESS);
     }
  
     return retval;
 }