VPI is a library that provides a collection of computer vision and image processing algorithms that can be seamlessly executed in a variety of hardware accelerators, called backends.
The goal is to provide a uniform interface to these backends, while maintaining high performance. To achieve that, several shared memory mapping mechanisms between backends are used, depending on memory characteristics, coupled with high performance implementations of algorithms and availability of backend-agnostic event synchronization mechanisms.
The VPI architectural overview is as follows:
The API follows the paradigm where object allocation and setup take place in an initialization phase. The application loop, where the main processing occurs, then follows, using the objects created during initialization. Once completed, the created objects are destroyed and the environment is cleaned up. For robotics software applications where memory allocations are limited in both time and space, the amount of memory management control provided by VPI is beneficial.
The core components of VPI include:
VPI contexts serve as a container of other VPI objects along with some configurations that apply to them.
Every host thread has an active context. VPI objects created while a context is active are owned by it.
By default all host threads use the same default context, which is created automatically by VPI. There's no need for explicit context management by the user in this case.
When finer control over contexts is needed, the user can create their own contexts. This lets the user specify, among other things, which backends the context supports at creation time, effectively allowing support for particular hardware to be masked. For example, creating a stream for the CUDA backend fails if the current context doesn't have the VPI_BACKEND_CUDA flag set. When 0 is passed as flags, VPI inspects the running platform and enables the available backends.
Sharing objects (buffers, payloads, events, ...) among different contexts is not permitted.
There is no limit other than available memory for the number of created contexts.
The current context can be manipulated by the user if needed.
Refer to context API reference for more information.
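For illustration, a minimal sketch of creating a CUDA-only context and making it current on the calling thread. Error handling is omitted and header names may vary slightly between VPI releases:

```c
#include <vpi/Context.h>

VPIContext ctx = NULL;

/* Create a context that only allows the CUDA backend. */
vpiContextCreate(VPI_BACKEND_CUDA, &ctx);

/* Make it the current context of this thread; objects created
   from now on are owned by it. */
vpiContextSetCurrent(ctx);

/* ... create streams and buffers, submit work ... */

/* Destroying the context also destroys the objects it owns. */
vpiContextDestroy(ctx);
```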
By default, there is a single global context created before the first VPI object is created.
This global context is initially shared among all user threads and cannot be destroyed by the user.
For most applications, the user can use the global context. When finer control over how objects are grouped together is required, or when some level of independence between pipelines is needed, the user may wish to explicitly create and manipulate contexts.
Each user thread has a context stack not shared with other threads.
The top context in the stack is the current context for that thread.
By default, the context stack has one context in it, the global context. Consequently, all new threads have the same global context set as their current context.
Making a context current in a given stack amounts to replacing the top context, either the global context or the most recently pushed context, with the given context. The replaced context does not belong to the stack anymore.
However, pushing a context into a stack does not replace anything. The top context is kept in the stack and the new pushed context is put at the top, thereby becoming the new current context.
The user can push and pop contexts from the stack at will. This allows for temporarily creating pipelines in a new context without disturbing the existing context.
To avoid leakage, it is important to match the number of pushes and pops on a given context stack. Be aware that the context stack can have at most 8 contexts in it.
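A short sketch of the push/pop pattern for temporarily working under a separate context; the calls follow the context API described above:

```c
VPIContext tmpCtx = NULL;
vpiContextCreate(0, &tmpCtx);     /* enable whatever backends are available */

/* tmpCtx becomes the new current context; the previous top stays in the stack. */
vpiContextPush(tmpCtx);

/* ... build a temporary pipeline; its objects are owned by tmpCtx ... */

/* Restore the previous current context. Pops must match pushes. */
VPIContext popped = NULL;
vpiContextPop(&popped);

vpiContextDestroy(tmpCtx);
```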
The main entry-point to the API is the VPIStream object. This object represents a FIFO command queue storing a list of commands to be executed by some backend. Commands may consist of running a particular CV algorithm, executing a host function (via vpiSubmitHostFunction), or signaling an event.
At creation time, it is configured with the backends that will eventually execute the tasks submitted to it. By default, when passing 0 as flags, it'll use the backends enabled by the current context. Limiting the number of available backends helps minimize resource usage.
Each stream launches an internal worker thread that implements a task queue to handle asynchronous task execution. When exactly the thread is created is not specified, but it usually exists from stream creation until the stream is destroyed.
Invoking any CV function on a particular backend pushes a corresponding command to the VPIStream worker thread and immediately returns. The queued commands are then dispatched to the hardware backend assigned to them for execution. This allows the API functions to be executed asynchronously with respect to the calling thread.
Refer to stream API reference for more information.
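A minimal sketch of the stream lifecycle, assuming the creation and synchronization functions named in this document:

```c
#include <vpi/Stream.h>

VPIStream stream = NULL;
vpiStreamCreate(0, &stream);   /* 0: allow all backends enabled by the current context */

/* ... vpiSubmit* calls enqueue work here and return immediately ... */

vpiStreamSync(stream);         /* block until all queued work has finished */
vpiStreamDestroy(stream);
```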
Every algorithm provided by VPI is implemented in one or more backends. Different implementations of the same algorithm return functionally similar results given the same inputs. Small variations between backends might occur, mostly due to optimizations tailored to a particular backend, e.g. use of fixed-point instead of floating-point math. Bit-exact equality between backends should not be relied upon.
For example, the CUDA backend implementation holds a cudaStream_t handle and other CUDA device information that allows launching of the underlying CUDA kernels.
Buffers represent the data VPI algorithms work with. Abstractions for three kinds of data are provided:
Users can have VPI manage allocation of all three types of buffers. Or, for images and arrays, existing memory can be wrapped into a VPI buffer. This is useful when interoperability with other libraries is required, such as using an OpenCV cv::Mat buffer as input to a VPI algorithm.
Common attributes for all buffer types are their size and the element type.
VPI images represent any kind of 2D data, such as images themselves, vector fields embedded in a 2D space, 2D heat maps, etc.
The images are characterized by their width, height and format.
When creating a VPIImage object, the flags passed during creation specify which backends the image can work with. One or more VPIBackend enums can be or-ed together. Passing 0 (or no backend flag) enables the set of backends allowed by the current context, which by default is all available backends.
Refer to image API reference for more information.
Image data can be accessed from the host using the vpiImageLock function. The function requires that the image have the CPU backend enabled. It fills a VPIImageData structure with image information that allows the user to properly address and interpret all image pixels. Once the user is done working on the image data from the host, vpiImageUnlock must be called. While the image is locked, it can't be accessed by an algorithm running asynchronously. It can, however, be locked recursively by the same thread that locked it in the first place. Just remember to pair each vpiImageLock call with a corresponding vpiImageUnlock.
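A small sketch of the lock/unlock pair; the VPI_LOCK_READ_WRITE mode name follows the VPI 1.x API and may differ in other releases:

```c
VPIImageData data;

/* Map the image into host memory; the image must have the CPU backend enabled. */
vpiImageLock(image, VPI_LOCK_READ_WRITE, &data);

/* 'data' now describes each plane (dimensions, row pitch and a pointer to the
   pixels), so the host can address every pixel directly. */

vpiImageUnlock(image);
```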
VPI supports a variety of image formats representing different pixel types such as single-channel 8-, 16- or 32-bit, unsigned and signed, multi-channel RGB and RGBA, semi-planar NV12, etc. Not all algorithms support images with all types.
The image format is represented by the VPIImageFormat enum. Each format is defined by several components, such as color space, number of planes, data layout, etc. There are functions to extract each component from the image format, as well as to modify an existing format.
Not all algorithms support all image formats provided; however, usually several are supported.
2D images are most commonly laid out in memory in pitch-linear format, i.e. row by row, one after the other. Each row can be larger than necessary, with some padding added to the end to have properly aligned row start addresses.
There's also the option for creating or wrapping memory using a proprietary block-linear layout. Depending on the algorithm and the backend, it might be more efficient to create 2D memories using this format.
See Image formats for more information.
Users can create images that wrap externally allocated CUDA and host (CPU) memory using the functions vpiImageCreateCudaMemWrapper and vpiImageCreateHostMemWrapper respectively. In both cases, the user must fill a VPIImageData structure with the required information and pass it to the function.
It's also possible to wrap an EGLImage handle using vpiImageCreateEglImageWrapper and an NvBuffer using vpiImageCreateNvBufferWrapper.
In all these cases, the VPIImage object doesn't own the memory buffer. When the VPIImage is destroyed, the buffer isn't deallocated.
As with image buffers managed by VPI, these wrapping functions accept flags that define which backends they can be used with.
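The following sketch wraps a tightly packed host buffer. The VPIImageData field names follow the VPI 1.x layout and may differ in other releases, so treat them as assumptions for illustration:

```c
#include <stdlib.h>
#include <string.h>
#include <vpi/Image.h>

int width = 640, height = 480;
uint8_t *hostPixels = malloc(width * height);   /* application-owned 8-bit pixels */

VPIImageData data;
memset(&data, 0, sizeof(data));
data.format               = VPI_IMAGE_FORMAT_U8;
data.numPlanes            = 1;
data.planes[0].width      = width;
data.planes[0].height     = height;
data.planes[0].pitchBytes = width;              /* no padding at the end of rows */
data.planes[0].data       = hostPixels;

VPIImage wrapper = NULL;
vpiImageCreateHostMemWrapper(&data, 0, &wrapper);

/* ... use 'wrapper' as any other VPIImage ... */

vpiImageDestroy(wrapper);   /* does NOT free hostPixels */
free(hostPixels);
```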
VPI arrays represent 1D data, such as keypoint lists, bounding boxes, transforms, etc.
Arrays are characterized by their capacity, size and element format. As with images, the flags are used to specify which backend they can work with.
Array formats are drawn from the VPIArrayType enum. Algorithms that require array inputs/outputs, such as the KLT template tracker, usually accept one specific array format.
A unique feature of VPIArray is that, while the capacity of the array is fixed for the lifetime of the object, its size can change. Any API that outputs to an array sets the size parameter to the number of valid elements contained in the array. The user can also use vpiArrayGetSize and vpiArraySetSize to query and modify the size of an array.
Refer to array API reference for more information.
Array data can be accessed from host using the vpiArrayLock function. It works like its image counterpart, including recursive locking by the same thread.
Users can also create arrays that wrap externally allocated CUDA and host memory using the functions vpiArrayCreateCudaMemWrapper and vpiArrayCreateHostMemWrapper respectively. In both cases, the user must fill a VPIArrayData structure with the required information and pass it to the function.
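A short sketch of array creation, size query and host access; enum and struct names follow the VPI 1.x API and may differ in other releases:

```c
#include <vpi/Array.h>

VPIArray keypoints = NULL;

/* Capacity is fixed at creation; the size starts at 0 and is updated by algorithms. */
vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT, 0, &keypoints);

int32_t size = 0;
vpiArrayGetSize(keypoints, &size);   /* number of valid elements currently stored */

VPIArrayData data;
vpiArrayLock(keypoints, VPI_LOCK_READ, &data);
/* data.data points to the valid elements of the array's element type */
vpiArrayUnlock(keypoints);

vpiArrayDestroy(keypoints);
```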
VPI pyramids represent a collection of VPI images stacked together, all having the same format, but possibly different dimensions.
Pyramids are characterized by their number of levels, base level dimensions, scale factor and image format. The scale factor represents the ratio of one level's dimensions over the prior level's dimensions. For instance, when scale=0.5 the pyramid is dyadic, i.e., each level has half the dimensions of the previous one.
Often it's required to process one pyramid level as input or output to a VPI algorithm. The user must then use vpiImageCreatePyramidLevelWrapper, specifying the pyramid and which level is to be wrapped. The returned VPIImage handle can be used like any other image. The resulting image inherits the enabled backends from the pyramid. Once work on this image is done, it must be destroyed with vpiImageDestroy.
Refer to pyramid API reference for more information.
As with images and arrays, the user can access the whole pyramid data from host using the function vpiPyramidLock, provided that the pyramid is enabled for CPU backend. This function fills a VPIPyramidData structure that is basically an array of VPIImageData. Once work with VPIPyramidData is done, call vpiPyramidUnlock to unmap the pyramid from host and free resources. Recursive locking works just like images and arrays.
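A rough sketch of pyramid creation and level wrapping; the exact argument order of these functions varies between VPI releases, so it is assumed here for illustration:

```c
#include <vpi/Pyramid.h>

VPIPyramid pyr = NULL;

/* 4-level dyadic (scale=0.5) pyramid with a 640x480 U8 base level. */
vpiPyramidCreate(640, 480, VPI_IMAGE_FORMAT_U8, 4, 0.5f, 0, &pyr);

/* Wrap level 2 as a regular VPIImage so it can be fed to an algorithm
   (parameter order assumed). */
VPIImage level2 = NULL;
vpiImageCreatePyramidLevelWrapper(pyr, 2, &level2);

/* ... submit algorithms using 'level2' ... */

vpiImageDestroy(level2);   /* destroy the wrapper; the pyramid still owns the data */
vpiPyramidDestroy(pyr);
```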
Each compute function in the API is executed asynchronously with respect to the calling thread, i.e., it returns immediately without waiting for completion. There are two ways of synchronizing with the backend. One is to wait until all the commands in the VPIStream queue are finished by using the vpiStreamSync call. This approach, while simple, doesn't allow for fine-grained (i.e. "wait until function X is completed") or inter-stream (i.e. "before running function A in stream B, wait until function C in stream D finishes") synchronization. That's where VPIEvent objects come in. Conceptually they correspond to binary semaphores and are designed to closely mimic events in the CUDA API:
Refer to event API reference for more information.
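A small sketch of the event mechanism using the functions named in this document (vpiEventRecord, vpiStreamWaitFor, vpiEventSync):

```c
#include <vpi/Event.h>

VPIEvent ev = NULL;
vpiEventCreate(0, &ev);

/* ... submit work to streamA ... */

/* Mark streamA's current queue position. */
vpiEventRecord(ev, streamA);

/* Make streamB wait (asynchronously) until that position is reached... */
vpiStreamWaitFor(streamB, ev);

/* ...or make the host block until it is reached. */
vpiEventSync(ev);

vpiEventDestroy(ev);
```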
All API functions are thread-safe. Concurrent host access to API objects is serialized and executed in an unspecified order. All API calls use a VPIContext instance that is thread-specific and stored in TLS. If the context pointer for the current thread is NULL (no context is set), all API calls will use a default "global" context created during library initialization. API objects have no concept of thread affinity; in other words, if both threads use the same context instance, the object created in one thread can be safely destroyed by another thread.
Most of the API functions are non-blocking. Specifically, the set of functions that can block when called is limited to: vpiStreamSync, vpiStreamDestroy, vpiContextDestroy, vpiEventSync and the several vpiSubmit* functions when the stream command queue is full. Since implicit synchronization in the API implementation is minimal, it's up to the user to make sure the resulting order of dependent function calls is legal. Invalid calls, however, should always be handled gracefully (via an appropriate error code) and should not lead to application crashes or corruption of objects' internal state.
The device command queue model is loosely based on the CUDA Stream API, and can be summarized as follows:
Pipeline examples, and how to implement them using VPI, are explained in the following sections.
In this example, a pipeline with a simple box filter operation is implemented to process an input image. This is quite similar to the ImageBlurring tutorial.
The code for implementing the pipeline is as follows.
Include necessary headers. In this example, image buffers are used, a stream, and the Box Filter algorithm.
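A sketch of the includes this example needs; exact header paths may vary slightly between VPI releases:

```c
#include <vpi/Image.h>
#include <vpi/Stream.h>
#include <vpi/algo/BoxFilter.h>
```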
Create the image buffers to be used.
A 640x480 1-channel (grayscale) input image is created with unsigned 8-bit pixel elements, represented by the VPI_IMAGE_FORMAT_U8 enum. Images are initialized with zeros upon creation. By passing 0 as the image flags, we state the intent of possibly using the images in all available hardware backends. This makes it easier to submit algorithms to different backends later on, at the cost of using more resources. The output image is created the same way.
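A sketch of the buffer creation described above:

```c
VPIImage input = NULL, output = NULL;

/* 640x480, single channel, unsigned 8-bit pixels; flags=0 enables all available backends. */
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &input);
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &output);
```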
Create a stream to execute the algorithm. Passing 0 as the stream flags allows the user to submit algorithms for execution in any available hardware backend, specified later.
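For instance:

```c
VPIStream stream = NULL;
vpiStreamCreate(0, &stream);   /* flags=0: backend is chosen at algorithm submission time */
```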
Submit the box filter algorithm to the stream, along with the input and output images, and other parameters. In this case, it's a 3x3 box filter with clamp boundary condition. It'll be executed by the CUDA backend.
In general, because of the asynchronous nature of streams, the algorithm is enqueued onto the stream's work thread, and the function returns immediately. Later on it'll be submitted for execution in the actual backend. The use of a work thread allows the program to continue assembling the processing pipeline, or do something else, while the algorithm executes in parallel.
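A sketch of the submission, assuming a VPI release where the backend is selected at submission time; the argument order and the border/boundary enum name may differ between versions:

```c
/* 3x3 box filter with clamp boundary condition, executed by the CUDA backend.
   The call only enqueues the work and returns immediately. */
vpiSubmitBoxFilter(stream, VPI_BACKEND_CUDA, input, output, 3, 3, VPI_BORDER_CLAMP);
```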
Wait until the stream finishes processing.
This function blocks until all algorithms submitted to the stream finish executing. This must be done before the output can be displayed to the user, saved to disk, etc.
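```c
vpiStreamSync(stream);   /* blocks until the box filter has finished writing 'output' */
```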
Destroy created objects.
Upon completion, destroy the created objects to avoid memory leaks. Destroying a stream forces it to synchronize, but destroying images that are still being used by an algorithm leads to undefined behavior, likely resulting in a program crash.
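A sketch of the teardown, following the ordering constraint described above:

```c
vpiStreamDestroy(stream);   /* implicitly synchronizes the stream first */
vpiImageDestroy(input);
vpiImageDestroy(output);
```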
Examining how several VPI objects work together, and inspecting the ownership relationship between objects is a beneficial learning exercise.
A conceptual structure of the provided example is as follows.
Where:
More complex scenarios can be envisioned that take advantage of different acceleration processors on the device and create a pipeline that strives to fully utilize the computational power. To do that, the pipeline must have parallelizable stages.
This next example implements a full stereo disparity estimation and Harris corners extraction pipeline, which presents plenty of parallelization opportunities.
Three parallelization opportunities are identified: the independent left and right image pre-processing stages, and Harris corners extraction. A different backend is chosen for each processing stage, depending on the processing speed of each backend, power requirements, input and output restrictions, and availability. In this example, the whole processing is split among the following backends:
The rationale for this choice of backends is to keep the GPU free for other external processing, such as Deep Learning inference stages. The image format conversion operation is quite fast on CUDA and wouldn't interfere much. The CPU is left to perform Harris keypoint extraction undisturbed.
The following diagram shows how the algorithms are split into streams and how synchronization between streams works.
Both the left and right streams start stereo pair pre-processing, while the keypoints stream waits until the right grayscale image is ready. Once it is, Harris corners detection starts while the right stream continues pre-processing. Once pre-processing on the left stream ends, it waits until the right downscaled image is ready. Finally, stereo disparity estimation starts with its two stereo inputs. The host thread can at any point issue a vpiStreamSync call on both the left and keypoints streams to wait until the disparity and keypoints data are ready for further processing or display.
The code that implements this pipeline is explained as follows.
Create a context and make it active.
Although the default context that is automatically created to manage the VPI state can be used, sometimes it is more convenient to create a context and use it to handle the lifetime of all objects linked to a particular pipeline. In the end, context destruction will trigger destruction of all objects created under it. This also leads to better isolation between this pipeline and others that the application might use.
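For example:

```c
VPIContext ctx = NULL;
vpiContextCreate(0, &ctx);     /* 0: enable every backend available on this platform */
vpiContextSetCurrent(ctx);     /* objects created below are owned by 'ctx' */
```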
Create the streams.
The streams are created with flags 0, meaning that they can handle tasks for all backends.
There are two streams to handle the stereo pair preprocessing, and another for Harris corners detection. After preprocessing is done, stream_left is reused for stereo disparity estimation.
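A sketch of the three stream creations:

```c
VPIStream stream_left = NULL, stream_right = NULL, stream_keypoints = NULL;
vpiStreamCreate(0, &stream_left);
vpiStreamCreate(0, &stream_right);
vpiStreamCreate(0, &stream_keypoints);
```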
Create the input image buffer wrappers.
Assuming that the input comes from a capture pipeline as EGLImage, these can be wrapped into a VPIImage to be used in a VPI pipeline. All it requires is one frame (usually the first) from each stereo input.
Create the image buffers to be used.
Similar to the simple pipeline, here the input images are created empty. In reality these input images must be populated by either wrapping existing memory, or by being the result of an earlier VPI pipeline.
The input is a 640x480 NV12 (color) stereo pair, typically output by camera capture pipelines. The temporary images are needed for storing intermediate results. Stereo disparity and Harris expect grayscale images, hence the format conversion. Moreover, stereo disparity expects its input to be exactly 480x270. This is accomplished by the rescale stage in the diagram above.
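A sketch of the intermediate buffers; the image formats used here are illustrative assumptions, since the formats actually required depend on the algorithms and backends involved:

```c
/* Grayscale intermediates produced by the format conversion stage. */
VPIImage left_gray = NULL, right_gray = NULL;
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &left_gray);
vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &right_gray);

/* Downscaled pair at the 480x270 resolution expected by stereo disparity,
   plus the disparity output. */
VPIImage left_reduced = NULL, right_reduced = NULL, disparity = NULL;
vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U8, 0, &left_reduced);
vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U8, 0, &right_reduced);
vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &disparity);
```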
Define stereo disparity algorithm parameters and create the payload.
Stereo disparity processing requires some temporary data. VPI calls it a payload. In this example, vpiCreateStereoDisparityEstimator is called with all the required parameters so that the internal allocator can decide the size of the temporary data.
Because the temporary data is allocated on a backend device, the payload is tightly coupled to the backend. If the same algorithm is meant to be executed in different backends, or concurrently using the same backend in different streams, it'll require one payload per backend/stream. In this example, the payload is created for execution by the PVA backend.
As for algorithm parameters, the VPI stereo disparity estimator is implemented by a semi-global stereo matching algorithm. The estimator requires the census transform window size, specified as 5, and the maximum disparity levels, specified as 64. For more information, consult Stereo Disparity Estimator.
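A rough sketch of the payload creation; the exact creation arguments and the parameter struct layout differ between VPI releases, so the struct and field names below (windowSize, maxDisparity) are assumptions used only for illustration:

```c
/* Algorithm parameters: 5x5 census transform window, 64 disparity levels
   (field names assumed for illustration). */
VPIStereoDisparityEstimatorParams stereoParams;
stereoParams.windowSize   = 5;
stereoParams.maxDisparity = 64;

/* Payload tied to the PVA backend and to the 480x270 input size
   (argument list assumed from a VPI 1.x-style signature). */
VPIPayload stereoPayload = NULL;
vpiCreateStereoDisparityEstimator(VPI_BACKEND_PVA, 480, 270, VPI_IMAGE_FORMAT_U8,
                                  &stereoParams, &stereoPayload);
```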
Create the image rectification payload and corresponding parameters. Rectification performs lens distortion correction using the Remap algorithm. Here the stereo lens parameters are specified and, because they are different for the left and right lenses, two remap payloads are created. For more details, consult Lens Distortion Correction.
Create output buffers for Harris keypoint detector.
This algorithm receives an image and outputs two arrays, one with the keypoints themselves and another with the score of each keypoint. At most 8192 keypoints are returned, so that must be the array capacity. Keypoints are represented by the VPIKeypoint structure and scores are 32-bit unsigned values. For more information, consult Harris Corner Detector.
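For example, assuming the array types named here exist in the VPIArrayType enum:

```c
VPIArray keypoints = NULL, scores = NULL;

/* Capacity of 8192 elements, matching the maximum number of keypoints returned. */
vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT, 0, &keypoints);
vpiArrayCreate(8192, VPI_ARRAY_TYPE_U32, 0, &scores);
```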
Define Harris detector parameters and create its payload.
Fill the VPIHarrisCornerDetectorParams structure with the required parameters. Refer to the structure documentation for more information about each parameter.
Like stereo disparity, the Harris detector requires a payload. This time only the input size, 640x480, is needed. When using this payload, only inputs of this size are accepted.
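A rough sketch of the parameter setup and payload creation; the VPIHarrisCornerDetectorParams field names and the creation signature are assumptions taken from the VPI 1.x API and may differ in other releases:

```c
/* Field names assumed for illustration; consult the structure documentation. */
VPIHarrisCornerDetectorParams harrisParams;
harrisParams.gradientSize   = 5;
harrisParams.blockSize      = 5;
harrisParams.strengthThresh = 20;
harrisParams.sensitivity    = 0.01f;
harrisParams.minNMSDistance = 8;

/* Payload bound to the CPU backend and the 640x480 input size (signature assumed). */
VPIPayload harrisPayload = NULL;
vpiCreateHarrisCornerDetector(VPI_BACKEND_CPU, 640, 480, &harrisPayload);
```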
Create the events to implement a barrier synchronization.
Events are used for inter-stream synchronization. They are implemented by using VPIEvent. Two barriers are needed: one to wait for the input to Harris corners extraction to be ready, and another for the pre-processed right image.
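For example:

```c
VPIEvent barrier_right_grayscale = NULL, barrier_right_reduced = NULL;
vpiEventCreate(0, &barrier_right_grayscale);
vpiEventCreate(0, &barrier_right_reduced);
```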
Submit the left frame processing stages.
The lens distortion correction, image format conversion and downscaling are submitted to the left stream. Note again that the submit operations are non-blocking and return immediately.
Submit the first few stages of the right frame pre-processing.
The lens distortion correction and image format conversion stages will result in the grayscale image that will be input to Harris corner extraction.
Record the right stream state so that keypoints stream can synchronize to it.
The keypoints stream can only start after its input is ready. For that, the barrier_right_grayscale event must record the right stream state by submitting a task to it that will signal the event right after the format conversion finishes.
Finish the right frame pre-processing with a downscale operation.
Record the right stream state so that left stream can synchronize to it.
With the whole right preprocessing submitted, the stream state must be recorded again so that the left stream can wait until the right frame is ready.
Make left stream wait until the right frame is ready.
Stereo disparity requires the left and right frames to be ready. vpiStreamWaitFor is used to submit a task to the left stream that will wait until the barrier_right_reduced event is signaled on the right stream, meaning that the right frame preprocessing is finished.
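Putting the barrier pattern together, a sketch of how the records and waits line up (the individual algorithm submissions are elided):

```c
/* After the format conversion is submitted to stream_right: */
vpiEventRecord(barrier_right_grayscale, stream_right);   /* input to Harris is ready here   */

/* ... submit the downscale operation to stream_right ... */
vpiEventRecord(barrier_right_reduced, stream_right);     /* right frame fully pre-processed */

/* Left stream waits (asynchronously) until the right frame is ready,
   then stereo disparity can be submitted to it. */
vpiStreamWaitFor(stream_left, barrier_right_reduced);

/* Keypoints stream waits until the grayscale right image is ready,
   then Harris corner detection can be submitted to it. */
vpiStreamWaitFor(stream_keypoints, barrier_right_grayscale);
```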
Submit the stereo disparity algorithm.
At this point the input images are ready. Call vpiSubmitStereoDisparityEstimator to submit the disparity estimator.
Submit the keypoint detector pipeline.
For keypoint detection, first submit a wait operation on the barrier_right_grayscale event so that the stream waits until the input is ready. Then submit the Harris corners detector on it.
Synchronize the streams to use the disparity map and keypoints detected.
Remember that the functions called so far in processing phase are all asynchronous; they return immediately once the job is queued on the stream for later execution.
Now, more processing can be performed on the main thread, such as updating some GUI status or showing the previous frame. This occurs while VPI is executing the pipeline. Once this additional processing is performed, synchronize the streams that are processing the final result from current frame using vpiStreamSync. Once completed, the resulting buffers can be accessed.
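For example:

```c
/* Block until the disparity map and the keypoint arrays are ready. */
vpiStreamSync(stream_left);        /* stereo disparity result     */
vpiStreamSync(stream_keypoints);   /* keypoints and scores arrays */
```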
Fetch the next frame and update the input wrappers.
The existing input VPI image wrappers can be redefined to wrap the next stereo pair frames, provided that their dimensions and format are the same. This is done quite efficiently, without heap memory allocations.
Context destruction.
In this example, many objects were created under the current context. Once all processing is completed and the pipeline is no longer required, destroy the context. All streams will be synchronized and destroyed, along with all other objects used. No memory leaks are possible.
Destroying the current context reactivates the context that was active just before it was made current.
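The teardown then reduces to a single call on the context created at the start of this example:

```c
/* Destroys every stream, buffer, payload and event created under 'ctx'. */
vpiContextDestroy(ctx);
```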
Important takeaways from these examples: