VPI - Vision Programming Interface

1.0 Release

Architecture

Overview

VPI is a software library that provides a collection of computer vision and image processing algorithms that can be seamlessly executed on a variety of hardware accelerators, called backends.

The goal of VPI is to provide a uniform interface to the computing backends while maintaining high performance. This is achieved by exposing a thin and performant software abstraction of the underlying hardware and the data it manipulates.

This diagram illustrates the architecture of VPI:

The API follows a paradigm in which object allocation and setup take place in an initialization phase. Following is the application loop, where the main processing occurs, using the objects created during initialization. When main processing is complete the created objects are destroyed and the environment is cleaned up. In embedded, resource-constrained environments, where memory allocations are limited in both time and space, the control over memory allocation and lifetime provided by VPI is beneficial.

The core components of VPI include:

  • Algorithms: Represent an indivisible compute operation.
  • Backends: Represent hardware engines responsible for actual computation.
  • Streams: Act as an asynchronous queue to which algorithms are submitted and ultimately executed sequentially on a given backend. Streams and events are the building blocks of computing pipelines.
  • Buffers: Store input and output data.
  • Events: Provide synchronization primitives among streams and/or the application thread.
  • Contexts: Hold the state of VPI and created objects.

Algorithms

Algorithms represent the actual computing. They act on one or more input buffers and write their results to output buffers provided by the user. They run asynchronously with respect to the application thread. For a list of supported algorithms, refer to the Algorithms section.

There are two classes of algorithms:

  • Algorithms that require a payload.
  • Payload-less algorithms.

Algorithm Payload

Some algorithm implementations such as FFT or KLT Feature Tracker require temporary resources to function properly. These resources are encapsulated by a VPIPayload object associated with the algorithm.

At initialization time, the user is required to create the corresponding payload, passing the parameters used to allocate the temporary resources, along with the backends that may execute the algorithm. During the main loop, where computation is actually performed, an algorithm instance is submitted to a stream for execution. The corresponding payload is sent along with the input and output parameters. The payload can be reused in multiple algorithm instances, but the user must guarantee that the payload isn't used concurrently by different instances.

Once the payload isn't needed anymore, it must be destroyed by calling vpiPayloadDestroy. This will deallocate any resources it encapsulates.

Examples:

  • FFT payload creation to be used by CUDA backend, to be done only once.
    VPIPayload fft;
    vpiCreateFFT(VPI_BACKEND_CUDA, width, height, VPI_IMAGE_FORMAT_F32, VPI_IMAGE_FORMAT_2F32, &fft);
  • FFT algorithm submission to a stream, called as many times as necessary, possibly with different inputs and outputs.
    vpiSubmitFFT(stream, VPI_BACKEND_CUDA, fft, inputF32, spectrum, 0);
  • Payload destruction, to be executed once processing is done and payload isn't needed anymore.
    vpiPayloadDestroy(fft);

Payload-Less Algorithms

Not all algorithms require temporary resources. This is the case for Box Filter and Rescale, among others. For these algorithms, the API is simplified and no payload handling is necessary. All required data is sent during algorithm submission.

Example:

  • Box filter algorithm submission to CUDA backend, to be called as many times as necessary.
    vpiSubmitBoxFilter(stream, VPI_BACKEND_CUDA, input, output, 5, 5, VPI_BORDER_ZERO);

Backends

Every algorithm supported by VPI is implemented in one or more backends. Different implementations of the same algorithm return similar results when given the same inputs, but small variations between their results may occur, mostly due to optimizations tailored to a particular backend, such as use of fixed-point instead of floating-point math.

CPU

This backend represents the device's CPU. It may create a set of background worker threads and data structures supporting efficient parallel execution across multiple cores. These worker threads might be shared among different streams and/or context instances.

VPI provides mechanisms to allow the user to define their own CPU task scheduling scheme by calling vpiContextSetParallelFor with a user-provided VPIParallelForCallback function that VPI calls when CPU tasks need to be executed.

CUDA

The CUDA backend has an explicit affinity, defined during construction of a stream, with a particular CUDA-enabled GPU. This means that algorithms submitted for execution on this stream will be handled by this GPU.

The CUDA backend manages a cudaStream_t handle and other CUDA device information that allows it to launch the underlying CUDA kernels.

VPI takes advantage of the asynchronous nature of CUDA kernel launches to optimize their launching. In some situations, especially when no user-defined functions have been submitted to the stream, the backend launches the CUDA task directly from the caller thread, bypassing the worker thread entirely.

When only CUDA algorithms are involved, VPI generally acts as an efficient thin layer on top of the CUDA SDK.

The user must properly set up the CUDA context for the calling thread before the API context is constructed. The resulting context object uses the corresponding CUDA context for internal kernel calls.

Note
Use of multiple GPUs is not currently supported, and may cause undefined behavior. Any results cannot be relied upon.

PVA

The Programmable Vision Accelerator (PVA) is a processor in Jetson AGX Xavier and Jetson Xavier NX devices that is specialized for image processing and computer vision algorithms.

Use the PVA backend when you need to leave the GPU free to run other tasks that only it can perform, such as deep learning inference stages and algorithms only implemented on CUDA backend.

PVA hardware is much more power efficient than CPU and CUDA hardware. Thus you should use the PVA backend wherever possible when power is at a premium.

Each Jetson AGX Xavier or Jetson Xavier NX device comprises two PVA processors, each one comprising two vector processors. Thus the device can execute at most four independent PVA tasks concurrently.

When multiple VPI streams have the PVA backend enabled, they each choose one available PVA vector processor in round-robin succession.
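
As a rough illustration of the round-robin assignment described above, the following sketch hands out four vector processor indices in turn. It is a hypothetical model rather than VPI code: the PvaAllocator type and the pva_assign function are invented for this example, and the sketch ignores whether a processor is currently available.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not VPI code: two PVA processors with two vector
   processors each give four slots, handed out in round-robin order. */
enum { PVA_VECTOR_PROCESSORS = 4 };

typedef struct {
    int next; /* index of the vector processor the next stream receives */
} PvaAllocator;

/* Returns the vector processor index assigned to a newly created stream. */
static int pva_assign(PvaAllocator *alloc)
{
    assert(alloc != NULL);
    int chosen = alloc->next;
    alloc->next = (alloc->next + 1) % PVA_VECTOR_PROCESSORS;
    return chosen;
}
```

In this model, a fifth stream wraps around and shares the first vector processor with the first stream.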

Note
A PVA backend is not necessarily faster than a CUDA or CPU backend for any particular algorithm.

VIC

The Video Image Compositor (VIC) is a fixed-functionality processor in Jetson devices that is specialized for low-level image processing tasks, such as rescaling, color space conversion, noise reduction, and compositing.

Like a PVA backend, a VIC backend allows you to offload tasks from the GPU, leaving it free for other processing, when performance isn't at a premium.

Streams

The VPIStream object is the main entry point to the API. It is loosely based on CUDA's cudaStream_t. This object represents a FIFO command queue which stores a list of commands to be executed by some backend. The commands may run a particular CV algorithm, perform a host function (using vpiSubmitHostFunction), or signal an event.

At creation time, a stream is configured to use the backends that are to execute the tasks submitted to it. By default, it uses the backends enabled by the current context. You can set flags when you create it to further limit the number of available backends and reduce resource usage.

Each stream launches an internal worker thread that implements a task queue to handle asynchronous task execution. The task queue is not defined when the thread is created, but is usually defined when the stream is created, and exists until it is destroyed.

When you invoke a CV function on a particular backend, the function pushes a corresponding command to the VPIStream worker thread and immediately returns. The queued commands are dispatched to the hardware backend assigned to them for execution. This allows API functions to be executed asynchronously with respect to the calling thread.
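
The submit-then-execute behavior of a FIFO command queue can be sketched in plain C. This is a single-threaded, hypothetical model and not the VPI implementation: CommandQueue, queue_submit, queue_drain, and trace_digit are names invented for the example, the capacity of 8 is arbitrary, and a full queue returns an error here instead of blocking.

```c
#include <assert.h>
#include <stddef.h>

enum { QUEUE_CAPACITY = 8 }; /* arbitrary size chosen for the sketch */

typedef void (*CommandFn)(void *arg);

typedef struct {
    CommandFn fn;
    void     *arg;
} Command;

typedef struct {
    Command items[QUEUE_CAPACITY];
    size_t  head;  /* index of the oldest queued command */
    size_t  count; /* number of queued commands */
} CommandQueue;

/* Enqueue a command and return at once; nothing is executed here.
   Returns 0 on success, -1 when the queue is full. */
static int queue_submit(CommandQueue *q, CommandFn fn, void *arg)
{
    assert(q != NULL && fn != NULL);
    if (q->count == QUEUE_CAPACITY)
        return -1;
    size_t tail = (q->head + q->count) % QUEUE_CAPACITY;
    q->items[tail].fn  = fn;
    q->items[tail].arg = arg;
    q->count++;
    return 0;
}

/* Run every queued command in submission order, as a worker would,
   and return how many commands were executed. */
static size_t queue_drain(CommandQueue *q)
{
    assert(q != NULL);
    size_t executed = 0;
    while (q->count > 0) {
        Command cmd = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        cmd.fn(cmd.arg);
        executed++;
    }
    return executed;
}

/* Example command: appends a digit to a running decimal number so that
   the execution order is visible in the result. */
typedef struct {
    int *trace;
    int  digit;
} TraceArg;

static void trace_digit(void *arg)
{
    TraceArg *t = (TraceArg *)arg;
    *t->trace = *t->trace * 10 + t->digit;
}
```

Submitting trace_digit with the digits 1, 2, and 3 leaves the trace untouched until queue_drain runs, at which point the commands execute in the order they were submitted.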

For more information, see Stream in the "API Reference" section of VPI - Vision Programming Interface.

Buffers

Buffers represent the data that VPI algorithms work with. VPI supports abstractions for three kinds of data:

  • Image: Stores 2-dimensional data
  • Array: Stores 1-dimensional data
  • Pyramid: Holds an array of images with varying amounts of detail, from fine to coarse.

VPI can allocate all three types of buffers. For images and arrays, it can also wrap data that already resides in pre-allocated memory into a VPI buffer. This is useful when an application requires interoperability with libraries other than VPI, as when it uses an OpenCV cv::Mat buffer as input to a VPI algorithm.

All buffer types share the attributes of size and element type.

Images

VPI images represent any kind of 2D data, such as actual images, vector fields embedded in a 2D space, and 2D heat maps.

VPI images are characterized by their size (width and height) and format.

When an application creates a VPIImage object, it passes flags that specify which backends the image can work with. You can set the flags with one of the VPIBackend enums, or with two or more OR'ed together. When no backend flags are passed, VPI enables all of the backends allowed by the current context, which by default is all available backends.
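
The flag handling described above amounts to simple bitmask arithmetic. The sketch below is a hypothetical model: the BACKEND_* values are placeholders rather than the real VPIBackend enumerators, resolve_backends is not a VPI function, and rejecting a backend that the context does not allow is an assumption carried over from the stream creation behavior described under Contexts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; the real VPIBackend
   enumerators are defined in the VPI headers. */
enum {
    BACKEND_CPU  = 1 << 0,
    BACKEND_CUDA = 1 << 1,
    BACKEND_PVA  = 1 << 2,
    BACKEND_VIC  = 1 << 3
};

/* Computes the set of backends an object ends up enabled for.
   requested == 0 models "no backend flags passed".
   Returns 0 on success, -1 if a requested backend is not allowed by the
   context (an assumption of this sketch). */
static int resolve_backends(uint32_t requested, uint32_t context_mask, uint32_t *enabled)
{
    assert(enabled != NULL);
    if (requested == 0) {
        *enabled = context_mask; /* fall back to everything the context allows */
        return 0;
    }
    if ((requested & ~context_mask) != 0)
        return -1;
    *enabled = requested;
    return 0;
}
```

Enabling only the backends that are actually needed keeps resource usage down, which is the trade-off the flags exist to control.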

For more information, see Image in the "API Reference" section of VPI - Vision Programming Interface.

Locking

To access image data from your application, you must first lock the image with the vpiImageLock function. This operation guarantees that all changes made to the memory are committed and made available to CPU memory.

This operation requires that the image have the CPU backend enabled. vpiImageLock fills the VPIImageData object with image information that allows you to address and interpret all image pixels properly. When you are done working on the image data on the host, you must call vpiImageUnlock.

When an image is locked, it can't be accessed by an algorithm running asynchronously. It can, however, be locked recursively by the same thread that locked it in the first place. Just remember to pair each vpiImageLock call with a corresponding vpiImageUnlock.
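
The pairing rule for recursive locks can be pictured with a small owner-plus-depth counter. This is a hypothetical, single-threaded model and not how VPI implements locking: LockState and its functions are invented for the example, thread identities are plain integers, and a conflicting lock attempt returns an error here, whereas the real API may behave differently.

```c
#include <assert.h>
#include <stddef.h>

enum { NO_OWNER = -1 };

typedef struct {
    int owner; /* id of the thread holding the lock, or NO_OWNER */
    int depth; /* lock calls not yet matched by an unlock */
} LockState;

static void lock_init(LockState *s)
{
    assert(s != NULL);
    s->owner = NO_OWNER;
    s->depth = 0;
}

/* Returns 0 on success, -1 if another thread currently holds the lock. */
static int lock_acquire(LockState *s, int thread_id)
{
    assert(s != NULL && thread_id >= 0);
    if (s->depth > 0 && s->owner != thread_id)
        return -1;
    s->owner = thread_id;
    s->depth++;
    return 0;
}

/* Returns 0 on success, -1 if the caller does not hold the lock. */
static int lock_release(LockState *s, int thread_id)
{
    assert(s != NULL);
    if (s->depth == 0 || s->owner != thread_id)
        return -1;
    if (--s->depth == 0)
        s->owner = NO_OWNER;
    return 0;
}

/* In this model an algorithm may use the image only while it is unlocked. */
static int usable_by_algorithm(const LockState *s)
{
    assert(s != NULL);
    return s->depth == 0;
}
```

Two lock calls from the same thread need two unlock calls before the image becomes usable by algorithms again.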

Image Formats

VPI supports a variety of image formats representing different pixel types such as single-channel 8, 16, or 32-bit, unsigned and signed, multi-channel RGB and RGBA, semi-planar NV12, etc.

The image format is represented by the VPIImageFormat enum. Each format is defined by several attributes, such as color space, number of planes, and data layout. There are functions to extract each component from an image format, as well as to modify an existing one.

Not all algorithms support all recognized image formats. Most offer a choice of several formats, though. The supported formats are described in each algorithm description page, e.g. Bilateral Filter and Separable Convolution.

2D images are most commonly laid out in memory in pitch-linear format, i.e. row by row, one after another. Each row can be larger than necessary to hold the image's data to conform with row address alignment restrictions.
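
Addressing a pixel in a pitch-linear plane only needs the base pointer, the row pitch in bytes, and the pixel size. The helpers below are a generic sketch, not part of VPI; aligned_pitch and pixel_at are names invented here, and the alignment value passed to aligned_pitch is up to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounds a row length in bytes up to the next multiple of alignment. */
static size_t aligned_pitch(size_t row_bytes, size_t alignment)
{
    assert(alignment > 0);
    return (row_bytes + alignment - 1) / alignment * alignment;
}

/* Returns the address of pixel (x, y) in a pitch-linear plane.
   Rows start pitch_bytes apart, which may exceed the bytes a row
   actually needs. */
static uint8_t *pixel_at(uint8_t *base, size_t pitch_bytes, size_t bytes_per_pixel,
                         size_t x, size_t y)
{
    assert(base != NULL);
    return base + y * pitch_bytes + x * bytes_per_pixel;
}
```

For a 100-pixel-wide, 8-bit image with a 64-byte row alignment, aligned_pitch yields 128, so each row carries 28 bytes of padding that must be skipped when stepping from one row to the next.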

You can also create or wrap memory using a proprietary block-linear layout. For some algorithms and backends it can be more efficient to create 2D memories using this format.

For more information, see Image Formats in the "API Reference" section of VPI - Vision Programming Interface.

Wrapping External Memory

You can create images that wrap externally allocated CUDA and CPU memory using the functions vpiImageCreateCUDAMemWrapper and vpiImageCreateHostMemWrapper, respectively. In each case, you must fill a VPIImageData structure with the required information and pass it to the function.

You can also wrap an EGLImage handle using vpiImageCreateEGLImageWrapper, and an NvBuffer using vpiImageCreateNvBufferWrapper.

In all of these cases, the VPIImage object does not own the memory buffer. When the VPIImage is destroyed, the buffer is not deallocated.

Like the function for creating image buffers managed by VPI, these wrapping functions accept flags that define which backends they can be used with.

Arrays

VPI arrays represent 1D data, such as keypoint lists, bounding boxes, and transforms.

Arrays are characterized by their capacity, size, and element type. As with images, the flags are used to specify which backends they can work with.

Array types are drawn from enum VPIArrayType. Algorithms that require arrays for input or output, such as the KLT template tracker, usually accept one specific array type.

VPIArray has a unique feature: while the capacity of an array is fixed for the lifetime of the object, its size can change. Any API that outputs to an array must set the size parameter to the number of valid elements in the array. You can use vpiArrayGetSize and vpiArraySetSize to query and modify the size of an array.
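
The distinction between capacity and size can be modeled with a fixed-storage container. This is a hypothetical stand-in rather than the VPIArray implementation: SketchArray, array_init, array_set_size, and collect_even are invented for the example, and the capacity of 8 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_CAPACITY = 8 }; /* arbitrary capacity chosen for the sketch */

typedef struct {
    int32_t capacity;              /* fixed for the lifetime of the object */
    int32_t size;                  /* number of valid elements, may change */
    int32_t data[SKETCH_CAPACITY];
} SketchArray;

static void array_init(SketchArray *a)
{
    assert(a != NULL);
    a->capacity = SKETCH_CAPACITY;
    a->size     = 0;
}

/* Changes the number of valid elements. Returns 0 on success, -1 when
   the requested size is negative or exceeds the capacity. */
static int array_set_size(SketchArray *a, int32_t new_size)
{
    assert(a != NULL);
    if (new_size < 0 || new_size > a->capacity)
        return -1;
    a->size = new_size;
    return 0;
}

/* Example producer: copies the even values of in[0..n) into out and then
   sets the size to the number of elements written, as any routine that
   outputs to an array is expected to do. Values that do not fit in the
   capacity are dropped. */
static int collect_even(const int32_t *in, int32_t n, SketchArray *out)
{
    assert(in != NULL && out != NULL && n >= 0);
    int32_t written = 0;
    for (int32_t i = 0; i < n && written < out->capacity; ++i) {
        if (in[i] % 2 == 0)
            out->data[written++] = in[i];
    }
    return array_set_size(out, written);
}
```

A consumer then reads only the first size elements and treats the remaining storage as unspecified.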

For more information, see Array in the "API Reference" section of VPI - Vision Programming Interface.

Locking

Array data can be accessed from the host using the vpiArrayLock function. This function works like its image counterpart, including recursive locking by the same thread.

Wrapping External Memory

You can also create arrays that wrap externally allocated CUDA and host memory using the functions vpiArrayCreateCUDAMemWrapper and vpiArrayCreateHostMemWrapper, respectively. In both cases, you must fill a VPIArrayData structure with the required information and pass it to the function.

Pyramids

VPI pyramids represent a collection of VPI images stacked together, all with the same format, but possibly with different dimensions.

A pyramid is characterized by its number of levels, base level dimensions, scale factor, and image format. The scale factor represents the ratio of one level's dimensions to the prior level's dimensions. For instance, when scale=0.5, the pyramid is dyadic, i.e., each level has half the width and height of the level before it.
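
The effect of the scale factor on level dimensions can be worked out with a few lines of arithmetic. The level_dim helper below is invented for illustration; rounding to the nearest integer and clamping to at least one pixel are assumptions of this sketch, and VPI may round differently.

```c
#include <assert.h>
#include <stdint.h>

/* Dimension (width or height) of pyramid level `level`, where level 0 is
   the base. Rounding to nearest and the minimum of 1 pixel are
   assumptions of this sketch. */
static int32_t level_dim(int32_t base, float scale, int level)
{
    assert(base > 0 && scale > 0.0f && level >= 0);
    float dim = (float)base;
    for (int i = 0; i < level; ++i)
        dim *= scale;
    int32_t rounded = (int32_t)(dim + 0.5f);
    return rounded < 1 ? 1 : rounded;
}
```

With a 640x480 base and scale=0.5, levels 0 through 3 come out as 640x480, 320x240, 160x120, and 80x60.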

Often it's necessary to process one pyramid level as the input or output of a VPI algorithm. Then you must use vpiImageCreatePyramidLevelWrapper to specify the pyramid and which pyramid level is to be wrapped. The resulting image inherits the pyramid's enabled backends. You can use the returned VPIImage handle like any other image. When you are done using the image, you must destroy it with vpiImageDestroy.

For more information, see Pyramid in the "API Reference" section of VPI - Vision Programming Interface.

Locking

As with images and arrays, you can access the whole pyramid from the host using the function vpiPyramidLock, provided that the pyramid is enabled for the CPU backend. This function fills a VPIPyramidData structure that contains an array of VPIImageData. When you are done using the VPIPyramidData, call vpiPyramidUnlock to unmap the pyramid from the host and free its resources.

Recursive locking works for pyramids just as it does for images and arrays.

Events

Each compute function in the API is executed asynchronously with respect to the calling thread; that is, it returns immediately rather than waiting for the operation to complete. There are two ways to synchronize the operation with the backend.

One method is to wait until all of the commands in the VPIStream queue are finished by calling vpiStreamSync. This method is simple, but it can't provide synchronization that is fine-grained (e.g., "wait until function X is completed") or inter-stream (e.g., "wait until function C in stream D completes before running function A in stream B").

The other method provides more flexible synchronization by using VPIEvent objects. These objects are conceptually like binary semaphores, and are designed to closely mimic events in the CUDA API:

  • You can capture all commands submitted to a VPIStream instance in an event instance (see vpiEventRecord). The event is signaled when all captured commands have been processed and removed from the VPIStream command queue.
  • You can perform inter-stream synchronization with the vpiStreamWaitEvent call, which pushes a command to the VPIStream queue that blocks processing of future queued commands until the given event is signaled.
  • The application can query the event's state with vpiEventQuery.
  • Application threads can block until the event is completed with vpiEventSync.
  • Events can be timestamped when completed.
  • You can compute the difference between timestamps on completed events in the same stream as well as between different streams.

For more information, see Event in the "API Reference" section of VPI - Vision Programming Interface.

Contexts

A context encapsulates all resources used by VPI to perform operations. It automatically cleans up these resources when the context is destroyed.

Every application CPU thread has an active context. Each context owns the VPI objects created while it is active.

By default, all application threads are associated with the same global context, which is created automatically by VPI when the first VPI resource is created. You do not need to perform any explicit context management in this case; everything is handled by VPI under the hood.

When finer control of contexts is needed, user-created contexts are an option. Once created, the context can be pushed to the current application thread's context stack, or replace the current context. Both actions make the created context active. Refer to Context Stack for more information on how to manipulate contexts.

Upon context creation, the user can specify several properties associated with it, such as which backends are supported by the objects created while the context is active. This effectively allows you to mask support for a particular backend. For example, stream creation for a CUDA backend fails if the current context doesn't have the VPI_BACKEND_CUDA flag set. When the user doesn't pass backend flags, VPI inspects the running platform and enables the backends associated with all available hardware engines.

Note
The CPU backend cannot be masked out, and must always be supported as a fallback implementation.

Another use of contexts is to isolate independent pipelines from one another.

Objects (buffers, payloads, events, etc.) cannot be shared among different contexts.

There is no limit to the number of created contexts except available memory.

You can manipulate the current context if needed.

For more information, see Context in the "API Reference" section of VPI - Vision Programming Interface.

Global Context

By default, VPI creates a single global context before it creates any VPI objects. This global context is initially shared among all application threads, and cannot be destroyed by the user.

For most use cases, an application can use the global context. When an application requires finer control of how objects are grouped together, or it needs a level of independence between pipelines, you may want to create and manipulate contexts explicitly.

Context Stack

Each application thread has a context stack not shared with other threads.

The top context in the stack is the current context for that thread.

By default, the context stack has one context in it, the global context. Consequently, all new threads have the same global context set as their current context.

Making a context current in a given stack amounts to replacing the top context, either the global context or the most recently pushed context, with the given context. The replaced context does not belong to the stack anymore.

However, pushing a context into a stack does not replace anything. The top context is kept in the stack and the new pushed context is put at the top, thereby becoming the new current context.

The user can push and pop contexts from the stack at will. This allows for temporarily creating pipelines in a new context without disturbing the existing context.

To avoid leakage, it is important to match the number of pushes and pops on a given context stack. Be aware that the context stack can have at most 8 contexts in it.
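
The difference between making a context current and pushing one can be traced with a small stack model. This is a hypothetical sketch, not the VPI implementation: contexts are represented by integer ids, ContextStack and its functions are invented for the example, and refusing to pop the last remaining entry is an assumption made here to keep the model simple.

```c
#include <assert.h>
#include <stddef.h>

enum { STACK_MAX = 8, GLOBAL_CONTEXT = 0 };

typedef struct {
    int items[STACK_MAX];
    int depth; /* number of contexts currently on the stack */
} ContextStack;

/* A new thread starts with only the global context on its stack. */
static void stack_init(ContextStack *s)
{
    assert(s != NULL);
    s->items[0] = GLOBAL_CONTEXT;
    s->depth    = 1;
}

/* The top of the stack is the current context. */
static int stack_current(const ContextStack *s)
{
    assert(s != NULL && s->depth > 0);
    return s->items[s->depth - 1];
}

/* Making a context current replaces the top entry; the depth is unchanged. */
static void stack_set_current(ContextStack *s, int ctx)
{
    assert(s != NULL && s->depth > 0);
    s->items[s->depth - 1] = ctx;
}

/* Pushing keeps the old top and adds a new one. Returns -1 when full. */
static int stack_push(ContextStack *s, int ctx)
{
    assert(s != NULL);
    if (s->depth == STACK_MAX)
        return -1;
    s->items[s->depth++] = ctx;
    return 0;
}

/* Popping removes the top and restores the previous context as current.
   Returns -1 if only one entry remains (an assumption of this sketch). */
static int stack_pop(ContextStack *s, int *popped)
{
    assert(s != NULL && popped != NULL);
    if (s->depth <= 1)
        return -1;
    *popped = s->items[--s->depth];
    return 0;
}
```

Pushing context 7 and then making context 9 current leaves the stack two deep with 9 on top; a single pop returns 9 and restores the global context.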

Thread Safety

All API functions are thread-safe. Concurrent host access to API objects is serialized and executed in an unspecified order. All API calls use a VPIContext instance that is thread-specific and is stored in Thread Local Storage (TLS). If the context pointer for the current thread is NULL (no context is set), all API calls use a default "global" context created during library initialization.

API objects have no concept of thread affinity; that is, if several threads use the same context instance, an object created in one thread can safely be destroyed by another thread.

Most of the API functions are non-blocking. The functions that can block when called are vpiStreamSync, vpiStreamDestroy, vpiContextDestroy, vpiEventSync and the several vpiSubmit* functions (which block when the stream command queue is full). Since implicit synchronization in the API implementation is minimal, you must ensure that the resulting order of dependent function calls is legal.

Pipeline examples, and how to implement them using VPI, are explained in the following sections.

Simple Pipeline

In this example, a pipeline with a simple box filter operation is implemented to process an input image. This is quite similar to the Image Blurring tutorial.

The code for implementing the pipeline is as follows.

Note
For simplicity, function return values for errors are not checked. Consult the bundled samples for examples of simple but complete applications.
  1. Include the necessary headers. This example needs headers for image buffers, a stream, and the Box Filter algorithm.

    #include <vpi/Image.h>
    #include <vpi/Stream.h>
    #include <vpi/algo/BoxFilter.h>
  2. Create the image buffers to be used.

    int main()
    {
    VPIImage input, output;
    vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &input);
    vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &output);

    The example creates a 640x480 1-channel (grayscale) input image with unsigned 8-bit pixel elements, which are represented by VPI_IMAGE_FORMAT_U8 enum. VPI initializes images with zeros upon creation. Pass all-zero image flags to indicate that this image may be used in all available hardware backends. This makes it easier to submit algorithms to different backends later on, at the cost of using more resources. The output image is created the same way.

    Note
    This example creates an empty input image buffer, but in a real use case an existing memory buffer could be wrapped into a VPI image buffer, or an image from an earlier pipeline stage could be used. See the Image Blurring tutorial for a more complete example.
  3. Create a stream to execute the algorithm. Pass all-zero stream flags to indicate that the algorithm may be executed in any available hardware backend, to be specified later.

    VPIStream stream;
    vpiStreamCreate(0, &stream);
  4. Submit the box filter algorithm to the stream, along with the input and output images and other parameters. In this case, the filter algorithm is a 3x3 box filter with clamp boundary condition. It is to be executed by the CUDA backend.

    vpiSubmitBoxFilter(stream, VPI_BACKEND_CUDA, input, output, 3, 3, VPI_BORDER_CLAMP);

    In general, because of the asynchronous nature of streams, the algorithm is enqueued on the stream's work thread, and the function returns immediately. Later it is submitted for execution in the actual backend. Using a work thread allows the program to continue assembling the processing pipeline, or do some other task, while the algorithm executes in parallel.

  5. Wait until the stream finishes processing.

    vpiStreamSync(stream);

    This function blocks until all algorithms submitted to the stream finish executing. The pipeline must do this before it can show output to the user, save it to disk, etc.

  6. Destroy created objects.

    vpiStreamDestroy(stream);
    vpiImageDestroy(input);
    vpiImageDestroy(output);
    return 0;
    }

    When the pipeline finishes using the created objects, it destroys them to prevent memory leaks. Destroying a stream forces it to synchronize, but destroying an image that is still being used by an algorithm leads to undefined behavior, most likely resulting in a program crash.

As a useful learning exercise, NVIDIA recommends examining how several VPI objects work together, and inspecting the ownership relationship between objects.

This is a conceptual structure of the provided example:

Where:

  • Default Context is the context that is automatically created and made active. In this example, the default context owns the stream and the image buffers.
  • Stream owns a worker thread that queues and dispatches tasks to the backend devices and handles synchronization. It also owns objects that represent the actual hardware backends where algorithms are eventually executed.
  • Box Filter is the algorithm that is submitted to the stream. Internally Job 1 is created with the algorithm kernel and all its parameters. It is then enqueued on the work thread, which submits it to the actual hardware when all previous tasks submitted to it are completed. Since this algorithm doesn't have a payload (or state), there are no concerns about its lifetime.
  • Sync represents the vpiStreamSync call. It enqueues Job 2 onto the work thread, and the job signals an internal event when it is executed. The calling thread waits until the event is signaled, guaranteeing that all tasks queued so far have finished. Submissions by other threads are blocked until vpiStreamSync returns.
Note
In this particular example, because the work thread's queue is empty when the algorithm is submitted and CUDA kernel execution is asynchronous, VPI submits the algorithm directly to the CUDA device, bypassing the work thread altogether.
When only the CUDA backend is used for algorithm submission and synchronization, VPI overhead on top of the underlying CUDA execution is generally minimized, and is often negligible. The same is true of streams that use only one of the other backends. Submitting algorithms to different backends in the same stream incurs a small internal synchronization overhead.

Complex Pipeline

More complex scenarios may take advantage of different acceleration processors on the device and create a pipeline that strives to utilize its full computational power. To do that, the pipeline must have parallelizable stages.

The next example implements a full stereo disparity estimation and Harris corners extraction pipeline, which presents plenty of opportunities for parallelization.

The diagram reveals three stage parallelization opportunities: the independent left and right image pre-processing, and the Harris corners extraction. The pipeline uses a different backend for each processing stage, depending on each backend's processing speed, power requirements, input and output restrictions, and availability. In this example, processing is split among the following backends:

  • VIC: Does stereo pair rectification and downscaling.
  • CUDA: Does image format conversion.
  • PVA: Does stereo disparity calculation.
  • CPU: Handles some pre-processing and extraction of Harris corners.

This choice of backends keeps the GPU free for processing other tasks, such as deep learning inference stages. The image format conversion operation is quite fast on CUDA, and will not interfere much. The CPU is kept busy extracting Harris keypoints undisturbed.

The following diagram shows how the algorithms are split into streams and how synchronization between streams works.

Both stream left and stream right start stereo pair preprocessing, while the keypoints stream waits until the right grayscale image is ready. Once it is, Harris corners detection starts while stream right continues pre-processing. When pre-processing ends on the left stream, the stream waits until the right downscaled image is ready. Finally, stereo disparity estimation starts with its two stereo inputs. At any point the host thread can issue a vpiStreamSync call on both the left and keypoints streams to wait until the disparity and keypoints data are ready for further processing or display.

This outline explains the code that implements this pipeline:

  1. Include headers for all the objects used, as well as all of the required algorithms.
    #include <string.h>
    #include <vpi/Array.h>
    #include <vpi/Context.h>
    #include <vpi/EGLInterop.h>
    #include <vpi/Event.h>
    #include <vpi/Image.h>
    #include <vpi/LensDistortionModels.h>
    #include <vpi/Stream.h>
    #include <vpi/WarpMap.h>
    #include <vpi/algo/BilateralFilter.h>
    #include <vpi/algo/ConvertImageFormat.h>
    #include <vpi/algo/HarrisCorners.h>
    #include <vpi/algo/Remap.h>
    #include <vpi/algo/Rescale.h>
    #include <vpi/algo/StereoDisparity.h>
  2. Execute the initialization phase, where all the required objects are created.
    1. Create a context and make it active.

      Although you can use the default context that is created automatically to manage the VPI state, it may be more convenient to create a context and use it to handle all objects linked to a particular pipeline throughout their lifetimes. In the end, context destruction triggers destruction of the objects created under it. Using a dedicated context also yields better isolation between this pipeline and others that the application may use.

      int main()
      {
      VPIContext ctx;
      vpiContextCreate(0, &ctx);
      vpiContextSetCurrent(ctx);
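The snippets in this outline ignore the status value that each VPI call returns, which keeps them short but hides failures. A minimal sketch of one way to check every call is shown below. It uses stand-in names only: Status, CHECK_STATUS, fake_create and init_pipeline are hypothetical, and the sketch assumes that zero means success. Real code would compare each VPIStatus against VPI_SUCCESS instead.

```c
#include <stdio.h>

/* Stand-in for a status-returning API. In this sketch 0 means success
   and any other value is an error code. All names are hypothetical. */
typedef int Status;

/* Evaluates `call` once; on failure reports it and returns the status
   from the enclosing function. */
#define CHECK_STATUS(call)                                                 \
    do {                                                                   \
        Status status_ = (call);                                           \
        if (status_ != 0) {                                                \
            fprintf(stderr, "%s failed with status %d\n", #call, status_); \
            return status_;                                                \
        }                                                                  \
    } while (0)

/* Pretend object-creation call that can be forced to fail. */
static Status fake_create(int should_fail)
{
    return should_fail ? 3 : 0;
}

/* Runs two creation steps and stops at the first failure.
   *steps_done counts the steps that completed. */
static Status init_pipeline(int fail_second, int *steps_done)
{
    *steps_done = 0;
    CHECK_STATUS(fake_create(0));
    ++*steps_done;
    CHECK_STATUS(fake_create(fail_second));
    ++*steps_done;
    return 0;
}
```

Returning early is safe in this setting because objects already created stay owned by the context and can be released together when the context is destroyed.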
    2. Create the streams.

      The streams are created with all zero flags, meaning that they can handle tasks for all backends.

      There are two streams to handle stereo pair preprocessing, and a third for Harris corners detection. When preprocessing is finished, stream_left is reused for stereo disparity estimation.

      VPIStream stream_left, stream_right, stream_keypoints;
      vpiStreamCreate(0, &stream_left);
      vpiStreamCreate(0, &stream_right);
      vpiStreamCreate(0, &stream_keypoints);
    3. Create the input image buffer wrappers.

      Assuming that the input comes from a capture pipeline as EGLImage, you can wrap the buffers in a VPIImage to be used in a VPI pipeline. All the pipeline requires is one frame (usually the first) from each stereo input.

      EGLImage eglLeftFrame = /* First frame from left camera */;
      EGLImage eglRightFrame = /* First frame from right camera */;
      VPIImage left, right;
      vpiImageCreateEGLImageWrapper(eglLeftFrame, NULL, 0, &left);
      vpiImageCreateEGLImageWrapper(eglRightFrame, NULL, 0, &right);
    4. Create the image buffers to be used.

      Like the simple pipeline, this pipeline creates the input images empty. In reality these input images must be populated by either wrapping existing memory, or by being the result of an earlier VPI pipeline.

      The input is a 640x480 NV12 (color) stereo pair, typically output by camera capture pipelines. The temporary images are needed for storing intermediate results. The format conversion is necessary because the stereo disparity estimator and Harris corner extractor expect grayscale images. Moreover, stereo disparity expects its input to be exactly 480x270. This is accomplished by the rescale stage in the diagram above.

      VPIImage left_rectified, right_rectified;
      vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_NV12_ER, 0, &left_rectified);
      vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_NV12_ER, 0, &right_rectified);
      VPIImage left_grayscale, right_grayscale;
      vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U16, 0, &left_grayscale);
      vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U16, 0, &right_grayscale);
      VPIImage left_reduced, right_reduced;
      vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &left_reduced);
      vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &right_reduced);
      VPIImage disparity;
      vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &disparity);
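Because the initialization phase fixes every allocation up front, the memory cost of these seven images can be estimated before the pipeline runs. The sketch below is a rough lower bound only: it assumes tightly packed rows, whereas a real allocation may add row padding for alignment. The helper names nv12_bytes, u16_bytes and pipeline_image_bytes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Tightly packed NV12: one 8-bit luma sample per pixel, plus one interleaved
   pair of 8-bit chroma samples for every 2x2 block of pixels. */
static size_t nv12_bytes(size_t width, size_t height)
{
    size_t chroma_w = (width + 1) / 2;
    size_t chroma_h = (height + 1) / 2;
    return width * height + 2 * chroma_w * chroma_h;
}

/* Tightly packed single-plane image with one 16-bit channel. */
static size_t u16_bytes(size_t width, size_t height)
{
    return width * height * sizeof(uint16_t);
}

/* Sum over the seven images created in this step. */
static size_t pipeline_image_bytes(void)
{
    return 2 * nv12_bytes(640, 480)   /* left_rectified, right_rectified */
         + 2 * u16_bytes(640, 480)    /* left_grayscale, right_grayscale */
         + 3 * u16_bytes(480, 270);   /* left_reduced, right_reduced, disparity */
}
```

Under these assumptions the images add up to 2,928,000 bytes, roughly 2.8 MiB, all of it allocated once and reused for every frame.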
    5. Define stereo disparity algorithm parameters and create the payload.

      Stereo disparity processing requires some temporary data, which VPI calls a payload. In this example, vpiCreateStereoDisparityEstimator is called with all of the parameters that the internal allocator requires to size the temporary data.

      Because the temporary data is allocated on a backend device, the payload is tightly coupled to the backend. If the same algorithm is meant to be executed in different backends, or concurrently using the same backend in different streams, it requires a payload for each backend or stream. In this example, the payload is created for the CUDA backend, matching the backend passed to vpiSubmitStereoDisparityEstimator later in the pipeline.

      As for algorithm parameters, the VPI stereo disparity estimator is implemented by a semi-global stereo matching algorithm. The estimator requires the census transform window size, specified as 5, and the maximum disparity levels, specified as 64. For more information, see Stereo Disparity Estimator.

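      VPIStereoDisparityEstimatorParams stereo_params;
      stereo_params.windowSize = 5;
      stereo_params.maxDisparity = 64;
      VPIStereoDisparityEstimatorCreationParams stereo_creation_params;
      vpiInitStereoDisparityEstimatorCreationParams(&stereo_creation_params);
      stereo_creation_params.maxDisparity = stereo_params.maxDisparity;
      VPIPayload stereo;
      vpiCreateStereoDisparityEstimator(VPI_BACKEND_CUDA, 480, 270, VPI_IMAGE_FORMAT_U16, &stereo_creation_params,
      &stereo); /* backend must match the one used at submission */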
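The maximum disparity also bounds how close an object can be and still be matched. For a rectified pair, depth follows Z = f * B / d, where f is the horizontal focal length in pixels, B the baseline and d the disparity in pixels. The sketch below is illustrative only: the function names are hypothetical, and the 0.12 m baseline used in the usage note is an assumed value that does not come from this document. It also assumes the disparity has already been converted to pixel units; how the disparity image encodes its values is not covered here.

```c
/* Depth in metres for a rectified stereo pair: Z = f * B / d.
   focal_px:     horizontal focal length in pixels at the disparity map resolution
   baseline_m:   distance between the two camera centres in metres
   disparity_px: disparity in pixels
   Returns -1.0 when the disparity is not positive (no valid match). */
static double depth_from_disparity(double focal_px, double baseline_m, double disparity_px)
{
    if (disparity_px <= 0.0)
        return -1.0;
    return focal_px * baseline_m / disparity_px;
}

/* Horizontal focal length after resizing an image from src_width to dst_width pixels. */
static double scale_focal(double focal_px, double src_width, double dst_width)
{
    return focal_px * dst_width / src_width;
}
```

With the 466.5 pixel focal length used for the left camera in the next step, downscaling from 640 to 480 columns gives about 349.9 pixels. With the assumed 0.12 m baseline, a disparity of 64 then corresponds to a depth of about 0.66 m, so nearer objects fall outside the search range. Note that the 640x480 to 480x270 rescale changes the aspect ratio, so only the horizontal scale factor applies here.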
    6. Create the image rectification payloads and corresponding parameters. Rectification performs lens distortion correction using the Remap algorithm. The stereo lens parameters are specified here; because they differ between the left and right lenses, two remap payloads are created. For more details, see Lens Distortion Correction.

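      VPIPolynomialLensDistortionModel dist;
      memset(&dist, 0, sizeof(dist));
      dist.k1 = -0.126;
      dist.k2 = 0.004;
      const VPICameraIntrinsic Kleft =
      {
        {466.5, 0, 321.2},
        {0, 466.5, 239.5}
      };
      const VPICameraIntrinsic Kright =
      {
        {466.2, 0, 320.3},
        {0, 466.2, 239.9}
      };
      const VPICameraExtrinsic X =
      {
        {1, 0.0008, -0.0095, 0},
        {-0.0007, 1, 0.0038, 0},
        {0.0095, -0.0038, 0.9999, 0}
      };
      VPIWarpMap map;
      memset(&map, 0, sizeof(map));
      map.grid.numHorizRegions = 1;
      map.grid.numVertRegions = 1;
      map.grid.regionWidth[0] = 640;
      map.grid.regionHeight[0] = 480;
      map.grid.horizInterval[0] = 4;
      map.grid.vertInterval[0] = 4;
      vpiWarpMapAllocData(&map);
      /* Kout is assumed to be Kleft for both lenses so that the rectified pair shares one camera model. */
      VPIPayload ldc_left;
      vpiWarpMapGenerateFromPolynomialLensDistortionModel(Kleft, X, Kleft, &dist, &map);
      vpiCreateRemap(VPI_BACKEND_VIC, &map, &ldc_left);
      VPIPayload ldc_right;
      vpiWarpMapGenerateFromPolynomialLensDistortionModel(Kright, X, Kleft, &dist, &map);
      vpiCreateRemap(VPI_BACKEND_VIC, &map, &ldc_right);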
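To get a feel for what the k1 and k2 coefficients do, the sketch below applies a two-term radial polynomial to a point in normalized camera coordinates and projects it with a 2x3 intrinsic matrix. This is an illustration under stated assumptions, not VPI's implementation: it uses the common convention in which the factor 1 + k1*r^2 + k2*r^4 multiplies the normalized coordinates, and it ignores any further coefficients of the model. Point2 and distort_to_pixel are hypothetical names.

```c
typedef struct {
    double x, y;
} Point2;

/* Maps an undistorted point (x, y) in normalized camera coordinates to the
   pixel where the distorted lens images it.
   K is a 2x3 intrinsic matrix laid out as in this step's Kleft and Kright. */
static Point2 distort_to_pixel(const double K[2][3], double k1, double k2, double x, double y)
{
    double r2 = x * x + y * y;                       /* squared radius */
    double radial = 1.0 + k1 * r2 + k2 * r2 * r2;    /* radial scaling factor */
    double xd = x * radial;
    double yd = y * radial;
    Point2 pixel = {
        K[0][0] * xd + K[0][1] * yd + K[0][2],
        K[1][0] * xd + K[1][1] * yd + K[1][2]
    };
    return pixel;
}
```

With the left camera values from this step (k1 = -0.126, k2 = 0.004), the point (0.5, 0) lands near column 547.2 instead of the 554.45 an ideal lens would give, while the optical axis (0, 0) still maps to the principal point (321.2, 239.5). The negative k1 pulls points toward the image centre, which is the barrel distortion that the remap stage undoes.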
    7. Create output buffers for Harris keypoint detector.

      This algorithm receives an image and outputs two arrays, one with the keypoints themselves and another with the score of each keypoint. At most 8192 keypoints are returned, so both arrays must be created with that capacity. Keypoints are represented by the VPIKeypoint structure and scores by 32-bit unsigned values. For more information, see Harris Corner Detector.

      VPIArray keypoints, scores;
      vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT, 0, &keypoints);
      vpiArrayCreate(8192, VPI_ARRAY_TYPE_U32, 0, &scores);
    8. Define Harris detector parameters and create the detector's payload.

      Fill the VPIHarrisCornerDetectorParams structure with the required parameters. See the structure documentation for more information about each parameter.

      Like stereo disparity, the Harris detector requires a payload. This time only the input size (640x480) is needed. When the pipeline uses this payload, it only accepts inputs of this size.

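      VPIHarrisCornerDetectorParams harris_params;
      harris_params.gradientSize = 5;
      harris_params.blockSize = 5;
      harris_params.strengthThresh = 10;
      harris_params.sensitivity = 0.4f;
      harris_params.minNMSDistance = 8;
      VPIPayload harris;
      vpiCreateHarrisCornerDetector(VPI_BACKEND_CPU, 640, 480, &harris);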
    9. Create events to implement a barrier synchronization.

      Events are used for inter-stream synchronization. They are implemented with VPIEvent. The pipeline needs two barriers: one to wait for the input to Harris corner extraction to be ready, and another for the preprocessed right image.

      VPIEvent barrier_right_grayscale, barrier_right_reduced;
      vpiEventCreate(0, &barrier_right_grayscale);
      vpiEventCreate(0, &barrier_right_reduced);
  3. After initialization comes the main processing phase, which implements the pipeline by submitting algorithms and events to the streams in the correct order. The pipeline's main loop can do this many times using the same events, payloads, temporary buffers, and output buffers. The input is usually redefined for each iteration, as shown below.
    1. Submit the left frame processing stages.

      Lens distortion correction, image format conversion and downscaling are submitted to the left stream. Note again that the submit operations are non-blocking and return immediately.

      vpiSubmitRemap(stream_left, VPI_BACKEND_VIC, ldc_left, left, left_rectified, VPI_INTERP_CATMULL_ROM,
      VPI_BORDER_ZERO, 0);
      vpiSubmitConvertImageFormat(stream_left, VPI_BACKEND_CUDA, left_rectified, left_grayscale, NULL);
      vpiSubmitRescale(stream_left, VPI_BACKEND_VIC, left_grayscale, left_reduced, VPI_INTERP_LINEAR, VPI_BORDER_CLAMP,
      0);
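The remap submission above requests VPI_INTERP_CATMULL_ROM, a cubic interpolation that blends four neighbouring samples. For intuition only, the sketch below evaluates the standard one-dimensional Catmull-Rom spline weights. It is not VPI's implementation, and the function names are hypothetical.

```c
/* Weights of the four neighbouring samples p[0], p[1], p[2], p[3] for a
   position t in [0, 1] between p[1] and p[2], using the standard
   Catmull-Rom spline. The weights always sum to 1. */
static void catmull_rom_weights(double t, double w[4])
{
    double t2 = t * t;
    double t3 = t2 * t;
    w[0] = 0.5 * (-t3 + 2.0 * t2 - t);
    w[1] = 0.5 * (3.0 * t3 - 5.0 * t2 + 2.0);
    w[2] = 0.5 * (-3.0 * t3 + 4.0 * t2 + t);
    w[3] = 0.5 * (t3 - t2);
}

/* Interpolated value at position t between p[1] and p[2]. */
static double catmull_rom_sample(const double p[4], double t)
{
    double w[4];
    catmull_rom_weights(t, w);
    return w[0] * p[0] + w[1] * p[1] + w[2] * p[2] + w[3] * p[3];
}
```

At t = 0 the result is exactly p[1], and on a linear ramp such as {0, 1, 2, 3} the midpoint evaluates to 1.5. The two outer weights are negative for 0 < t < 1, which is why cubic interpolation keeps edges sharper than linear interpolation but can overshoot slightly.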
    2. Submit the first few stages of the right frame pre-processing.

      The lens distortion correction and image format conversion stages will result in a grayscale image for input to Harris corner extraction.

      vpiSubmitRemap(stream_right, VPI_BACKEND_VIC, ldc_right, right, right_rectified, VPI_INTERP_CATMULL_ROM,
      VPI_BORDER_ZERO, 0);
      vpiSubmitConvertImageFormat(stream_right, VPI_BACKEND_CUDA, right_rectified, right_grayscale, NULL);
    3. Record the right stream state so that the keypoints stream can synchronize to it.

      The keypoint stream can only start when its input is ready. First, the barrier_right_grayscale event must record the right stream state by submitting a task to it that will signal the event when format conversion completes.

      vpiEventRecord(barrier_right_grayscale, stream_right);
    4. Finish the right frame preprocessing with a downscale operation.

      vpiSubmitRescale(stream_right, VPI_BACKEND_VIC, right_grayscale, right_reduced, VPI_INTERP_LINEAR, VPI_BORDER_CLAMP,
      0);
    5. Record the right stream state so that the left stream can synchronize to it.

      With the whole of right preprocessing submitted, the stream state must be recorded again so that the left stream can wait until the right frame is ready.

      vpiEventRecord(barrier_right_reduced, stream_right);
    6. Make the left stream wait until the right frame is ready.

      Stereo disparity requires the left and right frames to be ready. The pipeline uses vpiStreamWaitEvent to submit a task to the left stream that will wait until the barrier_right_reduced event is signaled on the right stream, meaning that right frame preprocessing is finished.

      vpiStreamWaitEvent(stream_left, barrier_right_reduced);
    7. Submit the stereo disparity algorithm.

      The input images are now ready. Call vpiSubmitStereoDisparityEstimator to submit the disparity estimator.

      vpiSubmitStereoDisparityEstimator(stream_left, VPI_BACKEND_CUDA, stereo, left_reduced, right_reduced, disparity,
      NULL, &stereo_params);
    8. Submit the keypoint detector pipeline.

      For keypoint detection, first submit a wait operation on the barrier_right_grayscale event to make the keypoints stream wait until the input is ready. Then submit the Harris corner detector to the same stream.

      vpiStreamWaitEvent(stream_keypoints, barrier_right_grayscale);
      vpiSubmitHarrisCornerDetector(stream_keypoints, VPI_BACKEND_CPU, harris, right_grayscale, keypoints, scores,
      &harris_params);
    9. Synchronize the streams to use the disparity map and keypoints detected.

      Remember that the functions called so far in the processing phase are all asynchronous; they return immediately once the job is queued on the stream for later execution.

      While VPI executes the pipeline, the main thread is free to do other work, such as updating GUI status information or displaying the previous frame. Once that work is done, the streams that produce the final results for the current frame must be synchronized using vpiStreamSync. Only then can the resulting buffers be accessed.

      vpiStreamSync(stream_left);
      vpiStreamSync(stream_keypoints);
    10. Fetch the next frame and update the input wrappers.

      The existing input VPI image wrappers can be redefined to wrap the next two stereo pair frames, provided that their dimensions and format are the same. This operation is quite efficient, as it is done without heap memory allocations.

      eglLeftFrame = /* Fetch next frame from left camera */;
      eglRightFrame = /* Fetch next frame from right camera */;
      vpiImageSetWrappedEGLImage(left, eglLeftFrame);
      vpiImageSetWrappedEGLImage(right, eglRightFrame);
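The reason redefining a wrapper can avoid heap allocations is that only the reference to the pixel data changes, while the geometry fixed at creation stays the same. The toy sketch below illustrates that idea. FrameWrapper and frame_wrapper_rebind are hypothetical stand-ins, not VPI types or functions, and the real wrapper also tracks the image format, which is omitted here.

```c
#include <stddef.h>

/* Hypothetical stand-in for an image wrapper: it borrows the pixel memory
   and records the geometry fixed when the wrapper was created. */
typedef struct {
    const void *pixels;
    int width;
    int height;
} FrameWrapper;

/* Points the wrapper at a new frame. Only the pointer changes, so nothing
   is allocated. Returns 0 on success, or -1 if the frame is missing or its
   dimensions differ from the ones the wrapper was created with. */
static int frame_wrapper_rebind(FrameWrapper *wrapper, const void *pixels, int width, int height)
{
    if (pixels == NULL || width != wrapper->width || height != wrapper->height)
        return -1;
    wrapper->pixels = pixels;
    return 0;
}
```

Rejecting a frame with different dimensions mirrors the requirement stated above: payloads and intermediate buffers were sized for one geometry during initialization, so the input must keep matching it.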
  4. Destroy the context.

    This example has created many objects under the current context. Once all processing is completed and the pipeline is no longer needed, destroy the context. All streams are then synchronized and destroyed, along with every other object created under the context, so none of them can be leaked.

    Destroying the current context reactivates the context that was active before the current one became active.

    vpiContextDestroy(ctx);
    return 0;
    }
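The ownership rule behind this step, where destroying the context releases everything created under it, can be illustrated with a toy model. The sketch below is not how VPI is implemented; ToyContext, toy_create and toy_destroy are hypothetical names, and the model simply keeps a linked list of the objects it handed out.

```c
#include <stdlib.h>

/* Toy model of context ownership; none of these names are VPI API. */
typedef struct Node {
    struct Node *next;
    void *object;
} Node;

typedef struct {
    Node *objects;   /* every object created under this context */
} ToyContext;

/* Allocates an object of `size` bytes and registers it with the context.
   Returns NULL if either allocation fails. */
static void *toy_create(ToyContext *ctx, size_t size)
{
    Node *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    node->object = malloc(size);
    if (node->object == NULL) {
        free(node);
        return NULL;
    }
    node->next = ctx->objects;
    ctx->objects = node;
    return node->object;
}

/* Frees every registered object and returns how many were released. */
static int toy_destroy(ToyContext *ctx)
{
    int released = 0;
    while (ctx->objects != NULL) {
        Node *node = ctx->objects;
        ctx->objects = node->next;
        free(node->object);
        free(node);
        ++released;
    }
    return released;
}
```

The caller never frees individual objects in this model, so a single call at shutdown is enough, which is the convenience the dedicated context provides in the pipeline above.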

Important takeaways from these examples:

  • Algorithm submission returns immediately.
  • Algorithm execution occurs asynchronously with respect to the host thread.
  • Buffers can be used by different streams, although race conditions must be avoided through the use of events.
  • A context owns all objects created by one user thread while it is in an active state. This allows some interesting scenarios where one thread sets up the context and triggers the whole processing pipeline, then hands the context over to another thread that waits for the pipeline to end and triggers further data processing.