VPI - Vision Programming Interface

2.0 Release

Architecture

Overview

VPI is a software library that provides a collection of computer vision and image processing algorithms that can be seamlessly executed in a variety of hardware accelerators. These accelerators are called backends.

The goal of VPI is to provide a uniform interface to the computing backends while maintaining high performance. It achieves this by exposing a thin, but effective, software abstraction of the underlying hardware and the data it manipulates.

This diagram illustrates the architecture of VPI:

The API follows a paradigm in which object allocation and setup take place in an initialization phase. Following is the application loop, where the main processing occurs, using the objects created during initialization. When main processing is complete, the created objects are destroyed and the environment is cleaned up. In resource-constrained embedded environments, where memory allocations are limited in both time and space, the control over memory allocation and lifetime that VPI provides is beneficial.

The core components of VPI include:

  • Algorithms: Represent indivisible compute operations.
  • Backends: Represent hardware engines responsible for actual computation.
  • Streams: Act as asynchronous queues to which algorithms are submitted, ultimately for sequential execution on a given backend. Streams and events are building blocks of computing pipelines.
  • Buffers: Store input and output data.
  • Events: Provide synchronization primitives that operate on streams and/or the application thread.
  • Contexts: Hold the state of VPI and created objects.

Supported Platforms

VPI can be used in the following platforms/devices:

  • Jetson AGX Xavier, Jetson Xavier NX
  • Jetson AGX Orin
  • Linux x86_64 with NVIDIA dGPUs starting from Maxwell (sm_50 or newer).
    • Tested with Ubuntu 18.04 and Ubuntu 20.04

Algorithms

Algorithms represent actual computing operations. They act on one or more input buffers and write their results to output buffers provided by the user. They run asynchronously with respect to the application thread. For a list of supported algorithms, see the Algorithms section.

There are two classes of algorithms:

  • Algorithms that require a payload.
  • Payload-less algorithms.

Algorithm Payload

Some algorithm implementations, such as FFT or KLT Feature Tracker, require temporary resources to function properly. These resources are encapsulated by a VPIPayload object associated with the algorithm.

Before you can execute an algorithm, you must create the corresponding payload at initialization time, passing parameters that are used to allocate temporary resources and to specify the backends which may execute the algorithm. In the main loop, where computation is performed, you submit an algorithm instance to a stream for execution, providing the corresponding payload along with input and output parameters. You can reuse the payload in multiple algorithm instances, but you must ensure that the payload is used by only one instance at a time.

When a payload is no longer needed, you must destroy it by calling vpiPayloadDestroy. This function deallocates any resources encapsulated in the payload.

Examples:

  • FFT payload creation to be used by CUDA backend, to be done only once.
    VPIStatus vpiCreateFFT(uint64_t backends, int32_t inputWidth, int32_t inputHeight, const VPIImageFormat inFormat, const VPIImageFormat outFormat, VPIPayload *payload)
    Creates payload for direct Fast Fourier Transform algorithm.
    Identifiers referenced by this example:
      VPI_IMAGE_FORMAT_F32: Single plane with one 32-bit floating point channel.
      VPI_IMAGE_FORMAT_2F32: Single plane with two interleaved 32-bit floating point channels.
      VPIPayload: A handle to an algorithm payload.
      VPI_BACKEND_CUDA: CUDA backend.
  • FFT algorithm submission to a stream, called as many times as necessary, possibly with different inputs and outputs.
    vpiSubmitFFT(stream, VPI_BACKEND_CUDA, fft, inputF32, spectrum, 0);
    VPIStatus vpiSubmitFFT(VPIStream stream, uint64_t backend, VPIPayload payload, VPIImage input, VPIImage output, uint64_t flags)
    Runs the direct Fast Fourier Transform on single image.
  • Payload destruction, to be done when processing is complete and the payload is no longer needed.
    void vpiPayloadDestroy(VPIPayload payload)
    Deallocates the payload object and all associated resources.

Payload-Less Algorithms

Some algorithms do not require temporary resources. Such algorithms include Box Filter and Rescale, among others. For these algorithms no payload handling is necessary, and the sequence of operations is simplified. All the required data is sent during algorithm submission.

Example:

  • Box filter algorithm submission to CUDA backend, to be called as many times as necessary.
    vpiSubmitBoxFilter(stream, VPI_BACKEND_CUDA, input, output, 5, 5, VPI_BORDER_ZERO);
    VPIStatus vpiSubmitBoxFilter(VPIStream stream, uint64_t backend, VPIImage input, VPIImage output, int32_t kernelWidth, int32_t kernelHeight, VPIBorderExtension border)
    Runs a 2D box filter over an image.
    VPI_BORDER_ZERO: All pixels outside the image are considered to be zero.

Backends

Every algorithm supported by VPI is implemented in one or more backends. Different implementations of the same algorithm return similar results when given the same inputs, but small variations between their results may occur. This is mostly due to optimizations tailored to a particular backend, such as use of fixed-point instead of floating-point arithmetic.

CPU

This backend represents the device's CPU. It may create a set of background worker threads and data structures supporting efficient parallel execution across multiple cores. These worker threads might be shared among different streams and/or context instances.

VPI provides mechanisms which allow you to define your own CPU task scheduling scheme by calling vpiContextSetParallelFor with a VPIParallelForCallback function that VPI is to call when CPU tasks need to be executed.

CUDA

The CUDA backend has an explicit affinity with a particular CUDA-enabled GPU, defined during construction of a stream. This means that algorithms submitted for execution on this stream are handled by this GPU.

The CUDA backend manages a cudaStream_t handle and other CUDA device information that allow it to launch the underlying CUDA kernels.

VPI takes advantage of the asynchronous nature of CUDA kernel launches to optimize their launching. In some situations, especially when no user-defined functions have been submitted to the stream, the backend launches a CUDA task directly from the caller thread, bypassing the worker thread entirely.

When only CUDA algorithms are involved, VPI generally acts as an efficient thin layer on top of the CUDA SDK.

You must set up the CUDA context for the calling thread properly before you construct the API context. The resulting context object uses the corresponding CUDA context for internal kernel calls.

Note
Use of multiple GPUs is not currently supported, and may cause undefined behavior. Results obtained using multiple GPUs are not reliable.

PVA

The Programmable Vision Accelerator (PVA) is a processor in NVIDIA® Jetson AGX Xavier™ and NVIDIA® Jetson Xavier™ NX devices that is specialized for image processing and computer vision algorithms.

Use the PVA backend when you need to leave the GPU free to run other tasks that only it can perform, such as deep learning inference stages and algorithms that are implemented only on the CUDA backend.

PVA hardware is much more power-efficient than CPU and CUDA hardware. Therefore, use the PVA backend where possible if power is at a premium.

Each Jetson AGX Xavier or Jetson Xavier NX device comprises two PVA processors, each of which contains two vector processors. Therefore, the device can execute at most four independent PVA tasks concurrently.

When multiple VPI streams have the PVA backend enabled, they each choose one available PVA vector processor in round-robin succession.
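
The round-robin selection described above can be pictured with a small model. This is only an illustrative sketch, not VPI code: the processor labels and the `assign_vpus` helper are hypothetical, and the sketch assumes that each stream simply takes the next vector processor in a fixed cyclic order.

```python
from itertools import cycle

# Two PVA processors with two vector processors (VPUs) each, as described above.
# The (pva, vpu) labels are hypothetical and exist only for this illustration.
VPUS = [(pva, vpu) for pva in range(2) for vpu in range(2)]


def assign_vpus(num_streams):
    """Assign each stream index to a VPU in round-robin order."""
    order = cycle(VPUS)
    return {stream: next(order) for stream in range(num_streams)}


assignment = assign_vpus(6)
for stream, (pva, vpu) in assignment.items():
    print(f"stream {stream} -> PVA {pva}, VPU {vpu}")
```

With six streams, the fifth and sixth wrap around to the vector processors already chosen by the first and second, which is why at most four PVA tasks can be independent at any time.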

Note
A PVA backend is not necessarily faster than a CUDA or CPU backend for any particular algorithm.

VIC

The Video Image Compositor (VIC) is a fixed-functionality processor in Jetson devices that is specialized for low-level image processing tasks, such as rescaling, color space conversion, noise reduction, and compositing.

Like a PVA backend, a VIC backend allows you to offload tasks from the GPU, leaving it free for other processing, if performance is not at a premium.

NVENC

The NVIDIA Encoder Engine (NVENC) is a processor in Jetson devices that is dedicated to video encoding. Some stages of the encoding process can be repurposed to other tasks, such as Dense Optical Flow.

OFA

The NVIDIA Optical Flow Accelerator (OFA) is a specialized processor in the new Jetson AGX Orin devices for calculating the optical flow between images. It is currently used as a backend for the Stereo Disparity Estimator.

Streams

The VPIStream object is the main entry point to the API. It is loosely based on CUDA's cudaStream_t. This object represents a FIFO command queue which stores a list of commands to be executed by some backend. The commands might run a particular computer vision algorithm, perform a host function (using vpiSubmitHostFunction), or signal an event.

At initialization, a stream is configured to use the backends that are to execute the tasks submitted to it. By default, it uses the backends enabled by the current context. When you create a stream you can set flags to further limit the number of available backends and reduce resource usage.

Each stream launches an internal worker thread for dispatching tasks, allowing asynchronous task execution with respect to the calling (user) thread. This means that when an algorithm is submitted to a VPI stream for a particular backend, the submission function pushes a corresponding command to the VPIStream worker thread and immediately returns control to the calling thread.

Tasks pushed to a worker thread aren't processed immediately. They are initially gathered in a staging queue. These tasks are only processed when the stream is flushed. This operation moves all tasks from the staging queue into a processing queue, which eventually submits the tasks to the backends associated with them.

The following events trigger a stream flush:

The staging queue allows for processing pipeline optimization opportunities, such as minimization of memory mapping operations, among others.
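
The two-queue behavior described above can be sketched with a toy model. This is not the VPI implementation: the class `ToyStagedStream` and its methods are hypothetical, and `run_pending` merely stands in for the backends consuming the processing queue.

```python
from collections import deque


class ToyStagedStream:
    """Toy model of a stream's staging and processing queues (not VPI code)."""

    def __init__(self):
        self.staging = deque()     # tasks gathered since the last flush
        self.processing = deque()  # tasks released for execution
        self.executed = []         # record of tasks that have run

    def submit(self, task):
        # Submission only stages the task; nothing runs yet.
        self.staging.append(task)

    def flush(self):
        # Move every staged task, in FIFO order, to the processing queue.
        while self.staging:
            self.processing.append(self.staging.popleft())

    def run_pending(self):
        # Stand-in for the backends consuming the processing queue.
        while self.processing:
            self.executed.append(self.processing.popleft())


stream = ToyStagedStream()
stream.submit("convert")
stream.submit("box_filter")
stream.run_pending()
executed_before_flush = list(stream.executed)  # still empty: nothing was flushed
stream.flush()
stream.run_pending()
print(executed_before_flush, stream.executed)
```

Because staged tasks are visible as a group before they are released, a real implementation has the chance to inspect them together, which is where optimizations such as fewer memory mapping operations become possible.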

For more information, see Stream in the "C API Reference" section of VPI - Vision Programming Interface.

Buffers

Buffers represent the data that VPI algorithms work with. VPI supports abstractions for three kinds of data:

  • Image: Holds 2-dimensional data.
  • Array: Holds 1-dimensional data.
  • Pyramid: Holds an array of images with varying amounts of detail, from fine to coarse.

VPI can allocate all three types of buffers. For images and arrays, it can also wrap pre-allocated memory in a VPI buffer instead of allocating new storage. This is useful when an application requires interoperability with libraries other than VPI, for example when it uses an OpenCV cv::Mat buffer as input to a VPI algorithm.

All buffer types share the attributes of size and element type.

Images

VPI images represent any kind of 2D data, such as actual images, vector fields embedded in a 2D space, and 2D heat maps.

VPI images are characterized by their size (width and height) and format.

When an application creates a VPIImage object, it passes flags that specify which backend the image can work with. You can set the flags with one of the VPIBackend enums, or with two or more enums OR'ed together. When no backend flags are passed, VPI enables all of the backends allowed by the current context, which by default are all available backends.
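
The flag handling described above can be illustrated with plain bit flags. This is a minimal sketch, not the VPI API: the `Backend` class, its numeric values, and the `enabled_backends` helper are assumptions made for the illustration, and the case of requesting a backend that the context does not allow is deliberately left out.

```python
from enum import IntFlag


class Backend(IntFlag):
    """Illustrative backend bits; the numeric values are assumptions."""
    CPU = 1
    CUDA = 2
    PVA = 4
    VIC = 8


def enabled_backends(image_flags, context_backends):
    """Backends an image can work with under the rule described above."""
    if int(image_flags) == 0:
        # No backend flags passed: use everything the context allows.
        return context_backends
    return image_flags


context = Backend.CPU | Backend.CUDA | Backend.PVA | Backend.VIC
flags = Backend.CUDA | Backend.PVA  # two enums OR'ed together

chosen = enabled_backends(flags, context)
print(Backend.CUDA in chosen, Backend.VIC in chosen)

default = enabled_backends(Backend(0), context)
print(default == context)
```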

For more information, see Image in the "C API Reference" section of VPI - Vision Programming Interface.

Image views

VPI image views represent a rectangular region within the 2D data of an existing image (see the description of Images above).

VPI image views are created from an existing VPI image and are characterized by their clip region, i.e. a rectangle defined by a start position (x, y) and a size (width, height). They share the same context and format of the original source image, but their size (width and height) is the size of the rectangular region.

When an application creates a VPIImage object to be an image view, it passes flags that specify the backend the same way it does with regular images.

For more information, refer to the reference documentation of vpiImageCreateView and vpiImageSetView functions.

Locking

To make image contents available for access outside VPI, you must lock the image buffer by calling the vpiImageLockData function. This ensures that all changes made to the memory are committed and made available for access outside VPI.

Depending on the buffer type that is being made available, the image must have certain backends enabled; see the vpiImageLockData documentation for more details. vpiImageLockData fills the VPIImageData object with image information that allows you to address and interpret all image pixels properly. When you are done working on the image data on the host, call vpiImageUnlock.

Images must also be locked when they wrap buffers allocated outside VPI and these buffers are being accessed externally to VPI. In this case, you aren't interested in retrieving the image contents via VPI calls. Instead, call vpiImageLock to lock the image contents. Only then can they be accessed directly via the wrapped buffer. Call vpiImageUnlock to let VPI access the buffers when this is done. Again, while the buffer is locked, streams trying to access its contents will fail with VPI_ERROR_BUFFER_LOCKED.

When an image is locked, it cannot be accessed by an algorithm running asynchronously. It can, however, be locked recursively by the same thread that locked it initially. Remember to pair each vpiImageLockData call with a corresponding vpiImageUnlock.
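
The locking rules above amount to simple bookkeeping, sketched here as a toy model. It is not VPI code: `ToyImage` and `BufferLockedError` are hypothetical names, the model is not itself thread-safe, and it assumes that a lock attempt from a different thread is refused rather than blocked.

```python
import threading


class BufferLockedError(RuntimeError):
    """Stands in for the VPI_ERROR_BUFFER_LOCKED status in this toy model."""


class ToyImage:
    """Toy lock bookkeeping for one image; not thread-safe, not VPI code."""

    def __init__(self):
        self._owner = None  # identifier of the thread holding the lock
        self._depth = 0     # number of lock calls not yet unlocked

    def lock(self):
        me = threading.get_ident()
        if self._depth and self._owner != me:
            # Assumption: a different thread is refused rather than blocked.
            raise BufferLockedError("locked by another thread")
        self._owner = me
        self._depth += 1

    def unlock(self):
        if self._depth == 0 or self._owner != threading.get_ident():
            raise RuntimeError("unlock without a matching lock")
        self._depth -= 1
        if self._depth == 0:
            self._owner = None

    def stream_access(self):
        # An algorithm touching a locked image fails instead of running.
        if self._depth:
            raise BufferLockedError("image is locked by the host")
        return "processed"


image = ToyImage()
image.lock()
image.lock()                 # recursive lock from the same thread succeeds
try:
    image.stream_access()
except BufferLockedError as err:
    print("stream failed:", err)
image.unlock()
image.unlock()               # one unlock per lock
print(image.stream_access())
```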

Image Formats

VPI supports a variety of image formats representing different pixel types, such as single-channel 8, 16, or 32-bit, unsigned and signed, multi-channel RGB and RGBA, semi-planar NV12, etc.

The image format is represented by the VPIImageFormat enum. Each format is defined by several attributes, such as color space, number of planes, and data layout. There are functions to extract each component from the image format, as well as to modify an existing one.

Not all algorithms support all recognized image formats. Most offer a choice of several formats, though. The supported formats are listed in each algorithm's API reference documentation.

2D images are most commonly laid out in memory in pitch-linear format, i.e. row by row, one after another. Each row can be larger than necessary to hold the image's data to conform with row address alignment restrictions.
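
As a worked example of pitch-linear addressing, the sketch below computes a padded row pitch and the byte offset of a pixel. The helper names and the 64-byte alignment are assumptions chosen for illustration; the actual alignment depends on the backend and is reported by VPI when the image is locked.

```python
def aligned_pitch(width, bytes_per_pixel, alignment):
    """Row pitch in bytes: the row size rounded up to a multiple of `alignment`."""
    row_bytes = width * bytes_per_pixel
    return (row_bytes + alignment - 1) // alignment * alignment


def pixel_offset(x, y, pitch, bytes_per_pixel):
    """Byte offset of pixel (x, y) from the start of a pitch-linear plane."""
    return y * pitch + x * bytes_per_pixel


# A 641-pixel-wide U8 image with a hypothetical 64-byte row alignment.
pitch = aligned_pitch(641, 1, 64)
print(pitch)                          # 704
print(pixel_offset(10, 2, pitch, 1))  # 1418
```

Each row occupies 704 bytes even though only 641 hold pixel data, so code that walks the image must advance by the pitch, not by the width.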

You can also create or wrap memory using a proprietary block-linear layout. For some algorithms and backends it can be more efficient to create 2D memories using this format.

For more information, see Image Formats in the "C API Reference" section of VPI - Vision Programming Interface.

Wrapping External Memory

You can create images that wrap externally allocated memory using the function vpiImageCreateWrapper. In each case, you must fill a VPIImageData structure with the required information and pass it to the function. Please see its API reference documentation for information on the memory types that can be wrapped.

In all of these cases, the VPIImage object does not own the memory buffer. When the VPIImage is destroyed, the buffer is not deallocated.

Like the function for creating image buffers managed by VPI, these wrapping functions accept flags that specify which backends they can be used with.

Arrays

VPI arrays represent 1D data, such as keypoint lists, bounding boxes, and transforms.

Arrays are characterized by their capacity, size, and element type. As with images, flags specify which backends they can work with.

Array types are drawn from enum VPIArrayType. Algorithms that require arrays for input or output, such as the KLT template tracker, usually accept one specific array type.

VPIArray's behavior is slightly different from other memory buffers: while the capacity of an array is fixed for the lifetime of the object, its size can change. Any API that writes to an array must set the size parameter to the number of valid elements in the array. You can use vpiArrayGetSize and vpiArraySetSize to query and modify the size of an array.
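
The distinction between capacity and size can be sketched with a toy container. This is not the VPIArray implementation: `ToyArray` and its methods are hypothetical and only model the rule that storage is fixed while the count of valid elements changes.

```python
class ToyArray:
    """Toy model of a fixed-capacity array with a variable size (not VPI code)."""

    def __init__(self, capacity):
        self._storage = [None] * capacity  # allocated once, never resized
        self.size = 0                      # number of valid elements

    @property
    def capacity(self):
        return len(self._storage)

    def write(self, values):
        # A writer fills the storage, then updates the size to match.
        values = list(values)
        if len(values) > self.capacity:
            raise ValueError("more elements than the array capacity")
        self._storage[:len(values)] = values
        self.size = len(values)

    def valid(self):
        # Only the first `size` elements are meaningful.
        return self._storage[:self.size]


keypoints = ToyArray(8)
keypoints.write([(1, 2), (3, 4), (5, 6)])
print(keypoints.size, keypoints.capacity)
keypoints.write([(7, 8)])
print(keypoints.valid(), keypoints.capacity)
```

After the second write, the stale entries are still in storage but fall outside the reported size, so a reader that honors the size never sees them.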

For more information, see Array in the "C API Reference" section of VPI - Vision Programming Interface.

Locking

Array data can be accessed outside VPI using the vpiArrayLockData function. This function works like its image counterpart. It too supports recursive locking by the same thread.

Wrapping External Memory

You can also create arrays that wrap externally allocated CUDA and host memory using the function vpiArrayCreateWrapper. In both cases, you must fill a VPIArrayData structure with the required information and pass it to the function.

Pyramids

VPI pyramids represent a collection of VPI images stacked together, all with the same format, but possibly with different dimensions.

A pyramid is characterized by its number of levels, base level dimensions, scale factor, and image format. The scale factor represents the ratio of one level's dimension over the prior level's dimension. For instance, when scale=0.5, the pyramid is dyadic, i.e., each level has half the width and height of the previous level.
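
The effect of the scale factor on level dimensions can be worked out with a few lines of arithmetic. The helper below is hypothetical, and its rounding rule (round to nearest, never below one pixel) is an assumption; VPI's exact rounding for scales that do not divide evenly may differ.

```python
def pyramid_level_sizes(base_width, base_height, num_levels, scale):
    """Return (width, height) for each level, from fine to coarse."""
    sizes = []
    width, height = float(base_width), float(base_height)
    for _ in range(num_levels):
        # Assumed rounding: nearest integer, with a floor of one pixel.
        sizes.append((max(1, round(width)), max(1, round(height))))
        width *= scale
        height *= scale
    return sizes


for level, (w, h) in enumerate(pyramid_level_sizes(640, 480, 4, 0.5)):
    print(f"level {level}: {w}x{h}")
```

For a 640x480 base and scale=0.5, the levels are 640x480, 320x240, 160x120, and 80x60.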

It is often necessary to process one pyramid level as the input or output of a VPI algorithm. Then you must use vpiImageCreateWrapperPyramidLevel to identify the pyramid and its level to be wrapped. The resulting image inherits the pyramid's enabled backends. You can use the returned VPIImage handle like any other image. When you are done using the image, you must destroy it with vpiImageDestroy.

For more information, see Pyramid in the "C API Reference" section of VPI - Vision Programming Interface.

Locking

As with images and arrays, you can access a whole pyramid's contents outside VPI using the function vpiPyramidLockData, provided that the pyramid has enabled the backend corresponding to the returned buffer type. See vpiPyramidLockData for more information. This function fills a VPIPyramidData structure that contains an array of VPIImageData. When you are done using the VPIPyramidData, call vpiPyramidUnlock to unmap the pyramid from the host and free its resources.

Recursive locking works for pyramids just as it does for images and arrays.

Events

Each compute function in the API is executed asynchronously with respect to the calling thread; that is, it returns immediately rather than waiting for the operation to complete. There are two ways to synchronize the operation with the backend.

One method is to wait until all of the commands in the VPIStream queue are finished by calling vpiStreamSync. This method is simple, but it can't provide synchronization that is fine-grained (e.g., "wait until function X is completed") or inter-stream (e.g., "wait until function C in stream D completes before running function A in stream B").

The other method provides more flexible synchronization by using VPIEvent objects. These objects are conceptually like binary semaphores, and are designed to mimic events in CUDA API closely:

  • You can capture all commands submitted to a VPIStream instance in an event instance (see vpiEventRecord). The event is signaled when all captured commands have been processed and removed from the VPIStream command queue.
  • You can perform inter-stream synchronization with the vpiStreamWaitEvent call, which pushes a command to the VPIStream queue that blocks processing of future queued commands until the given event is signaled.
  • The application can query the event's state with vpiEventQuery.
  • Application threads can block until the event is completed with vpiEventSync.
  • Events can be timestamped when completed.
  • You can compute the difference between timestamps on completed events in the same stream as well as between different streams.
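
The record and wait operations listed above can be mimicked with standard threading primitives. This is a conceptual sketch, not VPI code: `ToyWorkerStream` is a hypothetical class, a `threading.Event` stands in for a VPIEvent, and `record`, `wait_event`, and `sync` only imitate the roles of vpiEventRecord, vpiStreamWaitEvent, and vpiStreamSync.

```python
import queue
import threading


class ToyWorkerStream:
    """Toy FIFO command queue served by one worker thread (not VPI code)."""

    def __init__(self):
        self._commands = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            command = self._commands.get()
            command()
            self._commands.task_done()

    def submit(self, command):
        self._commands.put(command)  # returns immediately

    def record(self, event):
        self.submit(event.set)       # signaled once earlier commands have run

    def wait_event(self, event):
        self.submit(event.wait)      # holds back commands queued after it

    def sync(self):
        self._commands.join()        # block until the queue is drained


log = []
ready = threading.Event()
producer, consumer = ToyWorkerStream(), ToyWorkerStream()

consumer.wait_event(ready)                      # queued first, but must wait
consumer.submit(lambda: log.append("consume"))
producer.submit(lambda: log.append("produce"))
producer.record(ready)

producer.sync()
consumer.sync()
print(log)
```

The consumer's work is submitted first, yet it runs only after the producer's recorded event is signaled, which is the inter-stream ordering that a whole-stream synchronization cannot express.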

For more information, see Event in the "C API Reference" section of VPI - Vision Programming Interface.

Contexts

A context encapsulates all resources used by VPI to perform operations. It automatically cleans up these resources when the context is destroyed.

Every application CPU thread has an active context. Each context owns the VPI objects created while it is active.

By default, all application threads are associated with the same global context, which is created automatically by VPI when the first VPI resource is created. You do not need to perform any explicit context management in this case; everything is handled by VPI under the hood.

When finer control of contexts is needed, user-created contexts are an option. Once created, a context can be pushed to the current application thread's context stack, or can replace the current context. Both actions make the created context active. Refer to Context Stack for more information on how to manipulate contexts.

You can specify several properties associated with a context when you create it, such as which backends are supported by created objects when the context is active. This effectively allows you to mask support for a particular backend. For example, stream creation for a CUDA backend fails if the current context doesn't have the VPI_BACKEND_CUDA flag set. If you don't pass backend flags, the context inspects the running platform and enables the backends associated with all available hardware engines.

Note
The CPU backend cannot be masked out, and must always be supported as a fallback implementation.

Objects (buffers, payloads, events, etc.) cannot be shared among different contexts.

There is no limit to the number of created contexts except available memory.

For more information, see Context in the "C API Reference" section of VPI - Vision Programming Interface.

Global Context

By default, VPI creates a single global context before it creates any VPI objects. This global context is initially shared among all application threads, and cannot be destroyed by the user.

For most use cases, an application can use the global context. When an application requires finer control of how objects are grouped together, or it needs a degree of independence between pipelines, you may want to create and manipulate contexts explicitly.

Context Stack

Each application thread has a context stack not shared with other threads.

The top context in the stack is the current context for that thread.

By default, the context stack has one context in it, the global context. Consequently, all new threads have the global context set as their current context.

Setting a context current in a given stack amounts to replacing the top context, either the global context or the most recently pushed context, with the given context. The replaced context does not belong to the stack anymore.

However, pushing a context into a stack does not replace anything. The top context is kept in the stack and the new pushed context is put at the top, thereby becoming the new current context.

The user can push and pop contexts from the stack at will. This allows for temporarily creating pipelines in a new context without disturbing the existing context.

To avoid leakage, it is important to match the number of pushes and pops on a given context stack. Be aware that the context stack can have at most eight contexts in it.
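
The difference between setting a context current and pushing one can be sketched with a toy per-thread stack. This is not VPI code: `ToyContextStack` is hypothetical, contexts are represented by plain strings, and refusing to pop the last remaining entry is an assumption of the model.

```python
class ToyContextStack:
    """Toy model of one thread's context stack (not VPI code)."""

    MAX_DEPTH = 8  # limit stated in the text above

    def __init__(self, global_context="global"):
        self._stack = [global_context]

    @property
    def current(self):
        return self._stack[-1]

    @property
    def depth(self):
        return len(self._stack)

    def set_current(self, context):
        # Replaces the top; the old top no longer belongs to the stack.
        self._stack[-1] = context

    def push(self, context):
        if len(self._stack) >= self.MAX_DEPTH:
            raise OverflowError("context stack is full")
        self._stack.append(context)

    def pop(self):
        if len(self._stack) <= 1:
            # Assumption of this model: the last entry cannot be popped.
            raise IndexError("no pushed context to pop")
        return self._stack.pop()


stack = ToyContextStack()
stack.push("pipeline_a")         # global stays underneath
stack.set_current("pipeline_b")  # replaces pipeline_a, depth unchanged
print(stack.current, stack.depth)
popped = stack.pop()             # matching pop restores the previous context
print(popped, stack.current, stack.depth)
```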

Thread Safety

All API functions are thread-safe. Concurrent host access to API objects is serialized and executed in an unspecified order. All API calls use a VPIContext instance that is thread-specific and is stored in Thread Local Storage (TLS). If the context pointer for the current thread is NULL (no context is set), all API calls use the default global context created during library initialization.

API objects have no concept of thread affinity; that is, if several threads use the same context instance, an object created in one thread can safely be destroyed by another thread.

Most of the API functions are non-blocking. The functions that can block when called are vpiStreamSync, vpiStreamDestroy, vpiContextDestroy, vpiEventSync and the several vpiSubmit* functions (which block when the stream command queue is full). Since implicit synchronization in the API implementation is minimal, you must ensure that the resulting order of dependent function calls is legal.

Pipeline examples, and how to implement them using VPI, are explained in the following sections.

Simple Pipeline

In this example, a pipeline with a simple box filter operation is implemented to process an input image. This is quite similar to the Image Blurring tutorial.

The code for implementing the pipeline is as follows:

Python:
  1. Import the vpi module.

    import vpi
  2. Create the input image buffer to be used

    The example creates a 640x480 1-channel (grayscale) input image with unsigned 8-bit pixel elements, which are represented by vpi.Format.U8. VPI initializes images with zeros upon creation.

    Note
    This example creates an empty input image buffer, but in a real use case an existing memory buffer could be wrapped into a VPI image buffer, or an image from an earlier pipeline stage could be used. See the Image Blurring tutorial for a more complete example.
    input = vpi.Image((640,480), vpi.Format.U8)
  3. Within a Python context that defines vpi.Backend.CUDA as the default backend, call the box_filter method on the input image. The 3x3 box filter algorithm is executed by the CUDA backend on the default stream. The result is returned in a new image, output.

    with vpi.Backend.CUDA:
        output = input.box_filter(3)

C/C++:

Note
For simplicity, this example does not check function return values for errors. Consult the bundled samples for examples of simple but complete applications.
  1. Include the necessary headers. This example needs headers for image buffers, a stream, and the Box Filter algorithm.

    #include <vpi/algo/BoxFilter.h> // Declares functions that implement the Box Filter algorithm.
    #include <vpi/Image.h>          // Functions and structures for dealing with VPI images.
    #include <vpi/Stream.h>         // Declares functions dealing with VPI streams.
  2. Create the image buffers to be used.

    int main()
    {
    VPIImage input, output;
    vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &input);
    vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U8, 0, &output);

    The example creates a 640x480 1-channel (grayscale) input image with unsigned 8-bit pixel elements, which are represented by VPI_IMAGE_FORMAT_U8 enum. VPI initializes images with zeros upon creation. Pass all-zero image flags to indicate that this image may be used in all available hardware backends. This makes it easier to submit algorithms to different backends later on, at the cost of using more resources. The output image is created the same way.

    Note
    This example creates an empty input image buffer, but in a real use case an existing memory buffer could be wrapped into a VPI image buffer, or an image from an earlier pipeline stage could be used. See the Image Blurring tutorial for a more complete example.
  3. Create a stream to execute the algorithm. Pass all-zero stream flags to indicate that the algorithm may be executed in any available hardware backend, to be specified later.

    VPIStream stream;
    vpiStreamCreate(0, &stream);
  4. Submit the box filter algorithm to the stream, along with the input and output images and other parameters. In this case, the filter algorithm is a 3x3 box filter with clamp boundary condition. It is to be executed by the CUDA backend.

    vpiSubmitBoxFilter(stream, VPI_BACKEND_CUDA, input, output, 3, 3, VPI_BORDER_CLAMP);

    In general, because of the asynchronous nature of streams, the algorithm is enqueued on the stream's work thread, and the function returns immediately. Later it is submitted for execution in the backend. Using a work thread allows the program to continue assembling the processing pipeline, or do some other task, while the algorithm executes in parallel.

  5. Wait until the stream finishes processing.

    vpiStreamSync(stream);

    This function blocks until all algorithms submitted to the stream finish executing. The pipeline must do this before it can display the output, or save it to disk, etc.

  6. Destroy created objects.

    vpiStreamDestroy(stream);
    vpiImageDestroy(input);
    vpiImageDestroy(output);
    return 0;
    }

    When the pipeline finishes using the created objects, it destroys them to prevent memory leaks. Destroying a stream forces it to synchronize, but destroying an image that is still being used by an algorithm leads to undefined behavior, most likely resulting in a program crash.

This example is a good opportunity to examine how several VPI objects work together and to inspect the ownership relationships between them.

This is a conceptual structure of the provided C/C++ example:

Where:

  • Default Context is the context that is automatically created and made active. In this example, the default context owns the stream and the image buffers.
  • Stream owns a worker thread that queues and dispatches tasks to the backend devices and handles synchronization. It also owns objects that represent the hardware backends where algorithms are eventually executed.
  • Box Filter is the algorithm that is submitted to the stream. Internally Job 1 is created with the algorithm kernel and all its parameters. It is then enqueued on the work thread, which submits it to the hardware when all previous tasks submitted to it are completed. Since this algorithm doesn't have a payload (or state), there are no concerns about its lifetime.
  • Sync represents the vpiStreamSync call. It enqueues Job 2 onto the work thread, and the job signals an internal event when it is executed. The calling thread waits until the event is signaled, guaranteeing that all tasks queued so far have finished. Submissions by other threads are blocked until vpiStreamSync returns.
Note
In this example, since the work thread is empty when the algorithm is submitted and CUDA kernel executions are asynchronous, VPI submits the algorithm directly to the CUDA device, bypassing the work thread altogether.
When only the CUDA backend is used for algorithm submission and synchronization, VPI overhead on top of the underlying CUDA execution is generally minimized, and is often negligible. The same is true of streams that use only one of the other backends. Submitting algorithms to different backends in the same stream incurs a small internal synchronization overhead.

Complex Pipeline

More complex scenarios may take advantage of different acceleration processors on the device and create a pipeline that best utilizes its full computational power. To do that, the pipeline must have parallelizable stages.

The next example implements a full stereo disparity estimation and Harris corners extraction pipeline, which presents plenty of opportunities for parallelization.

The diagram reveals three opportunities for stage-level parallelization: the independent left and right image preprocessing, and the Harris corners extraction. The pipeline uses a different backend for each processing stage, depending on each backend's processing speed, power requirements, input and output restrictions, and availability. In this example, processing is split among the following backends:

  • VIC: Does stereo pair rectification and downscaling.
  • CUDA: Does image format conversion.
  • PVA: Does stereo disparity calculation.
  • CPU: Handles some preprocessing and extraction of Harris corners.

This choice of backends keeps the GPU free for processing other tasks, such as deep learning inference stages. The image format conversion operation is quite fast on CUDA, and does not interfere much. The CPU is kept busy extracting Harris keypoints undisturbed.

The following diagram shows how the algorithms are split into streams and how the streams are synchronized.

Both the left and right streams start stereo pair preprocessing, while the keypoints stream waits until the right grayscale image is ready. Once it is ready, Harris corner detection starts while the right stream continues preprocessing. When preprocessing ends on the left stream, the stream waits until the right downscaled image is ready. Finally, stereo disparity estimation starts with its two stereo inputs. At any point, the host thread can call vpiStreamSync on both the left and keypoints streams to wait until the disparity and keypoints data is ready for further processing or display.
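The ordering described above can be modeled without any VPI calls. The following is a minimal single-threaded sketch, not VPI's scheduler: each stream is a FIFO of commands, an event is a flag set by a record command, and a wait command blocks its stream until the flag is set. All type and function names (Stream, push, run_streams, simulate_pipeline, index_of) are hypothetical, and events are treated as one-shot flags, which is a simplification.

```c
#include <assert.h>
#include <string.h>

enum { MAX_CMDS = 8, MAX_EVENTS = 2, MAX_TRACE = 16 };

typedef enum { CMD_RUN, CMD_RECORD, CMD_WAIT } CmdKind;

typedef struct {
    CmdKind kind;
    const char *name; /* task name, used by CMD_RUN only */
    int event;        /* event index, used by CMD_RECORD and CMD_WAIT */
} Cmd;

typedef struct {
    Cmd cmds[MAX_CMDS];
    int count; /* number of queued commands */
    int next;  /* index of the next command to execute */
} Stream;

/* Appends a command to the end of a stream's queue. */
void push(Stream *s, CmdKind kind, const char *name, int event)
{
    assert(s->count < MAX_CMDS);
    Cmd c = { kind, name, event };
    s->cmds[s->count++] = c;
}

/* Executes the streams round-robin, one command per stream per pass.
   A stream whose next command waits on an unsignaled event is skipped.
   Names of executed tasks are appended to trace[]; returns how many ran. */
int run_streams(Stream *streams, int nstreams, const char *trace[], int max_trace)
{
    int signaled[MAX_EVENTS] = { 0 };
    int ntrace = 0;
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < nstreams; ++i) {
            Stream *s = &streams[i];
            if (s->next >= s->count)
                continue; /* stream drained */
            const Cmd *c = &s->cmds[s->next];
            if (c->kind == CMD_WAIT && !signaled[c->event])
                continue; /* blocked until the event is signaled */
            if (c->kind == CMD_RECORD)
                signaled[c->event] = 1;
            if (c->kind == CMD_RUN && ntrace < max_trace)
                trace[ntrace++] = c->name;
            s->next++;
            progress = 1;
        }
    }
    return ntrace;
}

/* Queues the stereo and keypoints pipeline on three streams and runs it. */
int simulate_pipeline(const char *trace[], int max_trace)
{
    enum { EV_RIGHT_GRAY = 0, EV_RIGHT_REDUCED = 1 };
    Stream st[3];
    memset(st, 0, sizeof(st));
    Stream *left = &st[0], *right = &st[1], *keypoints = &st[2];

    push(left, CMD_RUN, "remap_left", -1);
    push(left, CMD_RUN, "convert_left", -1);
    push(left, CMD_RUN, "rescale_left", -1);
    push(left, CMD_WAIT, NULL, EV_RIGHT_REDUCED);
    push(left, CMD_RUN, "stereo", -1);

    push(right, CMD_RUN, "remap_right", -1);
    push(right, CMD_RUN, "convert_right", -1);
    push(right, CMD_RECORD, NULL, EV_RIGHT_GRAY);
    push(right, CMD_RUN, "rescale_right", -1);
    push(right, CMD_RECORD, NULL, EV_RIGHT_REDUCED);

    push(keypoints, CMD_WAIT, NULL, EV_RIGHT_GRAY);
    push(keypoints, CMD_RUN, "harris", -1);

    return run_streams(st, 3, trace, max_trace);
}

/* Returns the position of a task name in the trace, or -1 if absent. */
int index_of(const char *trace[], int n, const char *name)
{
    for (int i = 0; i < n; ++i)
        if (strcmp(trace[i], name) == 0)
            return i;
    return -1;
}
```

In the resulting trace, harris always appears after convert_right and stereo always appears after both rescale_left and rescale_right, which is the dependency structure the two barriers are meant to enforce.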

The outline above explains the code that implements this pipeline:

  1. Include headers for all the objects used, as well as all of the required algorithms.
    #include <string.h>
    #include <vpi/Array.h>
    #include <vpi/Context.h>
    #include <vpi/Event.h>
    #include <vpi/Image.h>
    #include <vpi/LensDistortionModels.h>
    #include <vpi/Stream.h>
    #include <vpi/WarpMap.h>
    #include <vpi/algo/BilateralFilter.h>
    #include <vpi/algo/ConvertImageFormat.h>
    #include <vpi/algo/HarrisCorners.h>
    #include <vpi/algo/Remap.h>
    #include <vpi/algo/Rescale.h>
    #include <vpi/algo/StereoDisparity.h>
  2. Execute the initialization phase, where all the required objects are created.
    1. Create a context and make it active.

      Although you can use the default context that is created automatically to manage the VPI state, it may be more convenient to create a context and use it to handle all objects linked to a particular pipeline throughout their lifetimes. In the end, context destruction triggers destruction of the objects created under it. Using a dedicated context also yields better isolation between this pipeline and others that the application might use.

      int main()
      {
      VPIContext ctx;
      vpiContextCreate(0, &ctx);
      vpiContextSetCurrent(ctx);
    2. Create the streams.

      Create the streams with all zero flags, meaning that they can handle tasks for all backends.

      There are two streams to handle stereo pair preprocessing, and a third for Harris corner detection. When preprocessing is finished, stream_left is reused for stereo disparity estimation.

      VPIStream stream_left, stream_right, stream_keypoints;
      vpiStreamCreate(0, &stream_left);
      vpiStreamCreate(0, &stream_right);
      vpiStreamCreate(0, &stream_keypoints);
    3. Create the input image buffer wrappers.

      Assuming that the input comes from a capture pipeline as EGLImage, you can wrap the buffers in a VPIImage to be used in a VPI pipeline. All the pipeline requires is one frame (usually the first) from each stereo input.

      EGLImageKHR eglLeftFrame = /* First frame from left camera */;
      EGLImageKHR eglRightFrame = /* First frame from right camera */;
      VPIImage left, right;
      VPIImageData dataLeft;
      dataLeft.bufferType = VPI_IMAGE_BUFFER_EGLIMAGE;
      dataLeft.buffer.egl = eglLeftFrame;
      vpiImageCreateWrapper(&dataLeft, NULL, 0, &left);
      VPIImageData dataRight;
      dataRight.bufferType = VPI_IMAGE_BUFFER_EGLIMAGE;
      dataRight.buffer.egl = eglRightFrame;
      vpiImageCreateWrapper(&dataRight, NULL, 0, &right);
    4. Create the image buffers to be used.

      Like the simple pipeline, this pipeline creates empty input images. These input images must be populated either by wrapping images existing in memory, or from the output of an earlier VPI pipeline.

      The input is a 640x480 NV12 (color) stereo pair, typically output by camera capture pipelines. The temporary images are needed for storing intermediate results. The format conversion is necessary because the stereo disparity estimator and Harris corner extractor expect grayscale images. Moreover, stereo disparity expects its input to be exactly 480x270. This is accomplished by the rescale stage in the diagram above.

      VPIImage left_rectified, right_rectified;
      vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_NV12_ER, 0, &left_rectified);
      vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_NV12_ER, 0, &right_rectified);
      VPIImage left_grayscale, right_grayscale;
      vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U16, 0, &left_grayscale);
      vpiImageCreate(640, 480, VPI_IMAGE_FORMAT_U16, 0, &right_grayscale);
      VPIImage left_reduced, right_reduced;
      vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &left_reduced);
      vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &right_reduced);
      VPIImage disparity;
      vpiImageCreate(480, 270, VPI_IMAGE_FORMAT_U16, 0, &disparity);
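As a rough sizing aid for the buffers created in this step, the sketch below computes their payload sizes. It assumes tightly packed rows with no pitch padding; VPI may pad rows for alignment, so actual allocations can be larger. The function names are hypothetical and are not part of the VPI API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes in a tightly packed NV12 image: a full-resolution 8-bit luma plane
   followed by a half-resolution plane of interleaved U and V samples. */
size_t nv12_packed_bytes(int width, int height)
{
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    size_t luma = (size_t)width * (size_t)height;
    size_t chroma = (size_t)(width / 2) * (size_t)(height / 2) * 2;
    return luma + chroma;
}

/* Bytes in a tightly packed single-plane image of 16-bit unsigned pixels. */
size_t u16_packed_bytes(int width, int height)
{
    assert(width > 0 && height > 0);
    return (size_t)width * (size_t)height * sizeof(uint16_t);
}

/* Sum over the seven images created in this step: two rectified NV12 frames,
   two full-size grayscale frames, two reduced frames, and the disparity map. */
size_t pipeline_buffer_bytes(void)
{
    return 2 * nv12_packed_bytes(640, 480)
         + 2 * u16_packed_bytes(640, 480)
         + 3 * u16_packed_bytes(480, 270);
}
```

Under these assumptions the seven images occupy a little under 3 MB in total, all of it allocated once during initialization rather than inside the processing loop.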
    5. Define stereo disparity algorithm parameters and create the payload.

      Stereo disparity processing requires some temporary data. VPI calls this data a payload. In this example, vpiCreateStereoDisparityEstimator is called and passed all of the parameters required by the internal allocator to specify the size of the temporary data.

      Because the temporary data is allocated on a backend device, the payload is tightly coupled to the backend. If the same algorithm is to be executed in different backends, or concurrently using the same backend in different streams, it requires a payload for each backend or stream. In this example, the payload is created for execution by the PVA backend.

      As for algorithm parameters, the VPI stereo disparity estimator is implemented by a semi-global stereo matching algorithm. The estimator requires the census transform window size, specified as 5, and the maximum number of disparity levels, specified as 64. For more information, see Stereo Disparity Estimator.

      VPIStereoDisparityEstimatorParams stereo_params;
      stereo_params.windowSize = 5;
      stereo_params.maxDisparity = 64;
      VPIStereoDisparityEstimatorCreationParams stereo_creation_params;
      vpiInitStereoDisparityEstimatorCreationParams(&stereo_creation_params);
      stereo_creation_params.maxDisparity = stereo_params.maxDisparity;
      VPIPayload stereo;
      vpiCreateStereoDisparityEstimator(VPI_BACKEND_PVA, 480, 270, VPI_IMAGE_FORMAT_U16, &stereo_creation_params,
                                        &stereo);
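To make the windowSize parameter more concrete, the following sketch computes a 5x5 census signature and the Hamming-distance matching cost between a left pixel and a candidate right pixel. It illustrates the general census matching idea only; it is not VPI's implementation, it omits the semi-global aggregation step entirely, and the function names and bit ordering are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* 5x5 census signature of pixel (x, y): one bit per neighbor, set when the
   neighbor is darker than the center. The center is skipped, giving 24 bits. */
uint32_t census5x5(const uint16_t *img, int width, int height, int x, int y)
{
    assert(x >= 2 && y >= 2 && x < width - 2 && y < height - 2);
    uint32_t sig = 0;
    uint16_t center = img[y * width + x];
    for (int dy = -2; dy <= 2; ++dy) {
        for (int dx = -2; dx <= 2; ++dx) {
            if (dx == 0 && dy == 0)
                continue;
            uint32_t bit = img[(y + dy) * width + (x + dx)] < center ? 1u : 0u;
            sig = (sig << 1) | bit;
        }
    }
    return sig;
}

/* Number of differing bits between two signatures. */
int hamming(uint32_t a, uint32_t b)
{
    uint32_t v = a ^ b;
    int bits = 0;
    while (v) {
        v &= v - 1; /* clears the lowest set bit */
        ++bits;
    }
    return bits;
}

/* Matching cost of left pixel (x, y) against right pixel (x - d, y). */
int census_cost(const uint16_t *left, const uint16_t *right,
                int width, int height, int x, int y, int d)
{
    return hamming(census5x5(left, width, height, x, y),
                   census5x5(right, width, height, x - d, y));
}
```

A disparity search evaluates this cost for each candidate d below the maximum disparity (64 in this example) and, in a semi-global method, combines the costs with smoothness penalties along several directions before choosing the best d.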
    6. Create the image rectification payloads and corresponding parameters. Rectification performs lens distortion correction using the Remap algorithm. The stereo lens parameters are specified here. Because they differ between the left and right lenses, two remap payloads are created. For more details, see Lens Distortion Correction.

      /* Polynomial lens distortion coefficients; unused terms stay zero. */
      VPIPolynomialLensDistortionModel dist;
      memset(&dist, 0, sizeof(dist));
      dist.k1 = -0.126;
      dist.k2 = 0.004;
      const VPICameraIntrinsic Kleft =
      {
      {466.5, 0, 321.2},
      {0, 466.5, 239.5}
      };
      const VPICameraIntrinsic Kright =
      {
      {466.2, 0, 320.3},
      {0, 466.2, 239.9}
      };
      const VPICameraExtrinsic X =
      {
      {1, 0.0008, -0.0095, 0},
      {-0.0007, 1, 0.0038, 0},
      {0.0095, -0.0038, 0.9999, 0}
      };
      /* Dense warp grid: a single 640x480 region with control points every 4 pixels. */
      VPIWarpMap map;
      memset(&map, 0, sizeof(map));
      map.grid.numHorizRegions = 1;
      map.grid.numVertRegions = 1;
      map.grid.regionWidth[0] = 640;
      map.grid.regionHeight[0] = 480;
      map.grid.horizInterval[0] = 4;
      map.grid.vertInterval[0] = 4;
      vpiWarpMapAllocData(&map);
      /* The same map storage is reused: it is regenerated for each lens before its payload is created. */
      VPIPayload ldc_left;
      vpiWarpMapGenerateFromPolynomialLensDistortionModel(Kleft, X, Kleft, &dist, &map);
      vpiCreateRemap(VPI_BACKEND_VIC, &map, &ldc_left);
      VPIPayload ldc_right;
      vpiWarpMapGenerateFromPolynomialLensDistortionModel(Kright, X, Kleft, &dist, &map);
      vpiCreateRemap(VPI_BACKEND_VIC, &map, &ldc_right);
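The role of the k1 and k2 coefficients can be illustrated with a small standalone function. The sketch below applies a two-term radial polynomial to map a pixel of the corrected image to the position that would be sampled in the distorted input. It is a simplification under stated assumptions: zero skew, identical input and output intrinsics, no extrinsic rotation, and no tangential or higher-order terms. The Intrinsics and Point2 types and the distort_pixel function are hypothetical, and VPI's exact conventions may differ.

```c
#include <assert.h>

typedef struct {
    double fx, fy; /* focal lengths in pixels */
    double cx, cy; /* principal point in pixels */
} Intrinsics;

typedef struct {
    double x, y;
} Point2;

/* Maps pixel (u, v) of the corrected image to the location to sample in the
   distorted image, using radial terms k1 and k2 only. */
Point2 distort_pixel(Intrinsics K, double k1, double k2, double u, double v)
{
    assert(K.fx != 0.0 && K.fy != 0.0);
    /* Back-project to normalized camera coordinates. */
    double x = (u - K.cx) / K.fx;
    double y = (v - K.cy) / K.fy;
    /* Radial scaling factor 1 + k1*r^2 + k2*r^4. */
    double r2 = x * x + y * y;
    double radial = 1.0 + k1 * r2 + k2 * r2 * r2;
    /* Re-project the scaled point to pixel coordinates. */
    Point2 p = { K.fx * x * radial + K.cx, K.fy * y * radial + K.cy };
    return p;
}
```

With the left camera values used above (focal length 466.5, principal point (321.2, 239.5), k1 = -0.126, k2 = 0.004), the principal point maps to itself, and a point one focal length to its right is sampled closer to the center because the negative k1 term dominates.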
    7. Create output buffers for Harris Corner Detector.

      This algorithm receives an image and outputs two arrays, one with the keypoints themselves and another with the score of each keypoint. A maximum of 8192 keypoints are returned, and this must be the array capacity. Keypoints are represented by VPIKeypoint structures and scores by 32-bit unsigned values. For more information, see Harris Corner Detector.

      VPIArray keypoints, scores;
      vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT, 0, &keypoints);
      vpiArrayCreate(8192, VPI_ARRAY_TYPE_U32, 0, &scores);
    8. Define Harris detector parameters and create the detector's payload.

      Fill the VPIHarrisCornerDetectorParams structure with the required parameters. See the structure documentation for more information about each parameter.

      Like stereo disparity, the Harris detector requires a payload. This time only the input size (640x480) is needed. The payload accepts only inputs of this size.

      VPIHarrisCornerDetectorParams harris_params;
      vpiInitHarrisCornerDetectorParams(&harris_params);
      harris_params.gradientSize = 5;
      harris_params.blockSize = 5;
      harris_params.strengthThresh = 10;
      harris_params.sensitivity = 0.4f;
      VPIPayload harris;
      vpiCreateHarrisCornerDetector(VPI_BACKEND_CPU, 640, 480, &harris);
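For readers unfamiliar with the roles of blockSize and sensitivity, the sketch below evaluates the textbook Harris-Stephens score, det(M) - k * trace(M)^2, over a square block of precomputed gradients. It is an illustration only: the gradient arrays gx and gy are assumed to be given, VPI's gradient filters and internal scaling are not reproduced, and the function name is hypothetical.

```c
#include <assert.h>

/* Textbook Harris-Stephens score at (x, y). The structure tensor M is summed
   over a block x block window of the gradient images gx and gy. */
double harris_score_at(const float *gx, const float *gy, int width, int height,
                       int x, int y, int block, double k)
{
    int half = block / 2;
    assert(block > 0 && block % 2 == 1);
    assert(x - half >= 0 && y - half >= 0);
    assert(x + half < width && y + half < height);

    double sxx = 0.0, sxy = 0.0, syy = 0.0;
    for (int j = y - half; j <= y + half; ++j) {
        for (int i = x - half; i <= x + half; ++i) {
            double ix = gx[j * width + i];
            double iy = gy[j * width + i];
            sxx += ix * ix;
            sxy += ix * iy;
            syy += iy * iy;
        }
    }
    double det = sxx * syy - sxy * sxy;
    double trace = sxx + syy;
    return det - k * trace * trace;
}
```

In this form, a flat block scores zero, a block with gradients in one direction only (an edge) scores negative, and a block with strong gradients in both directions scores positive for small k, such as the common textbook value 0.04. Because det(M) can never exceed a quarter of trace(M)^2, this textbook formula yields no positive scores once k reaches 0.25; how VPI scales its sensitivity parameter is not described in this section, so the sketch should not be used to interpret the value 0.4 above.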
    9. Create events to implement barrier synchronization.

      Events are used for inter-stream synchronization. They are implemented with VPIEvent. The pipeline needs two barriers: one to wait for the input to Harris corner extraction to be ready, and the other for the preprocessed right image.

      VPIEvent barrier_right_grayscale, barrier_right_reduced;
      vpiEventCreate(0, &barrier_right_grayscale);
      vpiEventCreate(0, &barrier_right_reduced);
  3. After initialization comes the main processing phase, which implements the pipeline by submitting algorithms and events to the streams in the correct order. The pipeline's main loop can do this many times using the same events, payloads, temporary buffers, and output buffers. The input is usually redefined for each iteration, as shown below.
    1. Submit the left frame processing stages.

      Lens distortion correction, image format conversion, and downscaling are submitted to the left stream. Note again that the submit operations are non-blocking and return immediately.

      vpiSubmitRemap(stream_left, VPI_BACKEND_VIC, ldc_left, left, left_rectified, VPI_INTERP_CATMULL_ROM,
                     VPI_BORDER_ZERO, 0);
      vpiSubmitConvertImageFormat(stream_left, VPI_BACKEND_CUDA, left_rectified, left_grayscale, NULL);
      vpiSubmitRescale(stream_left, VPI_BACKEND_VIC, left_grayscale, left_reduced, VPI_INTERP_LINEAR, VPI_BORDER_CLAMP,
                       0);
    2. Submit the first few stages of the right frame preprocessing.

      The lens distortion correction and image format conversion stages result in a grayscale image for input to Harris corner extraction.

      vpiSubmitRemap(stream_right, VPI_BACKEND_VIC, ldc_right, right, right_rectified, VPI_INTERP_CATMULL_ROM,
                     VPI_BORDER_ZERO, 0);
      vpiSubmitConvertImageFormat(stream_right, VPI_BACKEND_CUDA, right_rectified, right_grayscale, NULL);
    3. Record the right stream state so that the keypoints stream can synchronize to it.

      The keypoint stream can only start when its input is ready. First, the barrier_right_grayscale event must record the right stream state by submitting a task to it that will signal the event when format conversion completes.

      vpiEventRecord(barrier_right_grayscale, stream_right);
    4. Finish the right frame preprocessing with a downscale operation.

      vpiSubmitRescale(stream_right, VPI_BACKEND_VIC, right_grayscale, right_reduced, VPI_INTERP_LINEAR, VPI_BORDER_CLAMP,
      0);
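What a linear-interpolation downscale with clamped borders computes can be sketched in plain C. The code below is a reference-style illustration, not the VIC implementation: it assumes a pixel-center sampling convention, tightly packed 16-bit rows, and no prefiltering, and its function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest integer not greater than v, without depending on math.h. */
int floor_to_int(double v)
{
    int i = (int)v;
    return (v < i) ? i - 1 : i;
}

/* Clamps an index to the range [0, hi]. */
int clamp_index(int v, int hi)
{
    return v < 0 ? 0 : (v > hi ? hi : v);
}

/* Bilinear rescale of a tightly packed 16-bit image. Samples that fall
   outside the source are clamped to the nearest border pixel. */
void rescale_bilinear_u16(const uint16_t *src, int sw, int sh,
                          uint16_t *dst, int dw, int dh)
{
    assert(sw > 0 && sh > 0 && dw > 0 && dh > 0);
    double sx = (double)sw / dw;
    double sy = (double)sh / dh;
    for (int y = 0; y < dh; ++y) {
        /* Source row coordinate of this output pixel's center. */
        double fy = (y + 0.5) * sy - 0.5;
        int y0 = floor_to_int(fy);
        double wy = fy - y0;
        int ya = clamp_index(y0, sh - 1);
        int yb = clamp_index(y0 + 1, sh - 1);
        for (int x = 0; x < dw; ++x) {
            double fx = (x + 0.5) * sx - 0.5;
            int x0 = floor_to_int(fx);
            double wx = fx - x0;
            int xa = clamp_index(x0, sw - 1);
            int xb = clamp_index(x0 + 1, sw - 1);
            double top = src[ya * sw + xa] * (1.0 - wx) + src[ya * sw + xb] * wx;
            double bot = src[yb * sw + xa] * (1.0 - wx) + src[yb * sw + xb] * wx;
            /* Blend the two rows and round to the nearest integer. */
            dst[y * dw + x] = (uint16_t)(top * (1.0 - wy) + bot * wy + 0.5);
        }
    }
}
```

For the 640x480 to 480x270 step in this pipeline, the horizontal and vertical scale factors are 640/480 and 480/270 respectively, so the aspect ratio changes from 4:3 to 16:9.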
    5. Record the right stream state so that the left stream can synchronize to it.

      With the whole of right preprocessing submitted, the stream state must be recorded again so that the left stream can wait until the right frame is ready.

      vpiEventRecord(barrier_right_reduced, stream_right);
    6. Make the left stream wait until the right frame is ready.

      Stereo disparity requires the left and right frames to be ready. The pipeline uses vpiStreamWaitEvent to submit a task to the left stream that will wait until the barrier_right_reduced event is signaled on the right stream, meaning that right frame preprocessing is finished.

      vpiStreamWaitEvent(stream_left, barrier_right_reduced);
    7. Submit the stereo disparity algorithm.

      The input images are now ready. Call vpiSubmitStereoDisparityEstimator to submit the disparity estimator.

      vpiSubmitStereoDisparityEstimator(stream_left, VPI_BACKEND_PVA, stereo, left_reduced, right_reduced, disparity,
                                        NULL, &stereo_params);
    8. Submit the keypoint detector pipeline.

      For keypoint detection, first submit a wait operation on the barrier_right_grayscale event to make the pipeline wait until the input is ready. Then submit the Harris corner detector on it.

      vpiStreamWaitEvent(stream_keypoints, barrier_right_grayscale);
      vpiSubmitHarrisCornerDetector(stream_keypoints, VPI_BACKEND_CPU, harris, right_grayscale, keypoints, scores,
                                    &harris_params);
    9. Synchronize the streams to use the disparity map and keypoints detected.

      Remember that the functions called so far in the processing phase are all asynchronous; they return immediately once the job is queued on the stream for later execution.

      More processing can now be performed on the main thread, such as updating GUI status information or displaying the previous frame. This occurs while VPI is executing the pipeline. Once this additional processing is performed, the streams that process the final result from the current frame must be synchronized using vpiStreamSync. Then the resulting buffers can be accessed.

      vpiStreamSync(stream_left);
      vpiStreamSync(stream_keypoints);
    10. Fetch the next frame and update the input wrappers.

      The existing input VPI image wrappers can be redefined to wrap the next two stereo pair frames, provided that their dimensions and format are the same. This operation is quite efficient, as it is done without heap memory allocations.

      eglLeftFrame = /* Fetch next frame from left camera */;
      eglRightFrame = /* Fetch next frame from right camera */;
      dataLeft.buffer.egl = eglLeftFrame;
      vpiImageSetWrapper(left, &dataLeft);
      dataRight.buffer.egl = eglRightFrame;
      vpiImageSetWrapper(right, &dataRight);
  4. Destroy the context.

    This example has created many objects under the current context. Once all processing is completed and the pipeline is no longer needed, destroy the context. All streams are then synchronized and destroyed, along with all other objects used. No memory leaks are possible.

    Destroying the current context reactivates the context that was active before the current one became active.

    vpiContextDestroy(ctx);
    return 0;
    }

Important takeaways from these examples:

  • Algorithm submission returns immediately.
  • Algorithm execution occurs asynchronously with respect to the host thread.
  • Different streams can use the same buffers, although you must avoid race conditions by using events.
  • A context owns all objects created by one user thread while it is in an active state. This allows some interesting scenarios where one thread sets up the context and triggers all the processing pipeline, then moves the whole context to another thread that waits for the pipeline to end, then triggers further data processing.