VPI is a library that provides a collection of computer vision and image processing algorithms that can be seamlessly executed on a variety of hardware accelerators, called backends.
The goal is to provide a uniform interface to these backends while maintaining high performance. To achieve this, VPI couples high-performance algorithm implementations with several mechanisms for sharing memory mappings between backends, chosen according to the memory's characteristics, and with backend-agnostic event synchronization.
The VPI architectural overview is as follows:
The API follows a paradigm where object allocation and setup take place in an initialization phase. The application loop follows, where the main processing occurs using the objects created during initialization. Once processing is complete, the created objects are destroyed and the environment is cleaned up. This structure is beneficial for robotics applications, where memory allocation is often restricted in both time and space.
The core components of VPI include:
Pipeline examples, and how to implement them using VPI, are explained in the following sections.
In this example, a simple pipeline is implemented: a box filter is applied to a given input image, producing a blurred version of it.
The code for implementing the pipeline is as follows.
Include the necessary headers. This example uses image buffers, a stream, and the BoxImageFilter algorithm.
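A minimal sketch of the includes follows; the header paths are assumptions (they have changed across VPI releases), so check your installed version.

```c
#include <vpi/Image.h>                /* VPIImage creation and destruction   */
#include <vpi/Stream.h>               /* VPIStream creation and submission   */
#include <vpi/algo/BoxImageFilter.h>  /* box filter algorithm (path assumed) */
```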
Create the image buffers to be used.
A 640x480, 1-channel (grayscale) input image is created with unsigned 8-bit pixel elements, represented by the VPI_IMAGE_TYPE_Y8 enum (Y stands for luma). Images are initialized with zeros upon creation. Since the intent is to work with all supported backends, 0 is passed as the flags, along with a pointer to where the image handle will be written. The output image is created the same way.
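A sketch of this step; vpiImageCreate is used here with the argument order described above, though the exact prototype may differ between VPI versions.

```c
VPIImage input, output;

/* 640x480, 1 channel, unsigned 8-bit pixels; flags=0 enables all backends. */
vpiImageCreate(640, 480, VPI_IMAGE_TYPE_Y8, 0, &input);
vpiImageCreate(640, 480, VPI_IMAGE_TYPE_Y8, 0, &output);
```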
Create a CUDA stream to execute the algorithm on the first CUDA device detected on the system. The function returns an error if no such devices exist or can be used.
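Continuing the sketch; the device-type enum name is an assumption based on this version's "device type" terminology.

```c
VPIStream stream;

/* Fails if no CUDA-capable device exists or can be used. */
vpiStreamCreate(VPI_DEVICE_TYPE_CUDA, &stream);
```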
Submit the box filter algorithm to the stream, passing along the input and output images and other parameters, such as the 3x3 kernel size and the clamp boundary condition.
Because of the asynchronous nature of streams, the algorithm is pushed to the CUDA execution queue of the stream, and the function returns immediately. This allows the program to continue assembling the processing pipeline, or do something else, while the algorithm executes.
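A sketch of the submission; the vpiSubmitBoxImageFilter name and its kernel-size and boundary-condition arguments are assumptions derived from the algorithm description above.

```c
/* 3x3 kernel, clamp boundary condition; returns immediately. */
vpiSubmitBoxImageFilter(stream, input, output, 3, 3, VPI_BOUNDARY_COND_CLAMP);
```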
Wait until the stream finishes processing.
This function blocks until all algorithms submitted to the stream finish executing. It must be called before the output can be used, e.g., shown to the user or saved to disk.
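The corresponding call, continuing the sketch:

```c
/* Blocks until every algorithm submitted to the stream has finished. */
vpiStreamSync(stream);
```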
Destroy created objects.
Upon completion, destroy the created objects to avoid memory leaks. Destroying a stream forces it to synchronize, but destroying images that are still being used by an algorithm leads to undefined behavior, likely resulting in a program crash.
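A sketch of the cleanup, continuing from above:

```c
vpiImageDestroy(input);    /* safe here: the stream was already synchronized */
vpiImageDestroy(output);
vpiStreamDestroy(stream);  /* destroying a stream also forces it to sync     */
```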
It is instructive to examine how the several VPI objects work together and to inspect the ownership relationships between them. The structure of the example is as follows.
Where:
Finally, the call to vpiStreamSync adds a Sync event job Job2 to the queue and blocks processing on the calling thread. This event gets signaled once Job1 finishes, unblocking the call and allowing the calling thread to continue.
More complex scenarios can be envisioned that take advantage of different acceleration processors on the device and create a pipeline that strives to fully utilize the computational power.
To do that, the pipeline must have parallelizable stages. For example, consider a pipeline that, given a stereo pair, calculates the stereo disparity and extracts Harris keypoints from the right image.
Three opportunities for parallelization are identified: the independent left and right image pre-processing, and the Harris keypoint extraction. A different backend can be assigned to each of these parallel streams, depending on the processing speed of each backend, power requirements, input and output restrictions, and availability. In this example, the processing is split among the CUDA, PVA and CPU backends:
The rationale for this choice is that since PVA can handle up to four streams completely in parallel, two of them can be used to process the stereo pair. The CPU, which is usually slower, is kept busy extracting Harris keypoints undisturbed. Finally, CUDA picks up the PVA output and calculates the stereo disparity at the end.
In the sequence diagram that follows:
As shown in the diagram, the fact that CUDA is idle during PVA processing leaves it free to do some other tasks outside VPI, such as Deep Learning inference stages.
Notice how a barrier is implemented using the event primitives. Stereo disparity must start only after the PVA streams complete processing their data. One event is set up for each PVA stream, to be signaled when that stream is done. The CUDA stream waits until this happens; only then is it allowed to proceed and work on the PVA output.
The code that implements this pipeline is explained as follows.
Create a context and make it active.
Although the default context that is automatically created to manage the VPI state could be used, it is easier to create a dedicated context and use it to handle the lifetime of all objects created under it. When the context is destroyed, all of these objects are automatically destroyed with it, reducing the chance of memory leaks. This approach makes sense here because many objects are being created.
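A sketch of this step; vpiContextCreate and vpiContextSetCurrent are assumed names for the context creation and activation calls.

```c
VPIContext ctx;
vpiContextCreate(0, &ctx);   /* flags=0: no backend is masked out           */
vpiContextSetCurrent(ctx);   /* objects created from here on belong to ctx  */
```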
Create the streams.
During stream creation, specify the backend, also called device type, that will ultimately execute the submitted algorithms.
Two PVA streams are instantiated. On Jetson AGX Xavier there are two PVA processors, each capable of processing two streams in parallel. VPI implements a simplistic form of load balancing between them: PVA stream creation picks one of the available PVA processors in a round-robin fashion, and the chosen processor, in turn, picks one of its available streams, also round-robin.
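A sketch of the four streams, one per parallel branch of the pipeline; the device-type enum names are assumptions.

```c
VPIStream streamLeft, streamRight, streamCuda, streamCpu;
vpiStreamCreate(VPI_DEVICE_TYPE_PVA,  &streamLeft);   /* left-image pre-processing  */
vpiStreamCreate(VPI_DEVICE_TYPE_PVA,  &streamRight);  /* right-image pre-processing */
vpiStreamCreate(VPI_DEVICE_TYPE_CUDA, &streamCuda);   /* stereo disparity           */
vpiStreamCreate(VPI_DEVICE_TYPE_CPU,  &streamCpu);    /* Harris keypoints           */
```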
Create stereo pipeline images.
Again, similar to the simple pipeline, create empty input images. In a real application, these input images would be populated either by wrapping existing memory or as the result of an earlier VPI pipeline.
The input is a 640x480 stereo pair with 16-bit unsigned pixels (HDR). Temporary images are created for the blurred and reduced copies, along with the image that will receive the estimated disparity map. The input is down-sampled to make stereo disparity processing faster, since a high-resolution disparity map is not required.
Define stereo disparity algorithm parameters.
The VPI stereo disparity estimator is implemented by a semi-global stereo matching algorithm. The estimator requires the census transform window size, specified as 5, and the maximum disparity levels, specified as 64.
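A sketch of the parameters; the structure and field names (VPIStereoDisparityEstimatorParams, windowSize, maxDisparity) are assumptions, so check the structure documentation.

```c
VPIStereoDisparityEstimatorParams stereoParams;
stereoParams.windowSize   = 5;   /* census transform window size */
stereoParams.maxDisparity = 64;  /* maximum disparity levels     */
```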
Create the stereo disparity payload.
Stereo disparity processing requires some temporary data, which VPI calls a payload. In this example, vpiCreateStereoDisparityEstimator is called and passed all the parameters the internal allocator needs to determine the size of this temporary data.
Because the temporary data is allocated on a backend device, the payload is tightly coupled to that backend. If the same algorithm must run on different streams, one payload per stream must be created. In this example, the payload is created for the CUDA stream.
Create image and array buffers for Harris keypoint detector.
This algorithm receives an image and outputs two arrays, one with the keypoints themselves and the other with the score of each keypoint. At most 8192 keypoints can be found. Keypoints are represented by the VPIKeypoint structure, and scores are 32-bit unsigned values.
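A sketch of the array creation; the array type enums are assumptions, and the capacity of 8192 bounds how many keypoints can be returned.

```c
VPIArray keypoints, scores;
vpiArrayCreate(8192, VPI_ARRAY_TYPE_KEYPOINT, 0, &keypoints);  /* VPIKeypoint elements   */
vpiArrayCreate(8192, VPI_ARRAY_TYPE_U32,      0, &scores);     /* 32-bit unsigned scores */
```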
Define Harris detector parameters.
Fill the VPIHarrisKeypointDetectorParams structure with the required parameters. Refer to the structure documentation for more information about each parameter.
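An illustrative sketch; the field names and values below are assumptions, so refer to VPIHarrisKeypointDetectorParams for the actual members.

```c
VPIHarrisKeypointDetectorParams harrisParams;
harrisParams.gradientSize   = 5;       /* gradient window size (assumed field)      */
harrisParams.blockSize      = 5;       /* corner response window size (assumed)     */
harrisParams.strengthThresh = 20.0f;   /* minimum corner response to keep (assumed) */
harrisParams.sensitivity    = 0.01f;   /* Harris sensitivity factor k (assumed)     */
```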
Create the Harris detector payload.
Like stereo disparity, the Harris detector requires a payload. This time only the input size, 640x480, is needed. When using this payload, only inputs of this size are accepted.
Create the events to implement a barrier synchronization.
The CUDA stream must wait for the two PVA streams to finish their jobs; consequently, two events are required, one for each stream.
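A sketch of the event creation; the vpiEventCreate prototype (flags plus output handle) is an assumption.

```c
VPIEvent evLeftDone, evRightDone;
vpiEventCreate(0, &evLeftDone);    /* signaled when the left PVA stream is done  */
vpiEventCreate(0, &evRightDone);   /* signaled when the right PVA stream is done */
```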
Implement the keypoint detector pipeline. Perform a simplistic de-noise pass with the bilateral filter, then use the Harris detector to get the keypoints.
Start with the pre-processing of the left image: a low-pass filter followed by image down-sampling using bilinear interpolation.
The right image processing is exactly the same.
Implement the barrier synchronization.
Use the vpiEventRecord function to record, in each event, the state of the corresponding PVA stream at this point. Since all jobs have already been submitted, each event is signaled once all the jobs submitted to its stream up to this point have been processed.
Then the vpiStreamWaitFor calls submit jobs to the CUDA stream that make it wait until both events are signaled. By the time the stream execution is released from this wait, the data for stereo disparity is ready to be processed.
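A sketch of the barrier, continuing from the streams and events created above; the argument order of vpiEventRecord and vpiStreamWaitFor is assumed.

```c
vpiEventRecord(evLeftDone,  streamLeft);    /* mark the end of left pre-processing  */
vpiEventRecord(evRightDone, streamRight);   /* mark the end of right pre-processing */

vpiStreamWaitFor(streamCuda, evLeftDone);   /* CUDA stream waits for both events    */
vpiStreamWaitFor(streamCuda, evRightDone);
```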
Submit the stereo disparity algorithm.
At this point the input images are ready. Call vpiSubmitStereoDisparityEstimator to submit the algorithm to the CUDA stream for processing.
Synchronize the streams to use the disparity map and keypoints detected.
Remember that the functions called so far in the processing phase are all asynchronous; they return immediately once the job is queued on the stream for later execution.
Now, more processing can be performed on the main thread, such as updating some GUI status or showing the previous frame, while VPI executes the pipeline. Once this additional processing is done, synchronize the streams that produce the final results using vpiStreamSync. After that, the resulting buffers can be accessed, and the application can either loop back to the beginning of the processing phase or proceed to the deinitialization phase.
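A sketch of the final synchronization:

```c
vpiStreamSync(streamCuda);   /* disparity map is now ready     */
vpiStreamSync(streamCpu);    /* Harris keypoints are now ready */
```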
Context destruction.
In this example, many objects were created under the current context. Once all processing is completed and the pipeline is no longer required, destroy the context. All streams are then synchronized and destroyed, along with all other objects created under the context, so no memory is leaked.
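A sketch of the teardown:

```c
/* Synchronizes and destroys the streams and every other object owned by ctx. */
vpiContextDestroy(ctx);
```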
Destroying the current context reactivates the context that was active before it was made current.
Important takeaways from these examples:
VPI contexts serve as a container of other VPI objects along with some configurations that apply to them.
Every host thread has an active context. VPI objects created while a context is active are owned by it.
For user-created contexts, the user can specify during creation which backends the context supports. This effectively allows the user to mask support for particular hardware. For example, creating a stream for the CUDA backend fails if the current context has the VPI_CONTEXT_NO_CUDA flag set. These flags are automatically updated to reflect the actual platform support for a particular backend; for example, if the PVA backend is not available, contexts have the VPI_CONTEXT_NO_PVA flag set by default. Note, however, that the CPU backend cannot be masked out; it must always be available as a fallback implementation.
Sharing objects (buffers, payloads, events, ...) among different contexts is not permitted.
There can be a maximum of 8 user contexts created at any given time.
The current context can be manipulated by the user if needed.
By default, there is a single global context created before the first VPI object is created.
This global context is initially shared among all user threads and cannot be destroyed by the user.
For most applications, the global context is sufficient. When finer control over how objects are grouped together is required, or some level of independence between pipelines is needed, the user may want to explicitly create and manipulate contexts.
Each user thread has a context stack not shared with other threads.
The top context in the stack is the current context for that thread.
By default, the context stack has one context in it, the global context. Consequently, all new threads have the same global context set as their current context.
Making a context current in a given stack amounts to replacing the top context, either the global context or the most recently pushed context, with the given context. The replaced context does not belong to the stack anymore.
However, pushing a context into a stack does not replace anything. The top context is kept in the stack and the new pushed context is put at the top, thereby becoming the new current context.
The user can push and pop contexts from the stack at will. This allows for temporarily creating pipelines in a new context without disturbing the existing context.
To avoid leakage, it is important to match the number of pushes and pops on a given context stack. Be aware that the context stack can have at most 8 contexts in it.
The main entry-point to the API is the VPIStream object. Creating a VPIStream instance for a specific backend allocates system resources and sets up the hardware to run API functionality. Submitting work pushes a command to the VPIStream instance's command queue and returns immediately; the queued commands get consumed by the underlying backend implementation. Algorithms that need temporary resources keep them in a VPIPayload. There can be at most 8 streams allocated in a given context.
The API supports creating stream instances for three backends: CUDA, CPU, and PVA. The total number of stream instances for each backend is currently limited to 8, and at most 8 streams of any type can be created under the same VPI context. VPI permits the simultaneous creation and use of multiple streams for the same backend; for many CV algorithms, this is the only way to saturate the underlying backend (CUDA/PVA).
Each stream may launch a task queue thread to handle asynchronous task execution. Exactly when this thread is created is unspecified, but it usually happens upon stream creation and the thread lasts until the stream is destroyed.
Buffers represent the data VPI algorithms work with. Abstractions for three kinds of data are provided:
Users can have VPI manage the allocation of all three types of buffers, or, for images and arrays, existing memory can be wrapped into a VPI buffer. This is useful when interoperability with other libraries is required, such as using an OpenCV cv::Mat buffer as input to a VPI algorithm.
Common attributes for all buffer types are their size and the element type.
VPI images represent any kind of 2D data, such as images themselves, vector fields embedded in a 2D space, 2D heat maps, etc.
The images are characterized by their width, height, element type and flags.
The flags are used to specify which backends can work with them. Passing 0 makes the image work with all available backends. If some backends won't be used, pass any of the following flags, or'ed together if needed:
VPI also provides the following helper flags when only one backend is needed:
Image data can be accessed from the host using the vpiImageLock function. The function requires that the image have the CPU backend enabled. It fills a VPIImageData structure with the information needed to properly address and interpret all image pixels. Once the user is done working on the image data from the host, vpiImageUnlock must be called. While an image is locked, it can't be locked again, nor accessed from an algorithm running asynchronously in a VPI stream.
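A sketch of the lock/unlock pattern; applying VPI_LOCK_READ_WRITE to images is assumed here, and the VPIImageData members used to address pixels are omitted.

```c
VPIImageData imgData;
vpiImageLock(image, VPI_LOCK_READ_WRITE, &imgData);

/* ... read or write pixels through the plane information in imgData ... */

vpiImageUnlock(image);   /* required before VPI can use the image again */
```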
VPI supports a variety of image types representing different element types, such as single-channel 8-, 16- or 32-bit, unsigned and signed, multi-channel RGB and RGBA, multi-planar NV12. Not all algorithms support images with all types. The image type documentation defines the proper constraints on images.
Users can create images that wrap externally allocated CUDA and host memory using the functions vpiImageWrapCudaDeviceMem and vpiImageWrapHostMem respectively. In both cases, the user must fill a VPIImageData structure with the required information and pass it to the function.
It's also possible to wrap an EGLImage handle using vpiImageWrapEglImage. In all these cases, the VPIImage object doesn't own the memory buffer. When the VPIImage is destroyed, the buffer isn't deallocated.
VPI arrays represent 1D data, such as keypoint lists, bounding boxes, transforms, etc.
Arrays are characterized by their capacity, size, element type and flags. As with images, the flags are used to specify which backend can work with them.
The following flags are available for arrays:
Similarly, when only one backend is needed, pass one of the following flags:
Array data can be accessed from host using the vpiArrayLock function. It works like its image counterpart.
VPIArray has a unique feature: while the capacity of the array is fixed for the lifetime of the object, its size can change. Any API that outputs to an array sets the size parameter to the number of valid elements it contains. The user can also call vpiArrayGetSize and vpiArraySetSize to query and modify the size of an array, but these APIs can only be used while the array is locked and accessible from the host. Another constraint is that updating the array's size requires the array to be locked for writing, done by passing VPI_LOCK_READ_WRITE to vpiArrayLock.
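A sketch of querying and updating the size while the array is locked for writing; the int32_t size type and the vpiArrayUnlock name are assumptions.

```c
VPIArrayData arrData;
vpiArrayLock(array, VPI_LOCK_READ_WRITE, &arrData);

int32_t size;
vpiArrayGetSize(array, &size);      /* number of valid elements    */
vpiArraySetSize(array, size - 1);   /* e.g., drop the last element */

vpiArrayUnlock(array);
```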
Users can also create arrays that wrap externally allocated CUDA and host memory using the functions vpiArrayWrapCudaDeviceMem and vpiArrayWrapHostMem respectively. In both cases, the user must fill a VPIArrayData structure with the required information and pass it to the function.
VPI pyramids represent a collection of VPI images stacked together, all having the same type, but possibly different dimensions.
Pyramids are characterized by their number of levels, base level dimensions, scale factor and image type. The scale factor represents the ratio of one level dimension over the prior level dimension. For instance, when scale=0.5, the pyramid is dyadic.
Often it's necessary to use one pyramid level as input or output of a VPI algorithm. To do that, use vpiImageWrapPyramidLevel, specifying the pyramid and which level is to be wrapped. The returned VPIImage handle can be used like any other image. Once work on this image is done, it must be destroyed with vpiImageDestroy.
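A sketch of this pattern; the argument order of vpiImageWrapPyramidLevel is assumed.

```c
VPIImage level0;
vpiImageWrapPyramidLevel(pyramid, 0, &level0);   /* wrap the base (finest) level */

/* ... use level0 as the input or output of any algorithm ... */

vpiImageDestroy(level0);   /* destroys the wrapper only, not the pyramid data */
```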
As with images and arrays, the user can access the whole pyramid data from the host using the function vpiPyramidLock, provided that the pyramid is enabled for the CPU backend. This function fills a VPIPyramidData structure, which is basically a vector of VPIImageData. Once work with VPIPyramidData is done, call vpiPyramidUnlock to unmap the pyramid from the host and free resources. While a pyramid is locked, it can't be locked again, nor be used by a VPI algorithm.
Each compute function in the API is executed asynchronously with respect to the calling thread, i.e., it returns immediately without waiting for completion. There are two ways of synchronizing with the backend. One is to wait until all the commands in the VPIStream queue are finished, with a vpiStreamSync call. This approach, while simple, doesn't allow for fine-grained ("wait until function X is completed") or inter-stream ("before running function A in stream B, wait until function C in stream D finishes") synchronization. That's where VPIEvent objects come in. Conceptually they correspond to binary semaphores and are designed to closely mimic events in the CUDA API.
All API functions are thread-safe. Concurrent host access to API objects is serialized and executed in an unspecified order. All API calls use a VPIContext instance that is thread-specific and stored in TLS. If the context pointer for the current thread is NULL (no context is set), all API calls use the default "global" context created during library initialization. API objects have no concept of thread affinity; in other words, if two threads use the same context instance, an object created in one thread can be safely destroyed by another thread.
Most of the API functions are non-blocking. Specifically, the set of functions that can block when called is limited to: vpiStreamSync, vpiStreamDestroy, vpiContextDestroy, vpiEventSync, and the several vpiSubmit* functions when the stream job queue is full. Since implicit synchronization in the API implementation is minimal, it is up to the user to make sure the resulting order of dependent function calls is legal. Invalid calls, however, should always be handled gracefully (via an appropriate error code) and should not lead to application crashes or corruption of objects' internal state.
The device command queue model is loosely based on the CUDA Stream API: work is enqueued on streams and executed asynchronously, and fine-grained, inter-stream synchronization is performed with VPIEvent objects.