This page explains how VPI performance information is gathered, how to interpret it and how you can reproduce the results.
For performance comparisons between VPI and other computer vision libraries, see the Performance Comparison page.
Most algorithm description pages have a section that shows the running time of a single call/iteration, along with the parameters used. This information can be helpful for performance-critical applications, allowing the user to evaluate the impact of certain parameter/backend combinations on application performance, such as:
In each table, the target Jetson device and the number of parallel algorithm invocations can be selected, i.e., the number of parallel VPIStream instances executing the algorithm. This allows analysis of how the average running time of each algorithm invocation changes with the number of parallel algorithm executions. Say one VPIStream is set up to apply a box filter to an image. This operation alone takes a certain amount of time. Now suppose the program must process several images in parallel, each one in its own VPIStream. How does the average running time of each box filter operation change with the number of parallel streams? Changing the stream count in the performance table answers this question.
For most backends and algorithms, the average running time of each invocation increases linearly with the number of streams. A notable exception is the PVA backend. Since NVIDIA® Jetson Xavier™ devices have two independent PVA processors, each one with two parallel vector processors, the PVA backend is fully utilized only when four or more parallel VPIStream instances are used. Hence executing one to four parallel PVA algorithm instances won't increase the elapsed running time of each instance.
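The scaling behaviour described above can be illustrated with a rough model. The sketch below assumes an idealized backend with a fixed number of independent execution units, where work beyond that capacity is shared evenly. The function name, the 2.0 ms single-stream time and the unit counts are hypothetical values chosen for illustration, not measurements.

```python
def avg_time_per_invocation(single_stream_ms: float, streams: int, units: int) -> float:
    """Idealized average elapsed time of one invocation with `streams` parallel streams.

    Assumption: the backend has `units` independent execution units and work
    beyond that capacity is shared evenly, so time grows linearly once saturated.
    """
    if streams < 1 or units < 1:
        raise ValueError("streams and units must be >= 1")
    return single_stream_ms * max(1.0, streams / units)


# Hypothetical 2.0 ms single-stream time, for illustration only.
for streams in (1, 2, 4, 8):
    # A backend that a single stream already saturates.
    one_unit = avg_time_per_invocation(2.0, streams, units=1)
    # A backend with four units, e.g. 2 PVA processors x 2 vector processors.
    four_units = avg_time_per_invocation(2.0, streams, units=4)
    print(f"{streams} streams: 1 unit -> {one_unit:.1f} ms, 4 units -> {four_units:.1f} ms")
```

In this model the four-unit backend stays flat up to four streams and only then starts to grow, which matches the PVA behaviour described above. Real measurements will deviate from it because of memory bandwidth and scheduling effects.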
Although VPI can be used in x86 architectures with discrete GPUs, the large number of configurations with different performances makes it difficult to make a useful comparison between them and the different algorithms VPI supports. For this reason, benchmarking is restricted to Jetson devices.
All algorithm elapsed time measurements are shown with their confidence intervals, such as \(212.4 \pm 0.4\) ms or \(0.024 \pm 0.002\) ms. Each interval represents a confidence level of 99.73% (\(\pm 3\sigma\)) that the true elapsed time lies inside it. This assumes that the measurements are drawn from a normal distribution.
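As a minimal sketch of how such an interval could be computed from raw timings: the code below assumes that \(\sigma\) is the standard error of the mean (sample standard deviation divided by \(\sqrt{n}\)). The source does not state which estimator is used, so treat that choice, the function name and the sample values as assumptions.

```python
import math
import statistics


def three_sigma_interval(samples_ms):
    """Return (mean, half_width) so that the interval is mean +/- half_width.

    Assumption: sigma is the standard error of the mean, i.e. the sample
    standard deviation divided by sqrt(n). The source does not confirm this.
    """
    n = len(samples_ms)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples_ms)
    sem = statistics.stdev(samples_ms) / math.sqrt(n)
    return mean, 3.0 * sem


# Hypothetical elapsed times in milliseconds.
mean, half_width = three_sigma_interval([212.1, 212.6, 212.4, 212.5, 212.3, 212.5])
print(f"{mean:.1f} +/- {half_width:.1f} ms")
```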
The confidence intervals must be taken into account when comparing measurements.
Example:
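One simple way to take the intervals into account is to treat two measurements as distinguishable only when their intervals do not overlap. The sketch below uses that rule, which is a conservative simplification rather than a formal statistical test; the function names and the measurement values are hypothetical.

```python
def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """True when [mean_a +/- half_a] and [mean_b +/- half_b] share any value."""
    return abs(mean_a - mean_b) <= half_a + half_b


def compare(mean_a, half_a, mean_b, half_b):
    """Classify two measurements as 'a faster', 'b faster' or 'inconclusive'."""
    if intervals_overlap(mean_a, half_a, mean_b, half_b):
        return "inconclusive"
    return "a faster" if mean_a < mean_b else "b faster"


# Hypothetical measurements in milliseconds.
print(compare(0.024, 0.002, 0.027, 0.002))  # intervals overlap
print(compare(0.024, 0.002, 0.030, 0.002))  # intervals are disjoint
```

With these values the first pair cannot be ranked, since 0.024 ± 0.002 ms and 0.027 ± 0.002 ms share the range from 0.025 to 0.026 ms, while the second pair can.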
To make meaningful elapsed time comparisons across different backends, the corresponding processors must be fully utilized. As an example, consider the Convolution performance table. When comparing CPU and PVA on an NVIDIA® Jetson AGX Xavier™ with one stream, CPU is faster since it has eight cores and each algorithm invocation fully utilizes all of them, whereas PVA is only using one vector processor from one PVA processor, one quarter of the installed PVA capacity. With four parallel streams, the PVA elapsed time increases only slightly, while the CPU elapsed time increases roughly 4x. In this context PVA is faster than CPU.
The conclusion is that the PVA backend scales better than other backends with more parallel streams until the processor is saturated. On top of that, it's also more energy efficient.
The benchmark procedure used to measure the performance numbers is described in detail here. This information helps explain what context the performance numbers refer to.
The Benchmarking sample demonstrates how a simplified version of the above is implemented. It can be used to get a good approximation of the values shown in the benchmark tables of each algorithm.
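For readers who want a feel for what a timing loop of this kind involves, here is a generic sketch. It is not the Benchmarking sample and not the procedure used for the published numbers. It only assumes the common pattern of discarding warm-up calls and averaging over batches; the function name, its parameters and the stand-in workload are hypothetical.

```python
import time


def time_per_call_ms(fn, warmup=5, batches=10, calls_per_batch=20):
    """Return a list with the average elapsed time per call (ms) of each batch."""
    for _ in range(warmup):
        fn()  # discard the first calls (caches, lazy initialization)
    per_batch = []
    for _ in range(batches):
        start = time.perf_counter()
        for _ in range(calls_per_batch):
            fn()
        elapsed = time.perf_counter() - start
        per_batch.append(elapsed * 1000.0 / calls_per_batch)
    return per_batch


# Stand-in workload; replace with the operation under test.
samples = time_per_call_ms(lambda: sum(range(10_000)))
print(f"{len(samples)} batches, fastest {min(samples):.4f} ms per call")
```

When the operation under test is submitted asynchronously, as with work queued on a stream, `fn` should block until that work has completed. Otherwise the loop measures only the submission cost.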
To make the measurements somewhat stable across runs, the device's frequency and power parameters were maxed out prior to benchmarking. This also mimics the situation where the system is under full load, thus making execution time closer to the lower bound. In real applications, depending on the system load, the execution time might be longer due to frequency throttling and other effects.
Jetson devices flashed via the JetPack installer come with utilities that can be used to control clock frequencies and power settings of available processors. In order to simplify the process, the script below will be used to set CPU, GPU, PVA, VIC and EMC (memory controller) clock frequencies to maximum.
Copy the script below to a file named clocks.sh on the device and make it executable with chmod +x clocks.sh.
To maximize the clock frequencies, run:
Once the measurements are done, restore the clock frequencies with:
What follows is a list of all devices used for measurement, along with their frequency and power configurations.