Performance Benchmark

This page explains how VPI performance information is gathered, how to interpret it and how you can reproduce the results.

- Algorithm Performance Tables
- Comparing Algorithm Elapsed Times
- Benchmarking Method
- Clock Frequency and Power Settings

For performance comparisons between VPI and other Computer Vision libraries, please consult the Performance Comparison page.

Most of the algorithms' description pages have a section where the running time of a single call/iteration is shown, along with the parameters used. This information can be helpful to performance-critical applications, allowing the user to evaluate the impact of certain parameter/backend combinations in application performance, such as:

- What NVIDIA® Jetson™ device to use
- What backend to use
- Speed/quality trade-offs
- Effect of running algorithms in parallel streams

In each table, you can select the target Jetson device and the number of parallel algorithm invocations, i.e., the number of parallel VPIStream instances executing the algorithm. This shows how the average running time of each algorithm invocation changes with the number of parallel executions. Say one VPIStream is set up to apply a box filter on an image. This operation alone takes a certain amount of time. Now suppose the program must process several images in parallel, each one in its own VPIStream. How does the average running time of each box filter operation change with the number of parallel streams? Changing the stream count in the performance table answers this question.

For most backends and algorithms, the average running time of each invocation increases linearly with the number of streams. A notable exception is the PVA backend. Since NVIDIA® Jetson Orin™ devices have one PVA processor with two parallel vector processors, the PVA backend is fully utilized only when two or more parallel VPIStream instances are being used. Hence, running two parallel PVA algorithm instances instead of one won't increase the elapsed running time of each instance.

Although VPI can be used in x86 architectures with discrete GPUs, the large number of configurations with different performances makes it difficult to make a useful comparison between them and the different algorithms VPI supports. For this reason, benchmarking is restricted to Jetson devices.

All algorithm elapsed time measurements are shown with their confidence intervals, such as \(212.4 \pm 0.4\) ms or \(0.024 \pm 0.002\) ms. Each interval represents a confidence level of 99.73% (\(\pm 3\sigma\)) that the true elapsed time lies inside it. This assumes that the measurements are drawn from a normal distribution.
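The 99.73% figure follows from the normal-distribution assumption: the probability that a normally distributed value falls within \(k\) standard deviations of its mean is \(\operatorname{erf}(k/\sqrt{2})\). The short check below is only an illustration; the helper name `normal_coverage` is hypothetical and is not part of VPI.

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2.0))


# Coverage of a +/- 3 sigma interval, rounded to four decimal places.
coverage_3_sigma = round(normal_coverage(3.0), 4)
print(coverage_3_sigma)
```

If the measurements are not approximately normal, the actual coverage of a \(\pm 3\sigma\) interval may differ from this value.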

The confidence intervals must be taken into account when comparing measurements.

Example:

- \(110 \pm 5\) ms and \(100 \pm 20\) ms represent similar elapsed times since there's an overlap between the confidence intervals \((105;115)\) and \((80;120)\), respectively. It cannot be said that the first measurement is effectively higher (or slower) than the second.
- \(110 \pm 5\) ms and \(100 \pm 1\) ms represent different elapsed times with a high confidence as there's no overlap between \((105;115)\) and \((99;101)\), respectively. The first measurement can be considered higher than the second.
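The overlap rule used in the two examples above can be written as a small check. This is a minimal sketch, not part of VPI: the function names are hypothetical, and the intervals are treated as open, matching the \((a;b)\) notation above.

```python
from typing import Tuple

Interval = Tuple[float, float]


def interval(mean: float, half_width: float) -> Interval:
    """Build the (low, high) interval for a 'mean +/- half_width' measurement."""
    return (mean - half_width, mean + half_width)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when two open intervals share at least one point."""
    return a[0] < b[1] and b[0] < a[1]


first = interval(110, 5)                             # (105, 115)
print(intervals_overlap(first, interval(100, 20)))   # True: similar times
print(intervals_overlap(first, interval(100, 1)))    # False: first is slower
```

When the check returns `True`, the two measurements should be treated as equivalent rather than ranked.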

To make meaningful elapsed time comparisons across different backends, the corresponding processors must be fully utilized. As an example, consider the Convolution performance table. When comparing CPU and PVA on an NVIDIA® Jetson AGX Orin™ with one stream, CPU is faster since it has eight cores and each algorithm invocation fully utilizes all of them, whereas PVA is using only one vector processor from the PVA processor, i.e., half of the installed PVA capacity. When the number of parallel streams is changed to two, the PVA elapsed time increases just a bit, but the CPU elapsed time increases roughly 2x. In this context PVA is faster than CPU.

The conclusion is that the PVA backend scales better than other backends with more parallel streams until the processor is saturated. On top of that, it's also more energy efficient.
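The scaling behavior described above can be illustrated with a toy model. It is a deliberate simplification under stated assumptions: the per-call time stays flat until every parallel unit of the backend is busy and then grows linearly with the stream count. The timings used here (1.0 ms for CPU, 1.6 ms for PVA) are made-up placeholders, not measured values, and `avg_time_per_call` is a hypothetical helper.

```python
def avg_time_per_call(single_stream_ms: float, streams: int,
                      parallel_units: int) -> float:
    """Toy model: flat until all parallel units are busy, then linear."""
    return single_stream_ms * max(1.0, streams / parallel_units)


# Hypothetical single-stream timings in milliseconds (NOT measured values).
CPU_MS, PVA_MS = 1.0, 1.6

for streams in (1, 2, 4):
    # One CPU invocation already uses every core, so it counts as one unit.
    cpu = avg_time_per_call(CPU_MS, streams, parallel_units=1)
    # The PVA processor has two vector processors, so it counts as two units.
    pva = avg_time_per_call(PVA_MS, streams, parallel_units=2)
    print(f"{streams} stream(s): CPU {cpu:.1f} ms, PVA {pva:.1f} ms")
```

With these placeholder numbers, CPU wins with one stream and PVA wins with two, which mirrors the comparison above. Real measurements also include scheduling overhead, so the PVA time rises slightly instead of staying perfectly flat.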

The benchmark procedure used to measure the performance numbers is described in detail below. This information helps explain what context the performance numbers refer to.

- All payloads, inputs and output memory buffers are created beforehand.
- Memories have only the benchmarked backend enabled, along with the VPI_EXCLUSIVE_STREAM_ACCESS flag set.
- One second warm-up time running the algorithm in a loop.
- Run the algorithm in batches and measure its average running time within each batch. The number of calls in a batch varies with the approximate running time (faster algorithms, larger batch, max 100 calls). This is done to exclude the time spent performing the measurement itself from the algorithm runtime.
- Repeat the batch measurement from the previous step for at least 5 s, making sure that it runs at least 10 times.
- From all average running times of each batch, we exclude the 5% lowest and highest values.
- From the result set, we take the median. This is the value used as final run time for the algorithm.

The Benchmarking sample demonstrates how a simplified version of the above is implemented. It can be used to get a good approximation of the values shown in the benchmark tables of each algorithm.
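The measurement steps listed above (warm-up, batched timing, trimming, median) can also be sketched in plain Python without any VPI calls. This is an illustrative approximation, not the code used to produce the published numbers. The function names, the `target_batch_s` parameter and the rule used to pick the batch size are assumptions, since the exact rule is not given here. `fn` is assumed to block until the work is finished; with an asynchronous API the call would need to include a synchronization.

```python
import statistics
import time
from typing import Callable, List


def trimmed_median(samples: List[float], trim_fraction: float = 0.05) -> float:
    """Drop the lowest and highest trim_fraction of samples, return the median."""
    ordered = sorted(samples)
    k = int(len(ordered) * trim_fraction)
    return statistics.median(ordered[k:len(ordered) - k])


def benchmark(fn: Callable[[], None], warmup_s: float = 1.0,
              min_duration_s: float = 5.0, min_batches: int = 10,
              max_batch: int = 100, target_batch_s: float = 0.01) -> float:
    """Estimate the running time of one fn() call, in seconds."""
    # Warm-up: run fn in a loop and keep a rough per-call time.
    calls = 0
    start = time.perf_counter()
    while True:
        fn()
        calls += 1
        elapsed = time.perf_counter() - start
        if elapsed >= warmup_s:
            break
    approx = elapsed / calls

    # Assumed rule: faster calls get larger batches, capped at max_batch.
    batch = max_batch
    if approx > 0:
        batch = max(1, min(max_batch, int(target_batch_s / approx)))

    # Time whole batches so the timer overhead is amortized over many calls.
    batch_means: List[float] = []
    start = time.perf_counter()
    while (len(batch_means) < min_batches
           or time.perf_counter() - start < min_duration_s):
        t0 = time.perf_counter()
        for _ in range(batch):
            fn()
        batch_means.append((time.perf_counter() - t0) / batch)

    return trimmed_median(batch_means)
```

For example, `benchmark(lambda: sum(range(1000)))` returns the estimated time of one call using the default one-second warm-up and five-second measurement window.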

To make the measurements somewhat stable across runs, the device's frequency and power parameters were maxed out prior to benchmarking. This also mimics the situation where the system is under full load, thus making execution time closer to the lower bound. In real applications, depending on the system load, the execution time might be longer due to frequency throttling and other effects.

Jetson devices flashed via the JetPack installer come with utilities that can be used to control clock frequencies and power settings of available processors. In order to simplify the process, the script below will be used to set CPU, GPU, PVA, VIC and EMC (memory controller) clock frequencies to maximum.

Copy the script below to a file named `clocks.sh` in the device and set its attribute for execution with `chmod +x clocks.sh`.

To maximize the clock frequencies, run:

sudo ./clocks.sh --max

Once the measurements are done, restore the clock frequencies with:

sudo ./clocks.sh --restore
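When benchmarks are driven from a Python script, the two commands above can be wrapped so that the clocks are restored even if the measurement fails. The sketch below is an assumption-laden convenience, not part of VPI or JetPack: `maxed_clocks` is a hypothetical helper, and it only invokes the `clocks.sh` commands shown above.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def maxed_clocks(script: str = "./clocks.sh", run=subprocess.run):
    """Maximize clocks on entry and always restore them on exit."""
    run(["sudo", script, "--max"], check=True)
    try:
        yield
    finally:
        # Runs even when the benchmarked code raises an exception.
        run(["sudo", script, "--restore"], check=True)


# Usage (requires sudo rights and clocks.sh in the current directory):
#
#     with maxed_clocks():
#         run_my_benchmarks()   # hypothetical user function
```

The `run` parameter exists only so that the command runner can be replaced, for example in a dry run that prints the commands instead of executing them.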

What follows is a list of all devices used for measurement, along with their frequency and power configurations.

- CPU: 12x ARMv8 Processor rev 1 (v8l) running at 2.255 GHz
- EMC freq.: 3.1990 GHz
- GPU freq.: 1.301 GHz
- PVA/VPS freq.: 1.370 GHz
- PVA/AXI freq.: 985.600 MHz
- VIC freq.: 729.600 MHz
- OFA freq.: 780.800 MHz
- Power mode: MAXN
- Fan speed: MAX