This page explains how VPI performance information is gathered, how to interpret it, and how to reproduce the results.

Algorithm Performance Tables

Most algorithm description pages have a section that shows the running time of a single call/iteration, along with the parameters used. This information can be helpful for performance-critical applications, allowing the user to evaluate the impact of certain parameter/backend combinations on application performance, such as:

What NVIDIA® Jetson™ device to use
What backend to use
Speed/quality trade-offs
Effect of running algorithms in parallel streams

In each table, the target Jetson device and the number of parallel algorithm invocations can be selected, i.e., the number of parallel VPIStream instances executing the algorithm. This allows analysis of how the average running time of each algorithm invocation changes with the number of parallel algorithm executions. Say one VPIStream is set up to apply a box filter on an image. This operation alone takes a certain amount of time. Now suppose the program must process several images in parallel, each one in its own VPIStream. How does the average running time of each box filter operation change with the number of parallel streams? Changing the stream count in the performance table answers this question.

For most backends and algorithms, the average running time of each invocation increases linearly with the number of streams. A notable exception is the PVA backend. Since NVIDIA® Jetson Thor™ devices have one independent PVA processor with two parallel vector processors, the PVA backend is fully utilized only when two or more parallel VPIStream instances are in use. Hence, executing one or two parallel PVA algorithm instances won't increase the elapsed running time of each instance.
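The scaling behavior described above can be sketched with a deliberately idealized model. This is not how VPI schedules work; it only assumes that a backend has a fixed number of independent processing units and that time per invocation grows linearly once all of them are busy. The function name, the `parallel_units` parameter and the 10 ms figure are hypothetical.

```python
def avg_time_per_invocation(single_stream_ms: float,
                            streams: int,
                            parallel_units: int = 1) -> float:
    """Idealized average elapsed time of one invocation (hypothetical model).

    Assumes `parallel_units` independent processing units, each serving one
    invocation at a time. While streams <= parallel_units the time stays
    flat; beyond that it grows linearly with the number of streams.
    """
    if streams < 1 or parallel_units < 1:
        raise ValueError("streams and parallel_units must be >= 1")
    return single_stream_ms * max(1.0, streams / parallel_units)


# Illustrative numbers only (10 ms is not a measured value).
for n in (1, 2, 4):
    one_unit = avg_time_per_invocation(10.0, n, parallel_units=1)
    two_units = avg_time_per_invocation(10.0, n, parallel_units=2)
    print(f"{n} stream(s): 1 unit -> {one_unit} ms, 2 units -> {two_units} ms")
```

With two units, which mirrors the two PVA vector processors, the modeled time stays flat for one and two streams and only starts growing with the third. Real measurements will deviate from this model because of shared memory bandwidth and scheduling overhead.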

Although VPI can be used on x86 architectures with discrete GPUs, the large number of configurations with different performance characteristics makes it difficult to draw a useful comparison between them and across the different algorithms VPI supports. For this reason, benchmarking is restricted to Jetson devices.

Comparing Algorithm Elapsed Times

All algorithm elapsed time measurements are shown with their confidence intervals, such as \(212.4 \pm 0.4\) ms or \(0.024 \pm 0.002\) ms. Each interval represents a confidence level of 99.73% (\(\pm 3\sigma\)) that the true elapsed time lies inside it. This assumes that the measurements are drawn from a normal distribution.
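The 99.73% figure follows from the normal distribution: the probability that a normally distributed value lies within \(k\) standard deviations of the mean is \(\operatorname{erf}(k/\sqrt{2})\). A minimal check, using only the Python standard library (the function name is illustrative):

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k sigma of its mean."""
    return math.erf(k / math.sqrt(2.0))


print(f"{normal_coverage(3.0):.4%}")  # prints 99.7300%
```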

The confidence intervals must be taken into account when comparing measurements.

Example:

\(110 \pm 5\) ms and \(100 \pm 20\) ms represent similar elapsed times since there's an overlap between the confidence intervals \((105;115)\) and \((80;120)\), respectively. It cannot be said that the first measurement is effectively higher (or slower) than the second.
\(110 \pm 5\) ms and \(100 \pm 1\) ms represent different elapsed times with a high confidence as there's no overlap between \((105;115)\) and \((99;101)\), respectively. The first measurement can be considered higher than the second.
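The comparison rule in the two examples above can be sketched as a small helper. The function names are hypothetical; the inputs are the mean and the \(\pm\) half-width as shown in the performance tables.

```python
from typing import Tuple

Interval = Tuple[float, float]


def interval(mean_ms: float, half_width_ms: float) -> Interval:
    """Turn 'mean ± half_width' into a (low, high) confidence interval."""
    return (mean_ms - half_width_ms, mean_ms + half_width_ms)


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True when the two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 110 ± 5 ms vs. 100 ± 20 ms: intervals overlap, so the times are similar.
print(intervals_overlap(interval(110, 5), interval(100, 20)))  # True

# 110 ± 5 ms vs. 100 ± 1 ms: no overlap, so the first is slower.
print(intervals_overlap(interval(110, 5), interval(100, 1)))   # False
```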

To make a meaningful elapsed time comparison across different backends, the corresponding processors must be fully utilized. As an example, consider the Convolution performance table. When comparing CPU and PVA on an NVIDIA® Jetson AGX Thor™ with one stream, CPU is faster since it has 14 cores and each algorithm invocation fully utilizes all of them, whereas PVA uses only one vector processor of the PVA processor, 1/2 of the installed PVA capacity. When the number of parallel streams is changed to two, the PVA elapsed time increases just a bit, but the CPU elapsed time increases roughly 2x. In this context PVA is faster than CPU.

The conclusion is that the PVA backend scales better than other backends with more parallel streams until the processor is saturated. On top of that, it's also more energy efficient.

Benchmarking Method

The benchmark procedure used to measure the performance numbers is described in detail below. This information clarifies the context that the performance numbers refer to.

All payloads, inputs and output memory buffers are created beforehand.
Memory buffers have only the benchmarked backend enabled, with the VPI_EXCLUSIVE_STREAM_ACCESS flag set.
One second warm-up time running the algorithm in a loop.
Run the algorithm in batches and measure its average running time within each batch. The number of calls in a batch varies with the approximate running time (faster algorithms, larger batch, max 100 calls). This is done to exclude the time spent performing the measurement itself from the algorithm runtime.
Repeat the previous step (the batch measurement) for at least 5 s, making sure that it runs at least 10 times.
From all average running times of each batch, we exclude the 5% lowest and highest values.
From the result set, we take the median. This is the value used as final run time for the algorithm.
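The measurement steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code used to produce the tables. In particular, the document does not give the batch sizing rule, so the `target_batch_s` parameter is hypothetical. `run_once` stands for any callable that submits the algorithm and blocks until it has finished.

```python
import statistics
import time
from typing import Callable, Sequence


def trimmed_median(values: Sequence[float], trim_fraction: float = 0.05) -> float:
    """Drop the lowest and highest `trim_fraction` of values, return the median."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim_fraction)
    return statistics.median(ordered[cut:len(ordered) - cut])


def benchmark(run_once: Callable[[], None],
              warmup_s: float = 1.0,
              min_total_s: float = 5.0,
              min_batches: int = 10,
              max_batch: int = 100,
              target_batch_s: float = 0.01) -> float:
    """Return the trimmed median of per-batch average run times, in seconds."""
    # Warm-up: run in a loop and estimate the time of a single call.
    calls = 0
    start = time.perf_counter()
    while True:
        run_once()
        calls += 1
        if time.perf_counter() - start >= warmup_s:
            break
    approx = (time.perf_counter() - start) / calls

    # Faster algorithms get larger batches, capped at max_batch calls.
    # ASSUMPTION: sizing by a target batch duration is a guess at the rule.
    batch = max_batch if approx <= 0 else max(1, min(max_batch, int(target_batch_s / approx)))

    # Measure whole batches so timer overhead is amortized over many calls.
    averages = []
    t0 = time.perf_counter()
    while len(averages) < min_batches or time.perf_counter() - t0 < min_total_s:
        b0 = time.perf_counter()
        for _ in range(batch):
            run_once()
        averages.append((time.perf_counter() - b0) / batch)

    return trimmed_median(averages)


# Usage (hypothetical workload):
#   seconds = benchmark(lambda: sum(range(1000)))
```

Timing whole batches rather than single calls keeps the cost of reading the clock small relative to the measured work, which matters most for algorithms that finish in microseconds.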

The Benchmarking sample demonstrates how a simplified version of the above is implemented. It can be used to get a good approximation of the values shown in the benchmark tables of each algorithm.

Clock Frequency and Power Settings

To make the measurements somewhat stable across runs, the device's frequency and power parameters were maxed out prior to benchmarking. This also mimics the situation where the system is under full load, thus making execution time closer to the lower bound. In real applications, depending on the system load, the execution time might be longer due to frequency throttling and other effects.

Jetson devices flashed via the JetPack installer come with utilities that can be used to control clock frequencies and power settings of available processors. In order to simplify the process, the script below will be used to set CPU, GPU, PVA, VIC and EMC (memory controller) clock frequencies to maximum.

Copy the script below to a file named clocks.sh on the device and make it executable with chmod +x clocks.sh.

 #!/bin/bash -e
  
 if [ "$(whoami)" != root ]; then
     echo Error: Run this script as a root user
     exit 1
 fi
  
 clkfile=/tmp/defclocks.conf
 pwrfile=/tmp/defpower.conf
  
 # Detect the VIC sysfs node (Orin path); VIC handling is skipped if it is absent
 if [ -e /sys/devices/platform/bus@0/13e00000.host1x/15340000.vic ]; then
     vicctrl=/sys/devices/platform/bus@0/13e00000.host1x/15340000.vic
     vicfreqctrl=$vicctrl/devfreq/15340000.vic
 fi
  
 maxclocks()
 {
     if [ ! -e $clkfile ]; then
         jetson_clocks --store $clkfile
         if [ -n "$vicctrl" ]; then
             echo "$vicfreqctrl/governor:$(cat $vicfreqctrl/governor)" >> $clkfile
             echo "$vicfreqctrl/max_freq:$(cat $vicfreqctrl/max_freq)" >> $clkfile
             echo "$vicctrl/power/control:$(cat $vicctrl/power/control)" >> $clkfile
         fi
     fi
  
     if [ ! -e $pwrfile ]; then
         echo $(nvpmodel -q | tail -n1) > $pwrfile
     fi
  
     nvpmodel -m 0
  
     jetson_clocks --fan
     jetson_clocks
  
     if [ -n "$vicctrl" ]; then
         echo on > $vicctrl/power/control
         echo userspace > $vicfreqctrl/governor
         sleep 1
         maxfreq=$(cat $vicfreqctrl/available_frequencies | rev | cut -f1 -d' ' | rev)
         echo $maxfreq > $vicfreqctrl/max_freq
         echo $maxfreq > $vicfreqctrl/userspace/set_freq
     fi
 }
  
 restore()
 {
     if [ -e $clkfile ]; then
         jetson_clocks --restore $clkfile > /dev/null 2>&1
     fi
  
     if [ -e $pwrfile ]; then
         nvpmodel -m $(cat $pwrfile)
     fi
 }
  
 action="$1"
  
 case "$action" in
     --restore)
         restore
         ;;
     --max)
         maxclocks
         ;;
     *)
         echo "Unknown option '$action'."
         echo "Usage: $(basename $0) <--max|--restore>"
         exit 1
         ;;
 esac

To maximize the clock frequencies, run:

sudo ./clocks.sh --max

Once the measurements are done, restore the clock frequencies with:

sudo ./clocks.sh --restore
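When measurements are driven from a script, the two commands above can be paired so that the clocks are restored even if the benchmark fails. The following is a minimal sketch, assuming clocks.sh sits in the current directory and sudo is available. The `maxed_clocks` helper and its `run` parameter are hypothetical and exist only for illustration.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def maxed_clocks(script: str = "./clocks.sh", run=subprocess.run):
    """Maximize clocks on entry and restore them on exit, even on error."""
    run(["sudo", script, "--max"], check=True)
    try:
        yield
    finally:
        # Runs whether the body succeeded or raised an exception.
        run(["sudo", script, "--restore"], check=True)


# Usage:
#   with maxed_clocks():
#       ...  # run the measurements here
```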

What follows is a list of all devices used for measurement, along with their frequency and power configurations.

NVIDIA® Jetson AGX Thor™

CPU: 14-core Arm® Neoverse®-V3AE 64-bit CPU running at 2.7 GHz
EMC freq.: 4.3 GHz
GPU freq.: 1.6 GHz
PVA/VPS freq.: 1.2 GHz
PVA/AXI freq.: 909 MHz
VIC freq.: 1.1 GHz
OFA (gpu-nvd-0): 1.7 GHz
Power mode: MAXN
Fan speed: MAX

VPI - Vision Programming Interface

4.0 Release
