VPI - Vision Programming Interface

1.2 Release

Performance Benchmark

This page explains how VPI performance information is gathered, how to interpret it, and how to reproduce the results.

For performance comparisons between VPI and other computer vision libraries, please consult the Performance Comparison page.

Algorithm Performance Tables

Most algorithm description pages include a section showing the running time of a single call/iteration, along with the parameters used. This information helps performance-critical applications, allowing the user to evaluate the impact of particular parameter/backend combinations on application performance and to make decisions such as:

  • Which NVIDIA® Jetson™ device to use
  • Which backend to use
  • What speed/quality trade-offs to accept
  • Whether to run algorithms in parallel streams

In each table, the target Jetson device and the number of parallel algorithm invocations can be selected, i.e., the number of parallel VPIStream instances executing the algorithm. This allows analyzing how the average running time of each algorithm invocation changes with the number of parallel executions. Say one VPIStream is set up to apply a box filter to an image. This operation alone takes a certain amount of time. Now suppose the program must process several images in parallel, each one in its own VPIStream. How does the average running time of each box filter operation change with the number of parallel streams? This question is answered by changing the stream count in the performance table.

For most backends and algorithms, the average running time of each invocation increases linearly with the number of streams. A notable exception is the PVA backend. Since NVIDIA® Jetson Xavier™ devices have two independent PVA processors, each with two parallel vector processors, the PVA backend is fully utilized only when four or more parallel VPIStream instances are in use. Hence, running up to four parallel PVA algorithm instances doesn't increase the elapsed running time of each instance.

Although VPI can be used on x86 architectures with discrete GPUs, the large number of possible configurations, each with different performance characteristics, makes a useful comparison between them and the different algorithms VPI supports difficult. For this reason, benchmarking is restricted to Jetson devices.

Comparing Algorithm Elapsed Times

All algorithm elapsed time measurements are shown with their confidence intervals, such as \(212.4 \pm 0.4\) ms or \(0.024 \pm 0.002\) ms. These intervals represent a confidence level of 99.73% ( \(\pm 3\sigma\)) that the true elapsed time lies inside it. This assumes that the measurements are drawn from a normal distribution.

The confidence intervals must be taken into account when comparing measurements.

Example:

  • \(110 \pm 5\) ms and \(100 \pm 20\) ms represent similar elapsed times, since the confidence intervals \((105;115)\) and \((80;120)\) overlap. It cannot be said with confidence that the first measurement is higher (i.e., slower) than the second.
  • \(110 \pm 5\) ms and \(100 \pm 1\) ms represent different elapsed times with high confidence, as \((105;115)\) and \((99;101)\) do not overlap. The first measurement can be considered higher than the second.

In order to make a meaningful elapsed time comparison across different backends, the corresponding processors must be fully utilized. As an example, consider the Convolution performance table. When comparing CPU and PVA on an NVIDIA® Jetson AGX Xavier™ with one stream, the CPU is faster since it has eight cores and each algorithm invocation fully utilizes all of them, whereas PVA uses only one vector processor of one PVA processor, i.e., 1/4th of the installed PVA capacity. When the number of parallel streams is increased to four, the PVA elapsed time increases only slightly, whereas the CPU elapsed time increases roughly 4x. In this context, PVA is faster than CPU.

The conclusion is that the PVA backend scales better than other backends with more parallel streams until the processor is saturated. On top of that, it's also more energy efficient.

Benchmarking Method

The benchmark procedure used to measure the performance numbers is described in detail below. It explains the context in which the performance numbers were gathered.

  1. All payloads and input/output memory buffers are created beforehand.
  2. Memory buffers are created with only the benchmarked backend enabled, along with the VPI_EXCLUSIVE_STREAM_ACCESS flag.
  3. The algorithm is run in a loop for a one-second warm-up period.
  4. The algorithm is run in batches, and its average running time within each batch is measured. The number of calls in a batch varies with the approximate running time (the faster the algorithm, the larger the batch, up to 100 calls). This excludes the time spent performing the measurement itself from the algorithm runtime.
  5. Step 4 is repeated for at least 5 seconds, ensuring that at least 10 batches are measured.
  6. From the resulting batch averages, the 5% lowest and 5% highest values are excluded.
  7. The median of the remaining values is taken as the algorithm's final run time.

The Benchmarking sample demonstrates how a simplified version of the above is implemented. It can be used to get a good approximation of the values shown in the benchmark tables of each algorithm.

Clock Frequency and Power Settings

To make the measurements somewhat stable across runs, the device's frequency and power parameters were maxed out prior to benchmarking. This also mimics the situation where the system is under full load, thus making execution time closer to the lower bound. In real applications, depending on the system load, the execution time might be longer due to frequency throttling and other effects.

Jetson devices flashed via the JetPack installer come with utilities that can be used to control clock frequencies and power settings of available processors. In order to simplify the process, the script below will be used to set CPU, GPU, PVA, VIC and EMC (memory controller) clock frequencies to maximum.

Copy the script below to a file named clocks.sh in the device and set its attribute for execution with chmod +x clocks.sh.

#!/bin/bash -e

if [ $(whoami) != root ]; then
    echo Error: Run this script as a root user
    exit 1
fi

clkfile=/tmp/defclocks.conf
pwrfile=/tmp/defpower.conf

# Nano?
if [ -e /sys/devices/50000000.host1x/54340000.vic ]; then
    vicctrl=/sys/devices/50000000.host1x/54340000.vic
    vicfreqctrl=$vicctrl/devfreq/54340000.vic
# Others?
elif [ -e /sys/devices/13e10000.host1x/15340000.vic ]; then
    vicctrl=/sys/devices/13e10000.host1x/15340000.vic
    vicfreqctrl=$vicctrl/devfreq/15340000.vic
fi

maxclocks()
{
    if [ ! -e $clkfile ]; then
        jetson_clocks --store $clkfile
        if [ -n "$vicctrl" ]; then
            echo "$vicfreqctrl/governor:$(cat $vicfreqctrl/governor)" >> $clkfile
            echo "$vicfreqctrl/max_freq:$(cat $vicfreqctrl/max_freq)" >> $clkfile
            echo "$vicctrl/power/control:$(cat $vicctrl/power/control)" >> $clkfile
        fi
    fi

    if [ ! -e $pwrfile ]; then
        echo $(nvpmodel -q | tail -n1) > $pwrfile
    fi

    nvpmodel -m 0

    jetson_clocks --fan
    jetson_clocks

    if [ -n "$vicctrl" ]; then
        echo on > $vicctrl/power/control
        echo userspace > $vicfreqctrl/governor
        sleep 1
        maxfreq=$(cat $vicfreqctrl/available_frequencies | rev | cut -f1 -d' ' | rev)
        echo $maxfreq > $vicfreqctrl/max_freq
        echo $maxfreq > $vicfreqctrl/userspace/set_freq
    fi
}

restore()
{
    if [ -e $clkfile ]; then
        jetson_clocks --restore $clkfile > /dev/null 2>&1
    fi

    if [ -e $pwrfile ]; then
        nvpmodel -m $(cat $pwrfile)
    fi
}

action="$1"

case "$action" in
    --restore)
        restore
        ;;
    --max)
        maxclocks
        ;;
    *)
        echo "Unknown option '$action'."
        echo "Usage: $(basename $0) <--max|--restore>"
        exit 1
        ;;
esac

To maximize the clock frequencies, run:

sudo ./clocks.sh --max

Once the measurements are done, restore the clock frequencies with:

sudo ./clocks.sh --restore

What follows is a list of all devices used for measurement, along with their frequency and power configurations.

NVIDIA® Jetson AGX Xavier™

  • CPU: 8x ARMv8 Processor rev 0 (v8l) running at 2.2656 GHz
  • EMC freq.: 2.133 GHz
  • GPU freq.: 1.377 GHz
  • PVA/VPS freq.: 1.088 GHz
  • PVA/CORE freq.: 844.8 MHz
  • VIC freq.: 1.0368 GHz
  • Power mode: MAXN
  • Fan speed: MAX

NVIDIA® Jetson™ TX2

  • CPU: 6x ARMv8 Processor rev 3 (v8l) running at 2.0352 GHz
  • EMC freq.: 1.866 GHz
  • GPU freq.: 1.3005 GHz
  • VIC freq.: 1.024 GHz
  • Power mode: MAXN
  • Fan speed: MAX

NVIDIA® Jetson Nano™

  • CPU: 4x ARMv8 Processor rev 1 (v8l) running at 1.479 GHz
  • EMC freq.: 1.6 GHz
  • GPU freq.: 921.6 MHz
  • VIC freq.: 627.2 MHz
  • Power mode: MAXN
  • Fan speed: MAX