Basic System Health Checks#
To confirm that your system is healthy and correctly configured, check the compute performance and memory bandwidth with the following simple benchmarks.
STREAM#
Use the STREAM benchmark to check LPDDR5X memory bandwidth. The following commands download and compile STREAM with a total memory footprint of approximately 2.7 GB, which is sufficient to exceed the L3 cache without excessive runtime.
Note
We recommend GCC version 12.3 or later.
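You can confirm the installed compiler version before building, for example:
$ gcc --version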
$ wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
$ gcc -Ofast -mcpu=neoverse-v2 -fopenmp \
-DSTREAM_ARRAY_SIZE=120000000 -DNTIMES=200 \
-o stream_openmp.exe stream.c
To run STREAM, set the number of OpenMP threads (OMP_NUM_THREADS) as shown in the following example. To distribute the threads evenly over all available cores and maximize bandwidth, use OMP_PROC_BIND=spread.
$ OMP_NUM_THREADS={THREADS} OMP_PROC_BIND=spread ./stream_openmp.exe
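If you are unsure how many threads to use, the following minimal sketch derives the thread count from the core count; on Grace and Grace-Hopper systems, nproc matches the recommended values in the table below:
$ export OMP_NUM_THREADS=$(nproc)
$ export OMP_PROC_BIND=spread
$ ./stream_openmp.exe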
Achievable system bandwidth depends on the memory capacity. Locate your system’s memory capacity in the following table and use the given parameters to reach the expected STREAM TRIAD score. For example, when running on a Grace-Hopper Superchip with a memory capacity of 120 GB, this command should score at least 450 GB/s in STREAM TRIAD:
$ OMP_NUM_THREADS=72 OMP_PROC_BIND=spread ./stream_openmp.exe
Similarly, the following command will score at least 900 GB/s in STREAM TRIAD on a Grace CPU Superchip with a memory capacity of 240 GB:
$ OMP_NUM_THREADS=144 OMP_PROC_BIND=spread numactl -m0,1 ./stream_openmp.exe
| Superchip | Capacity (GB) | OMP_NUM_THREADS | Expected TRIAD (GB/s) |
|---|---|---|---|
| Grace-Hopper | 120 | 72 | 450+ |
| Grace-Hopper | 480 | 72 | 340+ |
| Grace CPU | 240 | 144 | 900+ |
| Grace CPU | 480 | 144 | 900+ |
| Grace CPU | 960 | 144 | 680+ |
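To pull the TRIAD rate out of the report and compare it against the expected value for your configuration, you can filter the benchmark output. This is a minimal sketch that assumes the 120 GB Grace-Hopper threshold of 450 GB/s (STREAM reports rates in MB/s, so the threshold is 450000):
$ OMP_NUM_THREADS=72 OMP_PROC_BIND=spread ./stream_openmp.exe | \
    awk '/^Triad:/ { if ($2 >= 450000) print "TRIAD OK:", $2, "MB/s"; else print "TRIAD LOW:", $2, "MB/s" }'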
The following is example output from a healthy system:
$ OMP_NUM_THREADS=72 OMP_PROC_BIND=spread numactl -m0,1 ./stream_openmp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 120000000 (elements), Offset = 0 (elements)
Memory per array = 915.5 MiB (= 0.9 GiB).
Total memory required = 2746.6 MiB (= 2.7 GiB).
Each kernel will be executed 200 times.
The *best* time for each kernel (excluding the first iteration) will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 72
Number of Threads counted = 72
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2927 microseconds.
(= 2927 clock ticks)
Increase the size of the arrays if this shows that you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline. For best results, please be sure you know the precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 919194.6 0.002149 0.002089 0.002228
Scale: 913460.0 0.002137 0.002102 0.002192
Add: 916926.9 0.003183 0.003141 0.003343
Triad: 903687.9 0.003223 0.003187 0.003308
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Fused Multiply Add#
NVIDIA provides an open source suite of benchmarking microkernels for Arm CPUs. To allow precise counts of instructions and exercise specific functional units, these kernels are written in assembly language. To measure the peak floating point capability of a core and check the CPU clock speed, use a Fused Multiply Add (FMA) kernel.
To measure the achievable peak performance of a core, the fp64_sve_pred_fmla kernel executes a known number of SVE predicated fused multiply-add (FMLA) operations. When combined with the perf tool, you can measure both the performance and the core clock speed.
$ git clone https://github.com/NVIDIA/arm-kernels.git
$ cd arm-kernels
$ make
$ perf stat ./arithmetic/fp64_sve_pred_fmla.x
The benchmark score is reported in giga-operations per second (Gop/sec) near the top of the benchmark output. Grace can perform 16 FP64 FMA operations per cycle, so a Grace CPU with a nominal frequency of 3.3 GHz will report between 52 and 53 Gop/sec (16 operations per cycle × 3.3 GHz ≈ 52.8 Gop/sec). The CPU frequency is reported in the perf output on the cycles line, after the # symbol.
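Because the kernel prints its score to stdout and perf prints the counter summary to stderr, you can merge the two streams and filter for just the score and the measured clock, for example:
$ perf stat ./arithmetic/fp64_sve_pred_fmla.x 2>&1 | grep -E 'GOps/sec|GHz'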
Here is an example of the fp64_sve_pred_fmla.x execution output:
$ perf stat ./arithmetic/fp64_sve_pred_fmla.x
4( 16(SVE_FMLA_64b) );
Iterations;100000000
Total Inst;6400000000
Total Ops;25600000000
Inst/Iter;64
Ops/Iter;256
Seconds;0.481267
GOps/sec;53.1929
Performance counter stats for './arithmetic/fp64_sve_pred_fmla.x':
482.25 msec task-clock # 0.996 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
65 page-faults # 134.786 /sec
1,607,949,685 cycles # 3.334 GHz
6,704,065,953 instructions # 4.17 insn per cycle
<not supported> branches
18,383 branch-misses # 0.00% of all branches
0.484136320 seconds time elapsed
0.482678000 seconds user
0.000000000 seconds sys
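To check that every core reaches a similar score and clock (for example, to spot a throttling core), you can run one pinned instance of the kernel per core. This is a minimal sketch that assumes a 72-core Grace-Hopper module; adjust the core range for other SKUs:
# Run one pinned copy of the FMA kernel on each core and label each score
for c in $(seq 0 71); do
    ( taskset -c "$c" ./arithmetic/fp64_sve_pred_fmla.x | grep GOps/sec | sed "s/^/core $c: /" ) &
done
wait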
C2C CPU-GPU Bandwidth#
NVIDIA provides an open-source benchmark, similar to STREAM, that is designed to test the bandwidth between various memory units on the system. This can be used to test the bandwidth provided by NVLink C2C between the CPU and GPU of a Grace Hopper Superchip.
Download, build, and run nvbandwidth:
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth
# You may need to update the CUDA version in the image tag
docker run -it --rm --gpus all -v $(pwd):/nvbandwidth \
    nvidia/cuda:12.2.0-devel-ubuntu22.04
# Within the container:
cd /nvbandwidth
apt update
apt install libboost-program-options-dev
./debian_install.sh
./nvbandwidth -t 0
# next test
./nvbandwidth -t 1
# all tests can be listed with ./nvbandwidth -l
Here is the output from the previous two commands on a sample system:
Note
Bandwidth numbers depend on specific Grace Hopper SKUs and are also influenced by factors such as IOMMU settings, GPU clock settings, and other system-specific parameters. These factors should be carefully considered during any bandwidth benchmarking activity.
# ./nvbandwidth -t 0
nvbandwidth Version: v0.2
Built from Git version:
NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.
CUDA Runtime Version: 12020
CUDA Driver Version: 12020
Driver Version: 535.82
Device 0: GH200 120GB
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0
0 416.34
SUM host_to_device_memcpy_ce 416.34
# ./nvbandwidth -t 1
nvbandwidth Version: v0.2
Built from Git version:
NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal
peak bandwidth.
CUDA Runtime Version: 12020
CUDA Driver Version: 12020
Driver Version: 535.82
Device 0: GH200 120GB
Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0
0 295.47
SUM device_to_host_memcpy_ce 295.47
For memory copies that use the CUDA copy engines (CEs), systems with 120 GB or 240 GB of LPDDR5X memory should report values similar to those shown above. Systems with 480 GB of LPDDR5X memory might show lower host-to-device bandwidth than the first test output; on a healthy system it should be approximately 350-360 GB/s. Device-to-host bandwidth on 480 GB systems should be similar to the second test output, except on Grace-Hopper x4 systems, where it should be approximately 170 GB/s because more CEs are reserved to saturate NVLink bandwidth between GPUs. To run bandwidth tests that use the GPU’s streaming multiprocessors (SMs), run ./nvbandwidth -l for the exact test numbers; the achieved bandwidth should be at least as high as the CE-based results.
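For example, after listing the available tests, the SM-based host-to-device and device-to-host copies can be selected by name (test names and indices can differ between nvbandwidth versions, so confirm them against the -l output first):
./nvbandwidth -l
./nvbandwidth -t host_to_device_memcpy_sm
./nvbandwidth -t device_to_host_memcpy_sm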