NVIDIA HPCG Benchmark#

x86 package folder structure#

Use the hpcg.sh script in the root directory of the package to invoke the xhpcg executable.

The NVIDIA HPCG benchmark folder ./hpcg-linux-x86_64 contains:

  • xhpcg executable

  • Samples of Slurm batch-job scripts in sample-slurm directory

  • Samples of input files in sample-dat directory

  • README, RUNNING, and TUNING guides

aarch64 package folder structure#

Use the hpcg-aarch64.sh script in the root directory of the package to invoke the appropriate executable: xhpcg for NVIDIA Grace Hopper and NVIDIA Grace Blackwell, or xhpcg-cpu for NVIDIA Grace CPU.

The NVIDIA HPCG Benchmark folder ./hpcg-linux-aarch64 contains:

  • xhpcg executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell

  • xhpcg-cpu executable for NVIDIA Grace CPU

  • Samples of Slurm batch-job scripts in sample-slurm directory

  • Sample input file in sample-dat directory

  • README, RUNNING, and TUNING guides

Running the NVIDIA HPCG Benchmarks on x86_64 with NVIDIA GPUs#

The NVIDIA HPCG Benchmark uses the same input format as the standard HPCG Benchmark. Please refer to the HPCG Benchmark documentation to get started with the HPCG software concepts and best practices.

The script hpcg.sh can be invoked on a command line or through a Slurm batch-script to launch the NVIDIA HPCG Benchmarks.

The script hpcg.sh accepts the following parameters:

  • --dat <string> path to hpcg.dat (Required)

  • --cuda-compat manually enable CUDA forward compatibility (Optional)

  • --no-multinode enable flags for no-multinode (no-network) execution (Optional)

  • --gpu-affinity <string> colon separated list of gpu indices (Optional)

  • --cpu-affinity <string> colon separated list of cpu index ranges (Optional)

  • --mem-affinity <string> colon separated list of memory indices (Optional)

  • --ucx-affinity <string> colon separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport to use (Optional)

  • --exec-name <string> HPCG executable file (Optional)
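
For illustration, a minimal single-node launch that uses an input file might look like the following sketch (the task count, input-file path, and GPU indices are assumptions to adapt to your system):

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpcg.sh --dat ./sample-dat/hpcg.dat --gpu-affinity 0:1:2:3:4:5:6:7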

In addition, instead of an input file, the script hpcg.sh accepts the following parameters (an example that combines them follows the list):

  • --nx specifies the local (to an MPI process) X dimension of the problem

  • --ny specifies the local (to an MPI process) Y dimension of the problem

  • --nz specifies the local (to an MPI process) Z dimension of the problem

  • --rt specifies how long, in seconds, the timed portion of the benchmark should run

  • --b activates benchmarking mode to bypass CPU reference execution when set to one (--b 1)

  • --l2cmp activates compression in the GPU L2 cache when set to one (--l2cmp 1)

  • --of activates writing the log to text files instead of the console (--of=1)

  • --gss specifies the slice size for the GPU rank (default is 2048)

  • --p2p specifies the P2P communication mode: 0 MPI_CPU, 1 MPI_CPU_All2allv, 2 MPI_CUDA_AWARE, 3 MPI_CUDA_AWARE_All2allv, 4 NCCL. Default is MPI_CPU

  • --npx specifies the process grid X dimension of the problem

  • --npy specifies the process grid Y dimension of the problem

  • --npz specifies the process grid Z dimension of the problem
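
For illustration, these parameters could be combined in a single-node run as follows (a sketch only; the problem size, run time, and per-node task count are assumptions rather than tuned values):

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 1800 --b 1 --l2cmp 1 --of 1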

Notes:

  • Affinity example (a complete single-node launch command using these values is sketched after these notes):

    • DGX-H100 and DGX-B200:

      --mem-affinity 0:0:0:0:1:1:1:1 --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111
      
    • DGX-A100:

      --mem-affinity 2:3:0:1:6:7:4:5 --cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
      
  • The NVIDIA HPCG benchmark requires one GPU per MPI process. Therefore, ensure that the number of MPI processes is set to match the number of available GPUs in the cluster.

  • Multi-Instance GPU (MIG) technology can help improve HPCG benchmark performance on NVIDIA Blackwell GPUs with a dual-die design.

    • Use MIG to partition the Blackwell GPU into two instances. These two instances can be treated as stand-alone GPUs.

    • The following example shows a template script to run HPCG on a single DGX Blackwell node with 8 GPUs, partitioned into 16 MIG instances.

      # Note: sudo access is needed to configure MIG
      # Activate MIG on each GPU and partition each GPU into two instances
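      # (GPU instance profile IDs can vary by GPU model; list the available profiles with: nvidia-smi mig -lgip)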
      for i in {0..7}
      do
          sudo nvidia-smi -i $i -mig 1
          sudo nvidia-smi mig -i $i  -cgi 9,9 -C
      done
      
      # Make the 16 instances visible
      export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
      
      # Run HPCG with 16 GPUs/node
      ...
      
      # Deactivate MIG and instances
      for i in {0..7}
      do
          sudo nvidia-smi mig -i $i  -dci
          sudo nvidia-smi mig -i $i -dgi
          sudo nvidia-smi -i $i -mig 0
      done
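
To make the affinity note above concrete, a single DGX-H100 node could be launched as follows (a sketch; the problem size and run time are placeholder assumptions):

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2 \
    --mem-affinity 0:0:0:0:1:1:1:1 \
    --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111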
      

Examples:

Run NVIDIA HPCG on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using script parameters on x86:

srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2
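
The same run can also be submitted through a Slurm batch script; ready-made samples are provided in the sample-slurm directory, and a minimal sketch (site-specific options such as partition or account are omitted) could look like:

#!/bin/bash
#SBATCH -N 16
#SBATCH --ntasks-per-node=4

srun --cpu-bind=none --mpi=pmix \
    ./hpcg.sh --nx 256 --ny 256 --nz 256 --rt 2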

Running the NVIDIA HPCG Benchmarks on NVIDIA Grace Hopper, NVIDIA Grace Blackwell, and NVIDIA Grace CPU systems#

The NVIDIA HPCG Benchmark uses the same input format as the standard HPCG Benchmark. Please refer to the HPCG Benchmark documentation to get started with the HPCG software concepts and best practices.

The NVIDIA HPCG Benchmark supports CPU-only, GPU-only, and heterogeneous execution modes.

The script hpcg-aarch64.sh can be invoked on a command line or through a Slurm batch-script to launch the NVIDIA HPCG Benchmarks.

The script hpcg-aarch64.sh accepts the following parameters:

  • --dat <string> path to hpcg.dat (Required)

  • --cuda-compat manually enable CUDA forward compatibility (Optional)

  • --no-multinode enable flags for no-multinode (no-network) execution (Optional)

  • --gpu-affinity <string> colon separated list of gpu indices (Optional)

  • --cpu-affinity <string> colon separated list of cpu index ranges (Optional)

  • --mem-affinity <string> colon separated list of memory indices (Optional)

  • --ucx-affinity <string> colon separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport to use (Optional)

  • --exec-name <string> HPCG executable file (Optional)

In addition, instead of an input file, the script hpcg-aarch64.sh accepts the following parameters:

  • --nx specifies the local (to an MPI process) X dimension of the problem

  • --ny specifies the local (to an MPI process) Y dimension of the problem

  • --nz specifies the local (to an MPI process) Z dimension of the problem

  • --rt specifies how long, in seconds, the timed portion of the benchmark should run

  • --b activates benchmarking mode to bypass CPU reference execution when set to one (--b=1)

  • --l2cmp activates compression in the GPU L2 cache when set to one (--l2cmp=1)

  • --of activates writing the log to text files instead of the console (--of=1)

  • --gss specifies the slice size for the GPU rank (default is 2048)

  • --css specifies the slice size for the CPU rank (default is 8)

  • --p2p specifies the P2P communication mode: 0 MPI_CPU, 1 MPI_CPU_All2allv, 2 MPI_CUDA_AWARE, 3 MPI_CUDA_AWARE_All2allv, 4 NCCL. Default is MPI_CPU

The following parameters control the NVIDIA HPCG benchmark on NVIDIA Grace Hopper and NVIDIA Grace Blackwell systems:

  • --exm specifies the execution mode. 0 is GPU-only, 1 is Grace-only, and 2 is heterogeneous. Default is 0

  • --ddm specifies the dimension in which the GPU and Grace local problems differ. 0 is auto, 1 is X, 2 is Y, and 3 is Z. Default is 0. Note that the GPU and Grace local problems can differ in one dimension only

  • --g2c specifies the value that determines the size of the differing dimension of the GPU and Grace local problems. Its interpretation depends on the --ddm and --lpm values

  • --lpm controls the meaning of the value provided for the --g2c parameter. Applicable when --exm is 2; it acts on the differing local dimension specified by --ddm (command-line sketches of the cases below follow the list)

    Value Explanation:

    • 0 means nx/ny/nz are the GPU local dims and the g2c value is the ratio of the GPU dim to the Grace dim. For example, --nx 128 --ny 128 --nz 128 --ddm 2 --g2c 8 means the differing Grace dim (Y in this example) is 1/8 of the differing GPU dim. The GPU local problem is 128x128x128 and the Grace local problem is 128x16x128

    • 1 means nx/ny/nz are the GPU local dims and the g2c value is the absolute value of the differing dim for Grace. For example, --nx 128 --ny 128 --nz 128 --ddm 3 --g2c 64 means the differing Grace dim (Z in this example) is 64. The GPU local problem is 128x128x128 and the Grace local problem is 128x128x64

    • 2 assumes a local problem formed by combining a GPU problem and a Grace problem, so the specified size of the differing dimension is the sum of the GPU and Grace parts, and --g2c is the ratio. For example, with --ddm 1 --nx 1024 --g2c 8, the GPU X dim is 896 and the Grace X dim is 128

    • 3 also assumes a local problem formed by combining a GPU problem and a Grace problem, with the specified size of the differing dimension being the sum of the GPU and Grace parts, and --g2c is absolute. For example, with --ddm 1 --nx 1024 --g2c 96, the GPU X dim is 928 and the Grace X dim is 96
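
Translated into command-line form, the worked examples above correspond to the following sketches (only the problem-shape flags are shown; the launcher, remaining dimensions, and affinity options are omitted):

# --lpm 0: GPU local problem 128x128x128, Grace local problem 128x16x128
./hpcg-aarch64.sh --exm 2 --lpm 0 --ddm 2 --g2c 8 --nx 128 --ny 128 --nz 128

# --lpm 1: GPU local problem 128x128x128, Grace local problem 128x128x64
./hpcg-aarch64.sh --exm 2 --lpm 1 --ddm 3 --g2c 64 --nx 128 --ny 128 --nz 128

# --lpm 2: combined X dimension of 1024 split into a GPU part of 896 and a Grace part of 128
./hpcg-aarch64.sh --exm 2 --lpm 2 --ddm 1 --g2c 8 --nx 1024

# --lpm 3: combined X dimension of 1024 split into a GPU part of 928 and a Grace part of 96
./hpcg-aarch64.sh --exm 2 --lpm 3 --ddm 1 --g2c 96 --nx 1024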

Optional parameters of hpcg-aarch64.sh script:

  • --npx specifies the process grid X dimension of the problem

  • --npy specifies the process grid Y dimension of the problem

  • --npz specifies the process grid Z dimension of the problem

Heterogeneous (GPU-Grace) execution mode in depth#

The NVIDIA HPCG benchmark can run efficiently on heterogeneous systems comprising GPUs and Grace CPUs like NVIDIA Grace Hopper and NVIDIA Grace Blackwell. The approach involves assigning an MPI rank to each GPU and one or more MPI ranks to the Grace CPU. Given that the GPU performs significantly faster than the Grace CPU, the strategy is to allocate a larger local problem size to the GPU compared to the Grace CPU. This ensures that during MPI blocking communication steps like MPI_Allreduce, the GPU’s execution is not interrupted by the Grace CPU’s slower execution.

In the NVIDIA HPCG benchmark, the GPU and Grace local problems are configured to differ in only one dimension while keeping the other dimensions identical. This design enables proper halo exchange operations across the dimensions that remain identical between the GPU and Grace ranks. The image below depicts an example of this design. The GPU and Grace ranks have the same X and Y dimensions, where the halo exchange takes place. The Z dimension is different, which enables assigning different local problems to the GPU and Grace ranks. The NVIDIA HPCG benchmark has the flexibility to choose the 3D shape of the ranks, choose the differing dimension, and configure the sizes of the GPU and Grace ranks.

[Figure: GPU and Grace ranks share the X and Y dimensions, where halo exchange occurs, and differ in the Z dimension]

Notes:

  • The NVIDIA HPCG benchmark requires one GPU per MPI process. Therefore, ensure that the number of MPI processes is set to match the number of available GPUs in the cluster.

  • It is recommended to bind each MPI process to a NUMA node on the NVIDIA Grace CPU, for example:

    ./hpcg-aarch64.sh --dat /my-dat-files/hpcg.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
    

Examples:

Run NVIDIA HPCG on two nodes of NVIDIA Grace CPU using custom parameters:

srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpcg-aarch64.sh --exm 1 --nx 512 --ny 512 --nz 288 --rt 30 --cpu-affinity 0-35:36-71:72-107:108-143 --mem-affinity 0:0:1:1
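
A GPU-only run (--exm 0) on two NVIDIA Grace Hopper x4 nodes might look like the following sketch (the problem size, run time, and affinity values are assumptions to adapt to your system):

srun -N 2 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpcg-aarch64.sh --exm 0 --nx 256 --ny 256 --nz 256 --rt 2 \
    --gpu-affinity 0:1:2:3 --cpu-affinity 0-71:72-143:144-215:216-287 --mem-affinity 0:1:2:3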

Run NVIDIA HPCG on two NVIDIA Grace Hopper x4 nodes:

#GPU+Grace (Heterogeneous execution)
#GPU rank has 8 OpenMP threads and Grace rank has 64 OpenMP threads
srun -N 2 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpcg-aarch64.sh --nx 256 --ny 1024 --nz 288 --rt 2 \
    --exm 2 --ddm 2 --lpm 1 --g2c 64 \
    --npx 4 --npy 4 --npz 1 \
    --cpu-affinity 0-7:8-71:72-79:80-143:144-151:152-215:216-223:224-287 \
    --mem-affinity 0:0:1:1:2:2:3:3

where --cpu-affinity maps ranks to cores on the local node and --mem-affinity maps ranks to NUMA nodes on the local node.
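
The core ranges and NUMA-node indices used for these affinity flags can be read from the node topology with standard tools, for example (generic commands, not part of the benchmark package):

# List NUMA nodes and the CPU cores they contain
numactl --hardware

# Show the GPU-to-CPU and NUMA affinity matrix
nvidia-smi topo -m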