NVIDIA HPL-MxP Benchmark#

x86 package folder structure#

The hpl-mxp.sh script in the root directory of the package invokes the xhpl_mxp executable.

The NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-x86_64 contains:

  • xhpl_mxp executable

  • Samples of Slurm batch-job scripts in sample-slurm directory

  • README, RUNNING, and TUNING guides
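
Put together, the unpacked x86_64 package has roughly the following layout (an illustrative sketch based on the list above; exact file names and contents may vary between releases):

    hpl-mxp.sh                    # launcher script (root of the package)
    hpl-mxp-linux-x86_64/
        xhpl_mxp                  # NVIDIA HPL-MxP executable
        sample-slurm/             # sample Slurm batch-job scripts
        README  RUNNING  TUNING   # documentation and tuning guides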

aarch64 package folder structure#

The hpl-mxp-aarch64.sh script in the root directory of the package invokes the xhpl_mxp executable for NVIDIA Grace CPU.

The hpl-mxp.sh script in the root directory of the package invokes the xhpl_mxp executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell.

  • NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-aarch64 contains:

    • xhpl_mxp executable for NVIDIA Grace CPU

    • Samples of Slurm batch-job scripts in sample-slurm directory

    • README and RUNNING guides

  • NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-aarch64-gpu contains:

    • xhpl_mxp executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell

    • Samples of Slurm batch-job scripts in sample-slurm directory

    • README, RUNNING, and TUNING guides
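
Put together, the unpacked aarch64 package has roughly the following layout (an illustrative sketch based on the lists above; exact file names and contents may vary between releases):

    hpl-mxp-aarch64.sh            # launcher for the NVIDIA Grace CPU build
    hpl-mxp.sh                    # launcher for the Grace Hopper / Grace Blackwell build
    hpl-mxp-linux-aarch64/
        xhpl_mxp                  # NVIDIA Grace CPU executable
        sample-slurm/             # sample Slurm batch-job scripts
        README  RUNNING
    hpl-mxp-linux-aarch64-gpu/
        xhpl_mxp                  # Grace Hopper / Grace Blackwell executable
        sample-slurm/             # sample Slurm batch-job scripts
        README  RUNNING  TUNING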

Running the NVIDIA HPL-MxP Benchmarks on x86_64 with NVIDIA GPUs, NVIDIA Grace Hopper, and NVIDIA Grace Blackwell systems#

The NVIDIA HPL-MxP Benchmark accepts a list of parameters that describe the input task and set additional tuning options.

The script hpl-mxp.sh can be invoked on the command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark.

The script hpl-mxp.sh requires the following parameters:

  • --gpu-affinity <string> colon separated list of GPU indices

  • --nprow <int> number of rows in the processor grid

  • --npcol <int> number of columns in the processor grid

  • --nporder <string> “row” or “column” major layout of the processor grid

  • --n <int> size of N-by-N matrix

  • --nb <int> nb is the blocking constant (panel size)
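
The processor grid must cover all MPI ranks: the product of --nprow and --npcol equals the total number of MPI ranks, with one rank per GPU (as in the examples below). A minimal shell sanity check, assuming a 4-node run with 8 GPUs per node:

    # Illustrative sanity check: the grid must match the MPI launch exactly.
    NODES=4; GPUS_PER_NODE=8          # assumed cluster shape
    NPROW=8; NPCOL=4                  # grid passed to --nprow/--npcol
    RANKS=$(( NODES * GPUS_PER_NODE ))
    if [ $(( NPROW * NPCOL )) -ne "${RANKS}" ]; then
        echo "error: ${NPROW}x${NPCOL} grid does not match ${RANKS} MPI ranks" >&2
    fi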

Optional parameters:

  • --cpu-affinity <string> colon separated list of cpu index ranges

  • --mem-affinity <string> colon separated list of memory indices

  • --ucx-affinity <string> colon separated list of UCX devices

  • --ucx-tls <string> UCX transport to use

  • --exec-name <string> HPL-MxP executable file

In addition, the script hpl-mxp.sh accepts the following tuning parameters:

  • --tolerance <float> tolerance of the HPL harness (default 1e-12)

  • --test-loop <int> number of test loops to run (default 1)

  • --preset-gemm-kernel <int> type of preset gemm kernel to use: 0 (none) or 80 (SM80) (default 0)

  • --u-panel-chunk-nbs <int> U panel chunk size, given in units of NB (default 8)

  • --call-dgemv-with-multiple-threads <int> number of rows each host thread works on if calling dgemv with multiple threads: 0 indicates using only one thread to call dgemv (default 0)

  • --prioritize-trsm <int> whether GEMMs wait for U TRSMs (default 0)

  • --prioritize-factorization <int> whether GEMMs wait for factorizations (default 0)

  • --use-separate-stream-for-gemm <int> whether to use a separate stream for GEMMs (default 1)

  • --use-mpi-panel-broadcast <int> percent of steps using MPI: e.g., 30 means the first 30 percent of steps use MPI; if the value is 0, NCCL will be used (default 1)

  • --sloppy-type <int> size of the sloppy type: 1 (FP8) or 2 (FP16) (default 2)

  • --Anq-device <int> number of columns of the FP64 matrix that are placed on the device (default 0)

  • --fill-device <int> whether or not to fill the device with the FP64 matrix; this option overrides --Anq-device (default 0)

  • --fill-device-buffer-size <int> when --fill-device=1, specifies the size of the buffer zone to be left on the device (in MB) (default 3048)

  • --cuda-host-register-step <int> number of columns of the FP64 matrix that are cudaHostRegister’d at a time (default 2048)

  • --mpi-use-mpi <int> fall back to using MPI_Bcast (default 0)

  • --use-host-mpi do not use CUDA-aware MPI

  • --monitor-gpu <int> whether to monitor GPUs during the run (default 0)

  • --monitor-gpu-interval <float> time interval with which GPUs are monitored [seconds] (default 1)

  • --monitor-gpu-clock-warning <int> GPU clock below which a warning is given [MHz] (default 0)

  • --monitor-gpu-power-warning <int> GPU power above which a warning is given [W] (default 0)

  • --monitor-gpu-temp-warning <int> GPU temperature above which a warning is given [C] (default 0)

  • --monitor-gpu-pcie-width-warning <int> PCIe width below which a warning is given (default 0)

  • --monitor-gpu-pcie-gen-warning <int> PCIe generation below which a warning is given (default 0)

  • -h,--help Print this help message and exit
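
For illustration only, tuning options are simply appended to the launch command. The values in the sketch below are placeholders, not recommendations; consult the TUNING guide shipped with the package for guidance:

    # Illustrative single-node run with a few tuning options appended.
    # All tuning values here are placeholders, not recommended settings.
    srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
        ./hpl-mxp.sh --n 140000 --nb 2048 --nprow 4 --npcol 2 --nporder row \
        --gpu-affinity 0:1:2:3:4:5:6:7 \
        --sloppy-type 2 --u-panel-chunk-nbs 8 --use-mpi-panel-broadcast 30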

Notes:

  • Affinity examples (see the topology-inspection commands after these notes):

    • DGX-H100 and DGX-B200:

      --mem-affinity 0:0:0:0:1:1:1:1 --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111
      
    • DGX-A100:

      --mem-affinity 2:3:0:1:6:7:4:5 --cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
      
  • The NVIDIA HPL-MxP benchmark expects one GPU per MPI process. As such, set the number of MPI processes to match the number of available GPUs in the cluster.
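
The affinity strings in the examples above can be derived from the node topology. A quick inspection sketch using standard tools (not part of the benchmark package):

    # Inspect GPU-to-CPU/NUMA topology on a node before choosing
    # --gpu-affinity, --cpu-affinity, and --mem-affinity.
    nvidia-smi topo -m        # per-GPU "CPU Affinity" and "NUMA Affinity" columns
    numactl --hardware        # NUMA nodes and the cores assigned to each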

Examples:

Run NVIDIA HPL-MxP on 4 nodes, each with 8 GPUs:

srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpl-mxp.sh --n 280000 --nb 2048 --nprow 8 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7

Run NVIDIA HPL-MxP on 4 nodes of NVIDIA Grace Hopper:

srun -N 4 --ntasks-per-node=1 \
    ./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 2 --nporder row \
    --cpu-affinity 0-71 --mem-affinity 0 --gpu-affinity 0

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
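
The same run can also be submitted through a Slurm batch script. A minimal sketch equivalent to the first example is shown below; the partition name and time limit are placeholders, and the sample-slurm directory in the package contains ready-made scripts:

    #!/bin/bash
    #SBATCH -N 4                      # 4 nodes
    #SBATCH --ntasks-per-node=8       # one MPI rank per GPU
    #SBATCH -p <partition>            # placeholder: site-specific partition
    #SBATCH -t 01:00:00               # placeholder: wall-clock limit

    srun --cpu-bind=none --mpi=pmix \
        ./hpl-mxp.sh --n 280000 --nb 2048 --nprow 8 --npcol 4 --nporder row \
        --gpu-affinity 0:1:2:3:4:5:6:7 \
        --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111 \
        --mem-affinity 0:0:0:0:1:1:1:1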

Running the NVIDIA HPL-MxP Benchmarks on NVIDIA Grace CPU only systems#

The script hpl-mxp-aarch64.sh can be invoked on the command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark for NVIDIA Grace CPU.

The script hpl-mxp-aarch64.sh requires the following parameters:

  • --nprow <int> number of rows in the processor grid

  • --npcol <int> number of columns in the processor grid

  • --nporder <string> “row” or “column” major layout of the processor grid

  • --n <int> size of N-by-N matrix

  • --nb <int> nb is the blocking constant (panel size)

The full list of accepted parameters can be found in the README and TUNING files.

Optional parameters:

  • --cpu-affinity <string> colon separated list of cpu index ranges

  • --mem-affinity <string> colon separated list of memory indices

  • --exec-name <string> HPL-MxP executable file

  • --tolerance <float> Tolerance of the HPL harness. [Default value = 1e-12]

  • --u-panel-chunk-nbs <int> U panel chunk size, given in units of NB. [Default value = 8]

Notes:

  • It is recommended to bind each MPI process to a NUMA node on NVIDIA Grace CPU, for example:

    ./hpl-mxp-aarch64.sh --n 16384 --nb 1023 --nprow 1 --npcol 1 --cpu-affinity 0-71:72-143 --mem-affinity 0:1
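
A quick way to confirm the NUMA layout before choosing the affinity lists; on a Grace CPU Superchip this typically reports two NUMA nodes covering cores 0-71 and 72-143, matching the example above:

    numactl --hardware        # shows NUMA nodes and the cores assigned to each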
    

Examples:

Run NVIDIA HPL-MxP on 4 nodes of NVIDIA Grace CPU:

srun -N 4 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
    ./hpl-mxp-aarch64.sh --n 180000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
    --cpu-affinity 0-71:72-143 --mem-affinity 0:1

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
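
As with the GPU variant, this run can be submitted as a Slurm batch job. A minimal sketch; the partition name and time limit are placeholders, and the sample-slurm directory contains ready-made scripts:

    #!/bin/bash
    #SBATCH -N 4                      # 4 NVIDIA Grace CPU nodes
    #SBATCH --ntasks-per-node=2       # one MPI rank per NUMA node
    #SBATCH -p <partition>            # placeholder: site-specific partition
    #SBATCH -t 01:00:00               # placeholder: wall-clock limit

    srun --cpu-bind=none --mpi=pmix \
        ./hpl-mxp-aarch64.sh --n 180000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
        --cpu-affinity 0-71:72-143 --mem-affinity 0:1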