NVIDIA HPL-MxP Benchmark#

x86 package folder structure#

The hpl-mxp.sh script in the root directory of the package invokes the xhpl_mxp executable.

The NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-x86_64 contains:

  • xhpl_mxp executable

  • Samples of Slurm batch-job scripts in sample-slurm directory

  • README, RUNNING, and TUNING guides
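
Put together, the unpacked x86_64 package has roughly the following layout (an illustrative sketch based on the list above; exact file names and contents may vary between releases):

    hpl-mxp.sh                    # launcher script (root of the package)
    hpl-mxp-linux-x86_64/
        xhpl_mxp                  # NVIDIA HPL-MxP executable
        sample-slurm/             # sample Slurm batch-job scripts
        README  RUNNING  TUNING   # documentation and tuning guides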

aarch64 package folder structure#

The hpl-mxp-aarch64.sh script in the root directory of the package invokes the xhpl_mxp executable for NVIDIA Grace CPU.

The hpl-mxp.sh script in the root directory of the package invokes the xhpl_mxp executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell.

  • NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-aarch64 contains:

    • xhpl_mxp executable for NVIDIA Grace CPU

    • Samples of Slurm batch-job scripts in sample-slurm directory

    • README and RUNNING guides

  • NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-aarch64-gpu contains:

    • xhpl_mxp executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell

    • Samples of Slurm batch-job scripts in sample-slurm directory

    • README, RUNNING, and TUNING guides
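
Put together, the unpacked aarch64 package has roughly the following layout (an illustrative sketch based on the lists above; exact file names and contents may vary between releases):

    hpl-mxp-aarch64.sh            # launcher for the NVIDIA Grace CPU build
    hpl-mxp.sh                    # launcher for the Grace Hopper / Grace Blackwell build
    hpl-mxp-linux-aarch64/
        xhpl_mxp                  # NVIDIA Grace CPU executable
        sample-slurm/             # sample Slurm batch-job scripts
        README  RUNNING
    hpl-mxp-linux-aarch64-gpu/
        xhpl_mxp                  # Grace Hopper / Grace Blackwell executable
        sample-slurm/             # sample Slurm batch-job scripts
        README  RUNNING  TUNING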

Running the NVIDIA HPL-MxP Benchmarks on x86_64 with NVIDIA GPUs, NVIDIA Grace Hopper, and NVIDIA Grace Blackwell systems#

The NVIDIA HPL-MxP Benchmark accepts a list of parameters that describe the input task and set additional tuning options.

The script hpl-mxp.sh can be invoked on the command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark.

The script hpl-mxp.sh requires the following parameters:

  • --gpu-affinity <string> colon separated list of GPU indices

  • --nprow <int> number of rows in the processor grid

  • --npcol <int> number of columns in the processor grid

  • --nporder <string> “row” or “column” major layout of the processor grid

  • --n <int> size of N-by-N matrix

  • --nb <int> nb is the blocking constant (panel size)
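
The processor grid must cover all MPI ranks: the product of --nprow and --npcol equals the total number of MPI ranks, with one rank per GPU (as in the examples below). A minimal shell sanity check, assuming a 4-node run with 8 GPUs per node:

    # Illustrative sanity check: the grid must match the MPI launch exactly.
    NODES=4; GPUS_PER_NODE=8          # assumed cluster shape
    NPROW=8; NPCOL=4                  # grid passed to --nprow/--npcol
    RANKS=$(( NODES * GPUS_PER_NODE ))
    if [ $(( NPROW * NPCOL )) -ne "${RANKS}" ]; then
        echo "error: ${NPROW}x${NPCOL} grid does not match ${RANKS} MPI ranks" >&2
    fi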

Optional parameters:

  • --cpu-affinity <string> colon separated list of cpu index ranges

  • --mem-affinity <string> colon separated list of memory indices

  • --ucx-affinity <string> colon separated list of UCX devices

  • --ucx-tls <string> UCX transport to use

  • --exec-name <string> HPL-MxP executable file

In addition, the script hpl-mxp.sh accepts the following tuning parameters:

  • --tolerance <float> tolerance of the HPL harness (default 1e-12)

  • --test-loop <int> number of test loops to run (default 1)

  • --preset-gemm-kernel <int> type of preset gemm kernel to use: 0 (none) or 80 (SM80) (default 0)

  • --u-panel-chunk-nbs <int> U panel chunk size, given in units of NB (default 8)

  • --call-dgemv-with-multiple-threads <int> number of rows each host thread works on if calling dgemv with multiple threads: 0 indicates using only one thread to call dgemv (default 0)

  • --prioritize-trsm <int> whether GEMMs wait for U TRSMs (default 0)

  • --prioritize-factorization <int> whether GEMMs wait for factorizations (default 0)

  • --use-separate-stream-for-gemm <int> whether to use a separate stream for GEMMs (default 1)

  • --use-mpi-panel-broadcast <int> percent of steps using MPI: e.g., 30 means the first 30 percent of steps use MPI; if the value is 0, NCCL will be used (default 1)

  • --sloppy-type <int> size of the sloppy type: 1 (FP8) or 2 (FP16) (default 2)

  • --Anq-device <int> number of columns of the FP64 matrix that are placed on the device (default 0)

  • --fill-device <int> whether or not to fill the device with the FP64 matrix; this option overrides --Anq-device (default 0)

  • --fill-device-buffer-size <int> when --fill-device=1, specifies the size of the buffer zone to be left on the device (in MB) (default 3048)

  • --cuda-host-register-step <int> number of columns of the FP64 matrix that are cudaHostRegister’d at a time (default 2048)

  • --mpi-use-mpi <int> fall back to using MPI_Bcast (default 0)

  • --use-host-mpi do not use CUDA-aware MPI

  • --monitor-gpu <int> whether to monitor GPUs during the run (default 0)

  • --monitor-gpu-interval <float> time interval with which GPUs are monitored [seconds] (default 1)

  • --monitor-gpu-clock-warning <int> GPU clock below which a warning is given [MHz] (default 0)

  • --monitor-gpu-power-warning <int> GPU power above which a warning is given [W] (default 0)

  • --monitor-gpu-temp-warning <int> GPU temperature above which a warning is given [C] (default 0)

  • --monitor-gpu-pcie-width-warning <int> PCIe width below which a warning is given (default 0)

  • --monitor-gpu-pcie-gen-warning <int> PCIe generation below which a warning is given (default 0)

  • -h,--help Print this help message and exit
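
For illustration only, tuning options are simply appended to the launch command. The values in the sketch below are placeholders, not recommendations; consult the TUNING guide shipped with the package for guidance:

    # Illustrative single-node run with a few tuning options appended.
    # All tuning values here are placeholders, not recommended settings.
    srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
        ./hpl-mxp.sh --n 140000 --nb 2048 --nprow 4 --npcol 2 --nporder row \
        --gpu-affinity 0:1:2:3:4:5:6:7 \
        --sloppy-type 2 --u-panel-chunk-nbs 8 --use-mpi-panel-broadcast 30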

Notes:

  • Affinity examples (see the topology-inspection commands after these notes):

    • DGX-H100 and DGX-B200:

      --mem-affinity 0:0:0:0:1:1:1:1 --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111
      
    • DGX-A100:

      --mem-affinity 2:3:0:1:6:7:4:5 --cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
      
  • The NVIDIA HPL-MxP benchmark expects one GPU per MPI process. As such, set the number of MPI processes to match the number of available GPUs in the cluster.
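
The affinity strings in the examples above can be derived from the node topology. A quick inspection sketch using standard tools (not part of the benchmark package):

    # Inspect GPU-to-CPU/NUMA topology on a node before choosing
    # --gpu-affinity, --cpu-affinity, and --mem-affinity.
    nvidia-smi topo -m        # per-GPU "CPU Affinity" and "NUMA Affinity" columns
    numactl --hardware        # NUMA nodes and the cores assigned to each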

Examples:

Run NVIDIA HPL-MxP on 4 nodes, each with 8 GPUs:

srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpl-mxp.sh --n 280000 --nb 2048 --nprow 8 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7

Run NVIDIA HPL-MxP on 4 nodes of NVIDIA Grace Hopper:

srun -N 4 --ntasks-per-node=1 \
    ./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 2 --nporder row \
    --cpu-affinity 0-71 --mem-affinity 0 --gpu-affinity 0

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
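
The same run can also be submitted through a Slurm batch script. A minimal sketch equivalent to the first example is shown below; the partition name and time limit are placeholders, and the sample-slurm directory in the package contains ready-made scripts:

    #!/bin/bash
    #SBATCH -N 4                      # 4 nodes
    #SBATCH --ntasks-per-node=8       # one MPI rank per GPU
    #SBATCH -p <partition>            # placeholder: site-specific partition
    #SBATCH -t 01:00:00               # placeholder: wall-clock limit

    srun --cpu-bind=none --mpi=pmix \
        ./hpl-mxp.sh --n 280000 --nb 2048 --nprow 8 --npcol 4 --nporder row \
        --gpu-affinity 0:1:2:3:4:5:6:7 \
        --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111 \
        --mem-affinity 0:0:0:0:1:1:1:1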

Running the NVIDIA HPL-MxP Benchmarks on NVIDIA Grace CPU only systems#

The script hpl-mxp-aarch64.sh can be invoked on the command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark for NVIDIA Grace CPU.

The script hpl-mxp-aarch64.sh requires the following parameters:

  • --nprow <int> number of rows in the processor grid

  • --npcol <int> number of columns in the processor grid

  • --nporder <string> “row” or “column” major layout of the processor grid

  • --n <int> size of N-by-N matrix

  • --nb <int> nb is the blocking constant (panel size)

The full list of accepted parameters can be found in the README and TUNING files.

Optional parameters:

  • --cpu-affinity <string> colon separated list of cpu index ranges

  • --mem-affinity <string> colon separated list of memory indices

  • --exec-name <string> HPL-MxP executable file

  • --tolerance <float> Tolerance of the HPL harness. [Default value = 1e-12]

  • --u-panel-chunk-nbs <int> U panel chunk size, given in units of NB. [Default value = 8]

Notes:

  • It is recommended to bind each MPI process to a NUMA node on NVIDIA Grace CPU, for example:

    ./hpl-mxp-aarch64.sh --n 16384 --nb 1023 --nprow 1 --npcol 1 --cpu-affinity 0-71:72-143 --mem-affinity 0:1
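
A quick way to confirm the NUMA layout before choosing the affinity lists; on a Grace CPU Superchip this typically reports two NUMA nodes covering cores 0-71 and 72-143, matching the example above:

    numactl --hardware        # shows NUMA nodes and the cores assigned to each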
    

Examples:

Run NVIDIA HPL-MxP on 4 nodes of NVIDIA Grace CPU:

srun -N 4 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
    ./hpl-mxp-aarch64.sh --n 180000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
    --cpu-affinity 0-71:72-143 --mem-affinity 0:1

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
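
As with the GPU variant, this run can be submitted as a Slurm batch job. A minimal sketch; the partition name and time limit are placeholders, and the sample-slurm directory contains ready-made scripts:

    #!/bin/bash
    #SBATCH -N 4                      # 4 NVIDIA Grace CPU nodes
    #SBATCH --ntasks-per-node=2       # one MPI rank per NUMA node
    #SBATCH -p <partition>            # placeholder: site-specific partition
    #SBATCH -t 01:00:00               # placeholder: wall-clock limit

    srun --cpu-bind=none --mpi=pmix \
        ./hpl-mxp-aarch64.sh --n 180000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
        --cpu-affinity 0-71:72-143 --mem-affinity 0:1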