NVIDIA HPL-MxP Benchmark#
x86 package folder structure#
hpl-mxp.sh
script in the root directory of the package to invoke the xhpl_mxp
executable.
The NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-x86_64 contains:
xhpl_mxp executable
Samples of Slurm batch-job scripts in the sample-slurm directory
README, RUNNING, and TUNING guides
aarch64 package folder structure#
hpl-mxp-aarch64.sh
script in the root directory of the package to invoke the xhpl_mxp
executable for NVIDIA Grace CPU.
hpl-mxp.sh
script in the root directory of the package to invoke the xhpl_mxp
executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell.
The NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-aarch64 contains:
xhpl_mxp executable for NVIDIA Grace CPU
Samples of Slurm batch-job scripts in the sample-slurm directory
README and RUNNING guides
The NVIDIA HPL-MxP Benchmark in the folder ./hpl-mxp-linux-aarch64-gpu contains:
xhpl_mxp executable for NVIDIA Grace Hopper and NVIDIA Grace Blackwell
Samples of Slurm batch-job scripts in the sample-slurm directory
README, RUNNING, and TUNING guides
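For orientation, the aarch64 package layout described above can be sketched as follows (guide file names are abbreviated and the exact contents of the shipped package may differ):
hpl-mxp-aarch64.sh                 launcher script for NVIDIA Grace CPU
hpl-mxp.sh                         launcher script for NVIDIA Grace Hopper and NVIDIA Grace Blackwell
hpl-mxp-linux-aarch64/             xhpl_mxp (Grace CPU), sample-slurm/, README, RUNNING
hpl-mxp-linux-aarch64-gpu/         xhpl_mxp (Grace Hopper/Blackwell), sample-slurm/, README, RUNNING, TUNING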
Running the NVIDIA HPL-MxP Benchmarks on x86_64 with NVIDIA GPUs, NVIDIA Grace Hopper, and NVIDIA Grace Blackwell systems#
The NVIDIA HPL-MxP Benchmark accepts a list of parameters that describe the input task and set additional tuning options.
The script hpl-mxp.sh
can be invoked from the command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark.
The script hpl-mxp.sh
requires the following parameters:
--gpu-affinity <string>
colon-separated list of GPU indices
--nprow <int>
number of rows in the processor grid
--npcol <int>
number of columns in the processor grid
--nporder <string>
“row” or “column” major layout of the processor grid
--n <int>
size of N-by-N matrix
--nb <int>
nb is the blocking constant (panel size)
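For illustration, a minimal single-process invocation that supplies only the required parameters could look like the following; the matrix size, blocking factor, and grid shape are placeholder values, not tuned recommendations:
# Illustrative values only: single process, single GPU, 1x1 process grid
./hpl-mxp.sh --gpu-affinity 0 --nprow 1 --npcol 1 --nporder row --n 90000 --nb 2048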
Optional parameters:
--cpu-affinity <string>
colon-separated list of CPU index ranges
--mem-affinity <string>
colon-separated list of memory indices
--ucx-affinity <string>
colon-separated list of UCX devices
--ucx-tls <string>
UCX transport to use
--exec-name <string>
HPL-MxP executable file
In addition, the script hpl-mxp.sh
accepts the following tuning parameters:
--tolerance <float>
tolerance of the HPL harness (default 1e-12)
--test-loop <int>
number of test loops to run (default 1)
--preset-gemm-kernel <int>
type of preset GEMM kernel to use: 0 (none) or 80 (SM80) (default 0)
--u-panel-chunk-nbs <int>
U panel chunk size given in unit of NBs (default 8)
--call-dgemv-with-multiple-threads <int>
number of rows each host thread works on if calling dgemv with multiple threads: 0 indicates using only one thread to call dgemv (default 0)
--prioritize-trsm <int>
whether GEMMs wait for U TRSMs (default 0)
--prioritize-factorization <int>
whether GEMMs wait for factorizations (default 0)
--use-separate-stream-for-gemm <int>
whether to use a separate stream for GEMMs (default 1)
--use-mpi-panel-broadcast <int>
percent of steps that use MPI for the panel broadcast: e.g., 30 means the first 30 percent of steps use MPI; if the value is 0, NCCL will be used (default 1)
--sloppy-type <int>
size of the sloppy type: 1 (FP8) or 2 (FP16) (default 2)
--Anq-device <int>
number of columns of the FP64 matrix that are placed on the device (default 0)
--fill-device <int>
whether or not to fill the device with the FP64 matrix; this option overrides --Anq-device (default 0)
--fill-device-buffer-size <int>
when --fill-device=1, specifies the size of the buffer zone that should be left on the device, in units of MB (default 3048)
--cuda-host-register-step <int>
number of columns of the FP64 matrix that are cudaHostRegister’d at a time (default 2048)
--mpi-use-mpi <int>
fall back to using MPI_Bcast (default 0)
--use-host-mpi
do not use CUDA-aware MPI
--monitor-gpu <int>
whether to monitor GPUs during the run (default 0)
--monitor-gpu-interval <float>
time interval at which GPUs are monitored [seconds] (default 1)
--monitor-gpu-clock-warning <int>
GPU clock below which a warning is issued [MHz] (default 0)
--monitor-gpu-power-warning <int>
GPU power above which a warning is issued [W] (default 0)
--monitor-gpu-temp-warning <int>
GPU temperature above which a warning is issued [C] (default 0)
--monitor-gpu-pcie-width-warning <int>
PCIe width below which a warning is issued (default 0)
--monitor-gpu-pcie-gen-warning <int>
PCIe generation below which a warning is issued (default 0)
-h, --help
Print this help message and exit
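As a sketch only, tuning parameters are appended to the required ones; for instance, extending the minimal invocation shown earlier (all values remain illustrative, and the defaults are usually a reasonable starting point):
# Illustrative values only: FP8 sloppy type and periodic GPU monitoring on top of the required flags
./hpl-mxp.sh --gpu-affinity 0 --nprow 1 --npcol 1 --nporder row --n 90000 --nb 2048 \
    --sloppy-type 1 \
    --monitor-gpu 1 --monitor-gpu-interval 5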
Notes:
Affinity example:
DGX-H100 and DGX-B200:
--mem-affinity 0:0:0:0:1:1:1:1 --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111
DGX-A100:
--mem-affinity 2:3:0:1:6:7:4:5 --cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
The NVIDIA HPL-MxP benchmark expects one GPU per MPI process. As such, set the number of MPI processes to match the number of available GPUs in the cluster.
Examples:
Run NVIDIA HPL-MxP on 4 nodes, each node with 8 GPUs:
srun -N 4 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpl-mxp.sh --n 280000 --nb 2048 --nprow 8 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
Run NVIDIA HPL-MxP on 4 nodes of NVIDIA Grace Hopper:
srun -N 4 --ntasks-per-node=1 \
    ./hpl-mxp.sh --n 380000 --nb 2048 --nprow 2 --npcol 2 --nporder row \
    --cpu-affinity 0-71 --mem-affinity 0 --gpu-affinity 0
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
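The sample-slurm directory in the package contains ready-made batch scripts. A hand-written equivalent of the first example above, wrapped in a batch script, might look roughly like the sketch below; the partition name and time limit are placeholders, and the shipped samples may differ:
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --partition=<partition>   # placeholder: site-specific partition name
#SBATCH --time=01:00:00

srun --cpu-bind=none --mpi=pmix \
    ./hpl-mxp.sh --n 280000 --nb 2048 --nprow 8 --npcol 4 --nporder row \
    --gpu-affinity 0:1:2:3:4:5:6:7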
Running the NVIDIA HPL-MxP Benchmarks on NVIDIA Grace CPU only systems#
The script hpl-mxp-aarch64.sh
can be invoked on a command line or through a Slurm batch script to launch the NVIDIA HPL-MxP benchmark for NVIDIA Grace CPU.
The script hpl-mxp-aarch64.sh requires the following parameters:
--nprow <int>
number of rows in the processor grid
--npcol <int>
number of columns in the processor grid
--nporder <string>
“row” or “column” major layout of the processor grid
--n <int>
size of N-by-N matrix
--nb <int>
nb is the blocking constant (panel size)
The full list of accepted parameters can be found in the README and TUNING files.
Optional parameters:
--cpu-affinity <string>
colon-separated list of CPU index ranges
--mem-affinity <string>
colon-separated list of memory indices
--exec-name <string>
HPL-MxP executable file
--tolerance <float>
Tolerance of the HPL harness. [Default value = 1e-12]
--u-panel-chunk-nbs <int>
U panel chunk size given in unit of NBs. [Default value = 8]
Notes:
It is recommended to bind each MPI process to a NUMA node on NVIDIA Grace CPU, for example:
./hpl-mxp-aarch64.sh --n 16384 --nb 1023 --nprow 1 --npcol 1 --cpu-affinity 0-71:72-143 --mem-affinity 0:1
Examples:
Run NVIDIA HPL-MxP on 4 nodes of NVIDIA Grace CPU:
srun -N 4 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
    ./hpl-mxp-aarch64.sh --n 180000 --nb 2048 --nprow 2 --npcol 4 --nporder row \
    --cpu-affinity 0-71:72-143 --mem-affinity 0:1
where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.