NVIDIA HPL Benchmark#

x86 package folder structure#

The hpl.sh script in the root directory of the package invokes the xhpl executable.

The NVIDIA HPL benchmark in the folder ./hpl-linux-x86_64 contains:

  • xhpl executable

  • Sample Slurm batch-job scripts in the sample-slurm directory

  • Sample input files in the sample-dat directory

  • README, RUNNING, and TUNING guides

aarch64 package folder structure#

The hpl-aarch64.sh script in the root directory of the package invokes the xhpl executable for the NVIDIA Grace CPU.

The hpl.sh script in the root directory of the package invokes the xhpl executable for NVIDIA Grace Hopper or NVIDIA Grace Blackwell.

  • The NVIDIA Grace CPU-only NVIDIA HPL benchmark in the folder ./hpl-linux-aarch64 contains:

    • xhpl executable for NVIDIA Grace CPU

    • Sample Slurm batch-job scripts in the sample-slurm directory

    • Sample input files in the sample-dat directory

    • README, RUNNING, and TUNING guides

  • The NVIDIA HPL benchmark in the folder ./hpl-linux-aarch64-gpu contains:

    • xhpl executable for NVIDIA Grace Hopper or NVIDIA Grace Blackwell

    • Sample Slurm batch-job scripts in the sample-slurm directory

    • Sample input files in the sample-dat directory

    • README, RUNNING, and TUNING guides

Running the NVIDIA HPL Benchmarks on x86_64 with NVIDIA GPUs, NVIDIA Grace Hopper, and NVIDIA Grace Blackwell systems#

The NVIDIA HPL benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark documentation to get started with HPL software concepts and best practices.

The NVIDIA HPL benchmark ignores the following parameters from the input file:

  • Broadcast parameters

  • Panel factorization parameters

  • Look-ahead value

  • L1 layout

  • U layout

  • Equilibration parameter

  • Memory alignment parameter

The hpl.sh script can be invoked on the command line or through a Slurm batch script to launch the NVIDIA HPL benchmark.

The script hpl.sh accepts the following parameters:

  • --dat <string> path to HPL.dat (Required)

  • --cuda-compat manually enable CUDA forward compatibility (Optional)

  • --gpu-affinity <string> colon separated list of gpu indices (Optional)

  • --cpu-affinity <string> colon separated list of cpu index ranges (Optional)

  • --mem-affinity <string> colon separated list of memory indices (Optional)

  • --ucx-affinity <string> colon separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport to use (Optional)

  • --exec-name <string> HPL executable file (Optional)

  • --no-multinode enable flags for no-multinode (no-network) execution (Optional)

Notes:

  • Affinity example:

    • DGX-H100 and DGX-B200:

      --mem-affinity 0:0:0:0:1:1:1:1 --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111
      
    • DGX-A100:

      --mem-affinity 2:3:0:1:6:7:4:5 --cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
      
  • The NVIDIA HPL benchmark expects one GPU per MPI process. As such, set the number of MPI processes to match the number of available GPUs in the cluster.

Examples:

Run the NVIDIA HPL benchmark on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using the provided sample HPL-64GPUs.dat file:

srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat ./hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat

Run the NVIDIA HPL benchmark on a single NVIDIA Grace Hopper x4 node:

srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat ./hpl-linux-x86_64/sample-dat/HPL-4GPUs.dat \
    --cpu-affinity 0-71:72-143:144-215:216-287 \
    --mem-affinity 0:1:2:3

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
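
For a single DGX-H100 or DGX-B200 node, the affinity values from the notes above can be combined with the same launch pattern. This is a sketch only; /my-dat-files/HPL.dat stands in for your own 8-GPU input file:

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat /my-dat-files/HPL.dat \
    --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111 \
    --mem-affinity 0:0:0:0:1:1:1:1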

NVIDIA HPL Benchmark Environment variables#

The NVIDIA HPL benchmark accepts several runtime environment variables to improve performance on different platforms.

| Variable | Default Value | Possible Values | Description |
|---|---|---|---|
| HPL_P2P_AS_BCAST | 1 | 0 (NCCL bcast), 1 (NCCL send/recv), 2 (CUDA-aware MPI), 3 (host MPI), 4 (NVSHMEM) | Which communication library to use in the final solve step |
| HPL_USE_NVSHMEM | 1 | 1 (enable), 0 (disable) | Enables/disables NVSHMEM support in HPL |
| HPL_NVSHMEM_INIT | 0 | 0 (using MPI), 1 (using unique ID (UID)) | NVSHMEM initialization type |
| HPL_FCT_COMM_POLICY | 0 | 0 (NVSHMEM), 1 (host MPI) | Which communication library to use in the panel factorization |
| HPL_NVSHMEM_SWAP | 0 | 1 (enable), 0 (disable) | Performs row swaps using NVSHMEM instead of NCCL (default) |
| HPL_CHUNK_SIZE_NBS | 16 | >0 | Number of matrix blocks (size NB) to group together for computations |
| HPL_DIST_TRSM_FLAG | 1 | 1 (enable), 0 (disable) | Perform the solve step (TRSM) in parallel rather than on only the ranks that own that part of the matrix |
| HPL_CTA_PER_FCT | 16 | >0 | Sets the number of CTAs (thread blocks) for factorization |
| HPL_ALLOC_HUGEPAGES | 0 | 1 (enable), 0 (disable) | Use 2 MB hugepages for host-side allocations. Done through the madvise syscall; requires /sys/kernel/mm/transparent_hugepage/enabled to be set to madvise to have an effect |
| WARMUP_END_PROG | 5 | -1 to 100 | Runs the main loop once before the "real" run; stops the warmup loop at x% |
| TEST_LOOPS | 1 | >0 | Runs the main loop x times |
| HPL_CUSOLVER_MP_TESTS | 1 | 1 (enable), 0 (disable) | Runs several tests of individual components of HPL (GEMMs, comms, etc.) |
| HPL_CUSOLVER_MP_TESTS_GEMM_ITERS | 128 | >0 | Number of repeated GEMM calls in tests |
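
These variables can be set on the launch command line and are exported to the MPI processes by srun by default. For example (an illustrative sketch only, not a tuning recommendation, reusing the single-node NVIDIA Grace Hopper launch from above), the following selects host MPI for the final solve step and disables NVSHMEM:

HPL_P2P_AS_BCAST=3 HPL_USE_NVSHMEM=0 \
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat ./hpl-linux-x86_64/sample-dat/HPL-4GPUs.dat \
    --cpu-affinity 0-71:72-143:144-215:216-287 \
    --mem-affinity 0:1:2:3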

NVIDIA HPL Benchmark Out-of-core mode#

The NVIDIA HPL benchmark has an out-of-core (OOC) mode. This is an opt-in feature, and the default remains the in-core mode.

The NVIDIA HPL out-of-core mode enables the use of larger matrix sizes. Unlike in the in-core mode, any matrix data that exceeds GPU memory capacity is automatically stored in host CPU memory. To activate this feature, set the environment variable HPL_OOC_MODE=1 and specify a larger matrix size (e.g., using the N parameter in the input file).

Performance will depend on host-device transfer speeds. The amount of host memory used for the matrix data should vary depending on the platform:

  • On PCIe-based systems (e.g., x86), aim to use 6-16 GiB of host memory for matrix data.

  • On NVIDIA Grace Hopper and NVIDIA Grace Blackwell systems, aim to use 50-90 GiB of host memory for matrix data.

A method to estimate the matrix size (N) for this feature is to take the largest per-GPU memory size used with the NVIDIA HPL in-core mode, add the target amount of host data, and then work out the new matrix size from this total.
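
As a rough sketch of this estimate (the numbers below are assumptions for illustration only): with 4 GPUs, about 80 GiB of in-core matrix data per GPU, and a 60 GiB per-GPU host-memory target, the total matrix size is 4 x (80 + 60) GiB, and N follows from N x N x 8 bytes of FP64 data equaling that total:

awk 'BEGIN { gpus = 4; gib_per_gpu = 80 + 60;       # assumed in-core GiB + host GiB per GPU
             bytes = gpus * gib_per_gpu * 2^30;     # total matrix bytes across the node
             printf "N ~= %d\n", sqrt(bytes / 8) }' # N*N*8 bytes of FP64 matrix data

In practice, N is then commonly rounded down to a multiple of the block size NB.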

Environment variables to set up and control the NVIDIA HPL benchmark out-of-core mode:

| Variable | Default Value | Possible Values | Description |
|---|---|---|---|
| HPL_OOC_MODE | 0 | 1 (enable), 0 (disable) | Enables/disables out-of-core mode |
| HPL_OOC_MAX_GPU_MEM | -1 | >=-1 | Limits the amount of GPU memory used for OOC (measured in GiB) |
| HPL_OOC_TILE_M | 4096 | >0 | Row blocking factor |
| HPL_OOC_TILE_N | 4096 | >0 | Column blocking factor |
| HPL_OOC_NUM_STREAMS | 3 | >0 | Number of streams used for OOC operations |
| HPL_OOC_SAFE_SIZE | 3.0 | >0 | GPU memory (in GiB) reserved for the driver; this amount will not be used by HPL OOC |

If the NVIDIA HPL out-of-core mode is enabled, it is highly recommended to pass the CPU, GPU, and memory affinity arguments to hpl.sh.

If you experience GPU out-of-memory issues with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (and therefore not used by HPL OOC) by adjusting the HPL_OOC_SAFE_SIZE environment variable. The default value is 3.0 (the buffer size in GiB); depending on the GPU and driver, you may need to increase it further to resolve memory issues.
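
A hedged example launch for out-of-core mode (the values below are illustrative assumptions), reusing the single-node NVIDIA Grace Hopper example with affinity arguments and a slightly larger driver reservation; /my-dat-files/HPL-ooc.dat stands in for your own input file with the larger N:

HPL_OOC_MODE=1 HPL_OOC_SAFE_SIZE=4.0 \
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat /my-dat-files/HPL-ooc.dat \
    --cpu-affinity 0-71:72-143:144-215:216-287 \
    --mem-affinity 0:1:2:3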

NVIDIA HPL Benchmark with FP64 emulation#

The NVIDIA HPL benchmark supports an FP64 emulation mode [1] on the NVIDIA Blackwell GPU architecture, using the techniques described in [2]. This is an opt-in feature, and the default remains the use of native FP64 computations.

Environment variables to set up and control the NVIDIA HPL Benchmark FP64 emulation mode:

| Variable | Default Value | Possible Values | Description |
|---|---|---|---|
| HPL_EMULATE_DOUBLE_PRECISION | 0 | 1 (enable), 0 (disable) | Enables/disables FP64 emulation mode |
| HPL_DOUBLE_PRECISION_EMULATION_MANTISSA_BIT_COUNT | 53 | >0 | The maximum number of mantissa bits to be used in FP64 emulation [2] (includes the IEEE FP64 standard's implicit bit) |

Notes:

  • The number of slices (INT8 data elements [2]) can be calculated as: nSlices = ceildiv((mantissaBitCount + 1), sizeofBits(INT8)), where the additional bit is used for the sign (+/-) of the value.

  • In the current iteration of the NVIDIA HPL benchmark, FP64 emulation utilizes INT8 data elements and compute resources. This may change in future releases.
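
For example, with the default of 53 mantissa bits, nSlices = ceildiv(53 + 1, 8) = 7 INT8 slices. As an illustrative sketch (not a tuning recommendation), FP64 emulation can be enabled on the launch command line in the same way as the other environment variables; /my-dat-files/HPL.dat stands in for your own input file:

HPL_EMULATE_DOUBLE_PRECISION=1 \
HPL_DOUBLE_PRECISION_EMULATION_MANTISSA_BIT_COUNT=53 \
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat /my-dat-files/HPL.dat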

Running the NVIDIA HPL Benchmarks on NVIDIA Grace CPU only systems#

The NVIDIA HPL benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark documentation to get started with HPL software concepts and best practices.

The hpl-aarch64.sh script can be invoked on the command line or through a Slurm batch script to launch the NVIDIA HPL benchmark.

The hpl-aarch64.sh script accepts the following parameters:

  • --dat <string> path to HPL.dat (Required)

  • --gpu-affinity <string> colon separated list of GPU index ranges (Optional)

  • --cpu-affinity <string> colon separated list of CPU index ranges (Optional)

  • --mem-affinity <string> colon separated list of memory indices (Optional)

  • --ucx-affinity <string> colon separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport to use (Optional)

  • --exec-name <string> HPL executable file (Optional)

  • --no-multinode enable flags for no-multinode (no-network) execution (Optional)

Notes:

  • It is recommended to bind each MPI process to a NUMA node on the NVIDIA Grace CPU, for example:

    ./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
    

Examples:

Run the NVIDIA HPL benchmark on two NVIDIA Grace CPU nodes using your custom HPL.dat file:

srun -N 2 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
    ./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.

References#

[1] Hardware Trends Impacting Floating-Point Computations In Scientific Applications

[2] Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit