NVIDIA HPL Benchmark#

x86 package folder structure#

The hpl.sh script in the root directory of the package invokes the xhpl executable.

The NVIDIA HPL benchmark in the folder ./hpl-linux-x86_64 contains:

  • xhpl executable

  • Sample Slurm batch-job scripts in the sample-slurm directory

  • Sample input files in the sample-dat directory

  • README, RUNNING, and TUNING guides

aarch64 package folder structure#

The hpl-aarch64.sh script in the root directory of the package invokes the xhpl executable for the NVIDIA Grace CPU.

The hpl.sh script in the root directory of the package invokes the xhpl executable for NVIDIA Grace Hopper or NVIDIA Grace Blackwell.

  • The NVIDIA Grace CPU-only NVIDIA HPL benchmark in the folder ./hpl-linux-aarch64 contains:

    • xhpl executable for NVIDIA Grace CPU

    • Sample Slurm batch-job scripts in the sample-slurm directory

    • Sample input files in the sample-dat directory

    • README, RUNNING, and TUNING guides

  • The NVIDIA HPL benchmark in the folder ./hpl-linux-aarch64-gpu contains:

    • xhpl executable for NVIDIA Grace Hopper or NVIDIA Grace Blackwell

    • Sample Slurm batch-job scripts in the sample-slurm directory

    • Sample input files in the sample-dat directory

    • README, RUNNING, and TUNING guides

Running the NVIDIA HPL Benchmarks on x86_64 with NVIDIA GPUs, NVIDIA Grace Hopper, and NVIDIA Grace Blackwell systems#

The NVIDIA HPL benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark documentation to get started with HPL software concepts and best practices.

The NVIDIA HPL benchmark ignores the following parameters from the input file:

  • Broadcast parameters

  • Panel factorization parameters

  • Look-ahead value

  • L1 layout

  • U layout

  • Equilibration parameter

  • Memory alignment parameter

The hpl.sh script can be invoked on the command line or through a Slurm batch script to launch the NVIDIA HPL benchmark.

The script hpl.sh accepts the following parameters:

  • --dat <string> path to HPL.dat (Required)

  • --cuda-compat manually enable CUDA forward compatibility (Optional)

  • --gpu-affinity <string> colon separated list of gpu indices (Optional)

  • --cpu-affinity <string> colon separated list of cpu index ranges (Optional)

  • --mem-affinity <string> colon separated list of memory indices (Optional)

  • --ucx-affinity <string> colon separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport to use (Optional)

  • --exec-name <string> HPL executable file (Optional)

  • --no-multinode enable flags for no-multinode (no-network) execution (Optional)

Notes:

  • Affinity example:

    • DGX-H100 and DGX-B200:

      --mem-affinity 0:0:0:0:1:1:1:1 --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111
      
    • DGX-A100:

      --mem-affinity 2:3:0:1:6:7:4:5 --cpu-affinity 32-47:48-63:0-15:16-31:96-111:112-127:64-79:80-95
      
  • The NVIDIA HPL benchmark expects one GPU per MPI process. As such, set the number of MPI processes to match the number of available GPUs in the cluster.

Examples:

Run the NVIDIA HPL benchmark on 16 nodes with 4 GPUs each (or 8 nodes with 8 GPUs each) using the provided sample HPL-64GPUs.dat file:

srun -N 16 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat ./hpl-linux-x86_64/sample-dat/HPL-64GPUs.dat

Run the NVIDIA HPL benchmark on a single NVIDIA Grace Hopper x4 node:

srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat ./hpl-linux-x86_64/sample-dat/HPL-4GPUs.dat \
    --cpu-affinity 0-71:72-143:144-215:216-287 \
    --mem-affinity 0:1:2:3

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.
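
For a single DGX-H100 or DGX-B200 node, the affinity values from the notes above can be combined with the same launch pattern. This is a sketch only; /my-dat-files/HPL.dat stands in for your own 8-GPU input file:

srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat /my-dat-files/HPL.dat \
    --cpu-affinity 0-13:14-27:28-41:42-55:56-69:70-83:84-97:98-111 \
    --mem-affinity 0:0:0:0:1:1:1:1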

NVIDIA HPL Benchmark Environment variables#

The NVIDIA HPL benchmark accepts several runtime environment variables to improve performance on different platforms.

| Variable | Default Value | Possible Values | Description |
|---|---|---|---|
| HPL_P2P_AS_BCAST | 1 | 0 (NCCL bcast), 1 (NCCL send/recv), 2 (CUDA-aware MPI), 3 (host MPI), 4 (NVSHMEM) | Which communication library to use in the final solve step |
| HPL_USE_NVSHMEM | 1 | 1 (enable), 0 (disable) | Enables/disables NVSHMEM support in HPL |
| HPL_NVSHMEM_INIT | 0 | 0 (using MPI), 1 (using unique ID (UID)) | NVSHMEM initialization type |
| HPL_FCT_COMM_POLICY | 0 | 0 (NVSHMEM), 1 (host MPI) | Which communication library to use in the panel factorization |
| HPL_NVSHMEM_SWAP | 0 | 1 (enable), 0 (disable) | Performs row swaps using NVSHMEM instead of NCCL (default) |
| HPL_CHUNK_SIZE_NBS | 16 | >0 | Number of matrix blocks (size NB) to group together for computations |
| HPL_DIST_TRSM_FLAG | 1 | 1 (enable), 0 (disable) | Perform the solve step (TRSM) in parallel rather than on only the ranks that own that part of the matrix |
| HPL_CTA_PER_FCT | 16 | >0 | Sets the number of CTAs (thread blocks) for factorization |
| HPL_ALLOC_HUGEPAGES | 0 | 1 (enable), 0 (disable) | Use 2 MB hugepages for host-side allocations. Done through the madvise syscall; requires /sys/kernel/mm/transparent_hugepage/enabled to be set to madvise to have an effect |
| WARMUP_END_PROG | 5 | -1 to 100 | Runs the main loop once before the "real" run; stops the warmup loop at x% |
| TEST_LOOPS | 1 | >0 | Runs the main loop x times |
| HPL_CUSOLVER_MP_TESTS | 1 | 1 (enable), 0 (disable) | Runs several tests of individual components of HPL (GEMMs, comms, etc.) |
| HPL_CUSOLVER_MP_TESTS_GEMM_ITERS | 128 | >0 | Number of repeated GEMM calls in tests |
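
These variables can be set on the launch command line and are exported to the MPI processes by srun by default. For example (an illustrative sketch only, not a tuning recommendation, reusing the single-node NVIDIA Grace Hopper launch from above), the following selects host MPI for the final solve step and disables NVSHMEM:

HPL_P2P_AS_BCAST=3 HPL_USE_NVSHMEM=0 \
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat ./hpl-linux-x86_64/sample-dat/HPL-4GPUs.dat \
    --cpu-affinity 0-71:72-143:144-215:216-287 \
    --mem-affinity 0:1:2:3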

NVIDIA HPL Benchmark Out-of-core mode#

The NVIDIA HPL benchmark has an out-of-core (OOC) mode. This is an opt-in feature, and the default remains the in-core mode.

The NVIDIA HPL out-of-core mode enables the use of larger matrix sizes. Unlike in the in-core mode, any matrix data that exceeds GPU memory capacity is automatically stored in host CPU memory. To activate this feature, set the environment variable HPL_OOC_MODE=1 and specify a larger matrix size (e.g., using the N parameter in the input file).

Performance will depend on host-device transfer speeds. The amount of host memory used for the matrix data should vary depending on the platform:

  • On PCIe-based systems (e.g., x86), aim to use 6-16 GiB of host memory for matrix data.

  • On NVIDIA Grace Hopper and NVIDIA Grace Blackwell systems, aim to use 50-90 GiB of host memory for matrix data.

A method to estimate the matrix size (N) for this feature is to take the largest per-GPU memory size used with the NVIDIA HPL in-core mode, add the target amount of host data, and then work out the new matrix size from this total.
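
As a rough sketch of this estimate (the numbers below are assumptions for illustration only): with 4 GPUs, about 80 GiB of in-core matrix data per GPU, and a 60 GiB per-GPU host-memory target, the total matrix size is 4 x (80 + 60) GiB, and N follows from N x N x 8 bytes of FP64 data equaling that total:

awk 'BEGIN { gpus = 4; gib_per_gpu = 80 + 60;       # assumed in-core GiB + host GiB per GPU
             bytes = gpus * gib_per_gpu * 2^30;     # total matrix bytes across the node
             printf "N ~= %d\n", sqrt(bytes / 8) }' # N*N*8 bytes of FP64 matrix data

In practice, N is then commonly rounded down to a multiple of the block size NB.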

Environment variables to set up and control the NVIDIA HPL benchmark out-of-core mode:

| Variable | Default Value | Possible Values | Description |
|---|---|---|---|
| HPL_OOC_MODE | 0 | 1 (enable), 0 (disable) | Enables/disables out-of-core mode |
| HPL_OOC_MAX_GPU_MEM | -1 | >=-1 | Limits the amount of GPU memory used for OOC (measured in GiB) |
| HPL_OOC_TILE_M | 4096 | >0 | Row blocking factor |
| HPL_OOC_TILE_N | 4096 | >0 | Column blocking factor |
| HPL_OOC_NUM_STREAMS | 3 | >0 | Number of streams used for OOC operations |
| HPL_OOC_SAFE_SIZE | 3.0 | >0 | GPU memory (in GiB) reserved for the driver; this amount will not be used by HPL OOC |

If the NVIDIA HPL out-of-core mode is enabled, it is highly recommended to pass the CPU, GPU, and memory affinity arguments to hpl.sh.

If you experience GPU out-of-memory issues with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (and therefore not used by HPL OOC) by adjusting the HPL_OOC_SAFE_SIZE environment variable. The default value is 3.0 (the buffer size in GiB); depending on the GPU and driver, you may need to increase it further to resolve memory issues.
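
A hedged example launch for out-of-core mode (the values below are illustrative assumptions), reusing the single-node NVIDIA Grace Hopper example with affinity arguments and a slightly larger driver reservation; /my-dat-files/HPL-ooc.dat stands in for your own input file with the larger N:

HPL_OOC_MODE=1 HPL_OOC_SAFE_SIZE=4.0 \
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat /my-dat-files/HPL-ooc.dat \
    --cpu-affinity 0-71:72-143:144-215:216-287 \
    --mem-affinity 0:1:2:3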

NVIDIA HPL Benchmark with FP64 emulation#

The NVIDIA HPL benchmark supports an FP64 emulation mode [1] on the NVIDIA Blackwell GPU architecture, using the techniques described in [2]. This is an opt-in feature, and the default remains the use of native FP64 computations.

Environment variables to set up and control the NVIDIA HPL Benchmark FP64 emulation mode:

| Variable | Default Value | Possible Values | Description |
|---|---|---|---|
| HPL_EMULATE_DOUBLE_PRECISION | 0 | 1 (enable), 0 (disable) | Enables/disables FP64 emulation mode |
| HPL_DOUBLE_PRECISION_EMULATION_MANTISSA_BIT_COUNT | 53 | >0 | The maximum number of mantissa bits to be used in FP64 emulation [2] (includes the IEEE FP64 standard's implicit bit) |

Notes:

  • The number of slices (INT8 data elements [2]) can be calculated as: nSlices = ceildiv((mantissaBitCount + 1), sizeofBits(INT8)), where the additional bit is used for the sign (+/-) of the value.

  • In the current iteration of the NVIDIA HPL benchmark, FP64 emulation utilizes INT8 data elements and compute resources. This may change in future releases.
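
For example, with the default of 53 mantissa bits, nSlices = ceildiv(53 + 1, 8) = 7 INT8 slices. As an illustrative sketch (not a tuning recommendation), FP64 emulation can be enabled on the launch command line in the same way as the other environment variables; /my-dat-files/HPL.dat stands in for your own input file:

HPL_EMULATE_DOUBLE_PRECISION=1 \
HPL_DOUBLE_PRECISION_EMULATION_MANTISSA_BIT_COUNT=53 \
srun -N 1 --ntasks-per-node=4 --cpu-bind=none --mpi=pmix \
    ./hpl.sh --dat /my-dat-files/HPL.dat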

Running the NVIDIA HPL Benchmarks on NVIDIA Grace CPU only systems#

The NVIDIA HPL benchmark uses the same input format as the standard Netlib HPL benchmark. Please see the Netlib HPL benchmark documentation to get started with HPL software concepts and best practices.

The hpl-aarch64.sh script can be invoked on the command line or through a Slurm batch script to launch the NVIDIA HPL benchmark.

The hpl-aarch64.sh script accepts the following parameters:

  • --dat <string> path to HPL.dat (Required)

  • --gpu-affinity <string> colon separated list of GPU index ranges (Optional)

  • --cpu-affinity <string> colon separated list of CPU index ranges (Optional)

  • --mem-affinity <string> colon separated list of memory indices (Optional)

  • --ucx-affinity <string> colon separated list of UCX devices (Optional)

  • --ucx-tls <string> UCX transport to use (Optional)

  • --exec-name <string> HPL executable file (Optional)

  • --no-multinode enable flags for no-multinode (no-network) execution (Optional)

Notes:

  • It is recommended to bind each MPI process to a NUMA node on the NVIDIA Grace CPU, for example:

    ./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1
    

Examples:

Run the NVIDIA HPL benchmark on two NVIDIA Grace CPU nodes using your custom HPL.dat file:

srun -N 2 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
    ./hpl-aarch64.sh --dat /my-dat-files/HPL.dat --cpu-affinity 0-71:72-143 --mem-affinity 0:1

where --cpu-affinity maps to cores on the local node and --mem-affinity maps to NUMA nodes on the local node.

References#

[1] Hardware Trends Impacting Floating-Point Computations In Scientific Applications

[2] Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit