Release Notes
v25.04
Added support for FP64 Emulation for HPL
Contains:
NVIDIA NCCL 2.25.1
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 3.2.5
NVIDIA GDR Copy 2.4
NVIDIA NVPL BLAS 25.1 (Arm SBSA only)
NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)
NVIDIA NVPL Sparse 25.1 (Arm SBSA only)
LLVM OpenMP 18.1.1 (Arm SBSA only)
TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
If NVSHMEM is used in the HPL Benchmark and is initialized using a unique ID (UID), the benchmark may hang during a multi-node run. To work around this issue, either initialize NVSHMEM using MPI (export HPL_NVSHMEM_INIT=0) or disable NVSHMEM (export HPL_USE_NVSHMEM=0).
v25.02
Added support for NVIDIA Blackwell GPU architecture (sm100)
Added support for Linux Ubuntu 24.04
Prerequisites
CUDA 12.8 or newer
OpenMPI 4.1 or newer, or MPICH 3.4 or newer
Contains:
NVIDIA NCCL 2.25.1
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 3.2.5
NVIDIA GDR Copy 2.4
NVIDIA NVPL BLAS 24.07 (Arm SBSA only)
NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)
NVIDIA NVPL Sparse 24.07 (Arm SBSA only)
LLVM OpenMP 18.1.1 (Arm SBSA only)
TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
HPCX 2.21 is known to have a long startup time on Blackwell. Enabling the compute cache (export CUDA_CACHE_DISABLE=0) can help reduce this delay.
v24.09
Added support for OpenMPI 4.1 or newer
Added support for Linux Ubuntu 22.04
Prerequisites
CUDA 12.3 or newer
OpenMPI 4.1 or newer, or MPICH 3.4 or newer
Contains:
NVIDIA NCCL 2.22.3
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 2.11
NVIDIA GDR Copy 2.4
NVIDIA NVPL BLAS 24.07 (Arm SBSA only)
NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)
NVIDIA NVPL Sparse 24.07 (Arm SBSA only)
LLVM OpenMP 18.1.1 (Arm SBSA only)
TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
HPL out-of-core (OOC): If you experience GPU out-of-memory issues with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (and not used by HPL OOC). This can be achieved by adjusting the HPL_OOC_SAFE_SIZE environment variable. The default value is 2.0 (the buffer size in GB). Depending on the GPU/driver, you may need to increase this further to resolve memory issues.
HPL-MxP: The input task must satisfy the following condition:
((N / NB) / npcol) / u-panel-chunk-nbs < 20
N - size of the N-by-N matrix
NB - the blocking constant (panel size)
npcol - number of columns in the processor grid
u-panel-chunk-nbs - U panel chunk size given in units of NBs (default 8)
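The HPL-MxP condition above can be checked before launching a run. The following is a minimal sketch, not part of the benchmark package: the function names hpl_mxp_ratio and hpl_mxp_condition_ok are hypothetical, the sizes in the example are made up, and the division is assumed to be ordinary real-number division because the notes do not say whether any quotient is rounded.

```shell
# Hypothetical helpers (not shipped with the benchmarks).
# Arguments: N NB npcol [u-panel-chunk-nbs]; the last one defaults to 8,
# which is the default stated in the release notes.

# Prints ((N / NB) / npcol) / u-panel-chunk-nbs with four decimals.
# Assumption: plain real-number division, no rounding of intermediate steps.
hpl_mxp_ratio() {
    local n="$1" nb="$2" npcol="$3" chunk="${4:-8}"
    awk -v n="$n" -v nb="$nb" -v npcol="$npcol" -v chunk="$chunk" \
        'BEGIN { printf "%.4f\n", ((n / nb) / npcol) / chunk }'
}

# Exit status 0 when the ratio is strictly below 20, non-zero otherwise.
# The comparison uses the unrounded value, not the printed one.
hpl_mxp_condition_ok() {
    local n="$1" nb="$2" npcol="$3" chunk="${4:-8}"
    awk -v n="$n" -v nb="$nb" -v npcol="$npcol" -v chunk="$chunk" \
        'BEGIN { exit !(((n / nb) / npcol) / chunk < 20) }'
}

# Example with made-up sizes: N=65536, NB=1024, npcol=4, default chunk of 8.
hpl_mxp_ratio 65536 1024 4
if hpl_mxp_condition_ok 65536 1024 4; then
    echo "condition satisfied"
else
    echo "condition violated"
fi
```

For these example sizes the quotients are 64, then 16, then 2, so the condition holds. Keeping N, NB, and npcol fixed while lowering u-panel-chunk-nbs raises the ratio, so a smaller chunk size makes the limit easier to hit.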
v24.05
Initial release
Supported CPU Architectures: x86_64, NVIDIA Grace CPU (Arm SBSA)
Supported SM Architectures: NVIDIA Ampere GPU architecture (sm80) and NVIDIA Hopper GPU architecture (sm90)
Supported OS: Linux distributions with glibc >= 2.28 – RHEL 8.8 and SLES 15.5 have been tested.
Supported MPI: Libraries that are ABI-compatible with MPICH (e.g., MPICH, Cray MPICH, MVAPICH)
Prerequisites
CUDA 12.3 or newer
MPICH 3.4 or newer
Contains:
NVIDIA NCCL 2.21.5
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 2.11
NVIDIA GDR Copy 2.4
NVIDIA NVPL BLAS 24.03 (Arm SBSA only)
NVIDIA NVPL LAPACK 24.03 (Arm SBSA only)
LLVM OpenMP 18.1.1 (Arm SBSA only)
TCMalloc 4.5.3 (Arm SBSA only)