Release Notes

v25.04

  • Added support for FP64 Emulation for HPL

  • Contains:

    • NVIDIA NCCL 2.25.1

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 3.2.5

    • NVIDIA GDR Copy 2.4

    • NVIDIA NVPL BLAS 25.1 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)

    • NVIDIA NVPL Sparse 25.1 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

    • TCMalloc 4.5.3 (Arm SBSA only)

  • Known issues:

    • If NVSHMEM is used in the HPL benchmark and is initialized using a unique ID (UID), the benchmark may hang during a multi-node run. To work around this issue, initialize NVSHMEM using MPI (export HPL_NVSHMEM_INIT=0) or disable NVSHMEM (export HPL_USE_NVSHMEM=0).
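
  A minimal sketch of the two workarounds, set in the job environment before launching the benchmark (both variables are quoted from the note above):

      # Option 1: keep NVSHMEM but initialize it through MPI instead of a UID
      export HPL_NVSHMEM_INIT=0

      # Option 2: disable NVSHMEM entirely
      export HPL_USE_NVSHMEM=0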

v25.02

  • Added support for NVIDIA Blackwell GPU architecture (sm100)

  • Added support for Linux Ubuntu 24.04

  • Prerequisites:

    • CUDA 12.8 or newer

    • OpenMPI 4.1 or newer, or MPICH 3.4 or newer

  • Contains:

    • NVIDIA NCCL 2.25.1

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 3.2.5

    • NVIDIA GDR Copy 2.4

    • NVIDIA NVPL BLAS 24.07 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)

    • NVIDIA NVPL Sparse 24.07 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

    • TCMalloc 4.5.3 (Arm SBSA only)

  • Known issues:

    • HPC-X 2.21 is known to have a long startup time on Blackwell GPUs. Enabling the compute cache (export CUDA_CACHE_DISABLE=0) can help reduce this delay.
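
  For example, enabling the compute cache for a job might look like the following; CUDA_CACHE_DISABLE comes from the note above, while CUDA_CACHE_PATH and CUDA_CACHE_MAXSIZE are standard CUDA environment variables included here as optional extras with placeholder values:

      # Enable the CUDA compute (JIT) cache for this run
      export CUDA_CACHE_DISABLE=0

      # Optional: relocate and enlarge the cache (path and size are examples only)
      export CUDA_CACHE_PATH=$HOME/.nv/ComputeCache
      export CUDA_CACHE_MAXSIZE=$((4 * 1024 * 1024 * 1024))   # 4 GiB, in bytes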

v24.09

  • Added support for OpenMPI 4.1 or newer

  • Added support for Linux Ubuntu 22.04

  • Prerequisites:

    • CUDA 12.3 or newer

    • OpenMPI 4.1 or newer, or MPICH 3.4 or newer

  • Contains:

    • NVIDIA NCCL 2.22.3

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 2.11

    • NVIDIA GDR Copy 2.4

    • NVIDIA NVPL BLAS 24.07 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)

    • NVIDIA NVPL Sparse 24.07 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

    • TCMalloc 4.5.3 (Arm SBSA only)

  • Known issues:

    • HPL out-of-core (OOC): If you encounter GPU out-of-memory issues with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (memory that HPL OOC will not use). This is controlled by the HPL_OOC_SAFE_SIZE environment variable, which sets the size of the reserved buffer in GB (default 2.0). Depending on the GPU and driver, you may need to raise this value further to resolve the memory issues (a tuning sketch appears after this list).

    • HPL-MxP: The input task must satisfy the following condition (a quick check of this condition is sketched after the definitions below):

      ((N / NB) / npcol) / u-panel-chunk-nbs < 20
      
      • N - the size of the N-by-N matrix

      • NB - the blocking factor (panel size)

      • npcol - the number of columns in the process grid

      • u-panel-chunk-nbs - the U panel chunk size, in units of NB (default 8)
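
  A quick shell check of the HPL-MxP condition; the four values are hypothetical placeholders (substitute your own run parameters), and shell arithmetic is integer division:

      # ((N / NB) / npcol) / u-panel-chunk-nbs must be < 20
      N=1000000; NB=1024; NPCOL=8; CHUNK_NBS=8
      echo $(( ((N / NB) / NPCOL) / CHUNK_NBS ))   # prints 15, so the condition holds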
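
  And a minimal sketch of the HPL OOC tuning step described above; the value 4.0 is an arbitrary example, not a recommendation:

      # Reserve more GPU memory for the driver (buffer size in GB, default 2.0);
      # increase gradually until the out-of-memory errors stop
      export HPL_OOC_SAFE_SIZE=4.0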

v24.05

  • Initial release

  • Supported CPU Architectures: x86_64, NVIDIA Grace CPU (Arm SBSA)

  • Supported SM Architectures: NVIDIA Ampere GPU architecture (sm80) and NVIDIA Hopper GPU architecture (sm90)

  • Supported OS: Linux distributions with glibc >= 2.28 – RHEL 8.8 and SLES 15.5 have been tested.

  • Supported MPI: Libraries that are ABI-compatible with MPICH (e.g., MPICH, Cray MPICH, MVAPICH)

  • Prerequisites:

    • CUDA 12.3 or newer

    • MPICH 3.4 or newer

  • Contains:

    • NVIDIA NCCL 2.21.5

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 2.11

    • NVIDIA GDR Copy 2.4

    • NVIDIA NVPL BLAS 24.03 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 24.03 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

    • TCMalloc 4.5.3 (Arm SBSA only)