Release Notes

v26.02

  • Added support for FP64-emulated DGEMM using the Ozaki-II scheme in the NVIDIA HPL Benchmark.

  • Added support for (G)B300 (sm103).

    • The NVIDIA HPL Benchmark does not support FP64 emulation on (G)B300 (sm103).

    • (G)B300 (sm103) has low native FP64 throughput; therefore, the performance of the NVIDIA HPL Benchmark on this hardware is expected to be low.

  • AWS OFI NCCL 1.6.0 has been removed from the NVIDIA HPC Benchmarks package. Users are encouraged to select the AWS OFI NCCL version that matches their system configuration. For AWS OFI NCCL installation and configuration instructions, refer to HewlettPackard/shs-ccl-docs.

  • Contains:

    • NVIDIA NCCL 2.29.2

    • NVIDIA NVSHMEM 3.5.19

    • NVIDIA NVPL BLAS 25.1 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)

    • NVIDIA NVPL Sparse 25.1 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

  • Known issues:

    • The NVIDIA HPL Benchmark shows a known performance degradation on Grace CPUs with the latest NVIDIA HPC-X. To work around this issue, NVIDIA HPC-X 2.18 is recommended.

    • A known performance issue in HPC-X 2.25 affects the MPI_Alltoall, MPI_Bcast, and MPI_Allgather operations and may degrade HPL and HPL-MxP performance. Refer to the HPC-X 2.25 release notes for workarounds.

    • NVIDIA NVSHMEM has a known issue with MPICH when running HPL, so it is disabled by default in both the MPICH x86 and SBSA releases. To enable it, add export HPL_USE_NVSHMEM=1 to the hpc-benchmarks-gpu-env.sh script and use an older NVSHMEM version (e.g., 3.2.5).
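As a minimal sketch, assuming hpc-benchmarks-gpu-env.sh is a bash script that is sourced before the benchmark is launched, the line to add looks like this. Pointing the benchmark at an older NVSHMEM build (e.g., 3.2.5) is a separate, site-specific step that is not shown here.

```shell
# Excerpt for hpc-benchmarks-gpu-env.sh (assumed to be a sourced bash script).
# Re-enable NVSHMEM for HPL on MPICH builds, where it is disabled by default.
# Only do this together with an older NVSHMEM version (e.g., 3.2.5).
export HPL_USE_NVSHMEM=1
```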

v25.09

  • Added support for CUDA 13 on devices with Compute Capability 8.0 (Ampere) and above.

  • Added FP4 support to the NVIDIA HPL-MxP Benchmark.

  • The NVIDIA HPC Benchmarks package v25.09 includes microbenchmarks designed to assess system readiness before running large-scale benchmarks.

  • Contains:

    • NVIDIA NCCL 2.27.7

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 3.4.5

    • NVIDIA NVPL BLAS 25.1 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)

    • NVIDIA NVPL Sparse 25.1 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

  • Known issues:

    • Performance of the NVIDIA HPL-MxP Benchmark depends heavily on the NVIDIA cuBLAS library. The cuBLAS library from CUDA Toolkit 13 Update 1 or newer substantially improves FP4 GEMM performance.

    • Performance of the NVIDIA HPCG Benchmark depends heavily on the NVIDIA cuSPARSE library. The cuSPARSE library from CUDA Toolkit 13 Update 1 or newer improves HPCG Benchmark performance.

    • NVIDIA NVSHMEM 3.4.5 has a known issue with MPICH when running HPL, so it is disabled by default in both the MPICH x86 and SBSA releases. To enable it, add export HPL_USE_NVSHMEM=1 to the hpc-benchmarks-gpu-env.sh script and use an older NVSHMEM version (e.g., 3.2.5).

v25.04

  • Added support for FP64 emulation in the NVIDIA HPL Benchmark

  • Contains:

    • NVIDIA NCCL 2.25.1

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 3.2.5

    • NVIDIA GDR Copy 2.4

    • NVIDIA NVPL BLAS 25.1 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)

    • NVIDIA NVPL Sparse 25.1 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

    • TCMalloc 4.5.3 (Arm SBSA only)

  • Known issues:

    • If NVIDIA NVSHMEM is used in the HPL Benchmark and is initialized with a unique ID (UID), the benchmark may hang during multi-node runs. To work around this issue, either initialize NVSHMEM through MPI (export HPL_NVSHMEM_INIT=0) or disable NVSHMEM (export HPL_USE_NVSHMEM=0).
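The two workarounds correspond to the environment settings below. This is a sketch assuming the variables are exported in the job environment (for example, in hpc-benchmarks-gpu-env.sh) before HPL starts; choose one option, not both.

```shell
# Option 1: keep NVSHMEM, but initialize it through MPI instead of a UID.
export HPL_NVSHMEM_INIT=0

# Option 2: disable NVSHMEM in HPL entirely (uncomment to use instead).
# export HPL_USE_NVSHMEM=0
```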

v25.02

  • Added support for NVIDIA Blackwell GPU architecture (sm100)

  • Added support for Linux Ubuntu 24.04

  • Prerequisites

    • CUDA 12.8 or newer

    • OpenMPI 4.1 or newer, or MPICH 3.4 or newer

  • Contains:

    • NVIDIA NCCL 2.25.1

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 3.2.5

    • NVIDIA GDR Copy 2.4

    • NVIDIA NVPL BLAS 24.07 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)

    • NVIDIA NVPL Sparse 24.07 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

    • TCMalloc 4.5.3 (Arm SBSA only)

  • Known issues:

    • HPC-X 2.21 is known to have a long startup time on Blackwell. Enabling the CUDA compute cache (export CUDA_CACHE_DISABLE=0) can help reduce this delay.

v24.09

  • Added support for OpenMPI 4.1 or newer

  • Added support for Linux Ubuntu 22.04

  • Prerequisites

    • CUDA 12.3 or newer

    • OpenMPI 4.1 or newer, or MPICH 3.4 or newer

  • Contains:

    • NVIDIA NCCL 2.22.3

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 2.11

    • NVIDIA GDR Copy 2.4

    • NVIDIA NVPL BLAS 24.07 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)

    • NVIDIA NVPL Sparse 24.07 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

    • TCMalloc 4.5.3 (Arm SBSA only)

  • Known issues:

    • HPL out-of-core (OOC): If you encounter GPU out-of-memory errors with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (memory that HPL OOC does not use). You can do this by raising the HPL_OOC_SAFE_SIZE environment variable. The default value is 2.0 (buffer size in GB). Depending on the GPU and driver, you may need to increase it further to resolve the memory errors.

    • HPL-MxP: The input problem must satisfy the following condition:

      ((N / NB) / npcol) / u-panel-chunk-nbs < 20
      
      • N - size of N-by-N matrix

      • NB - the blocking constant (panel size)

      • npcol - number of columns in the processor grid

      • u-panel-chunk-nbs - U panel chunk size in units of NB (default 8)
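The HPL-MxP condition can be checked before submitting a job. The function below is a hypothetical pre-flight helper, not part of the package; the name check_mxp_input and its argument order are assumptions made for illustration. Because N, NB, npcol, and u-panel-chunk-nbs are positive integers, nested bash integer (floor) division gives the same pass/fail result as exact division: floor(x) < 20 holds exactly when x < 20.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not shipped with the NVIDIA HPC Benchmarks).
# Checks the HPL-MxP input condition:
#   ((N / NB) / npcol) / u-panel-chunk-nbs < 20
# Usage: check_mxp_input N NB npcol [u-panel-chunk-nbs]
check_mxp_input() {
  local n=$1 nb=$2 npcol=$3 chunk=${4:-8}   # u-panel-chunk-nbs defaults to 8

  # Reject zero, negative, or missing arguments (avoids division by zero).
  if (( n <= 0 || nb <= 0 || npcol <= 0 || chunk <= 0 )); then
    echo "error: all arguments must be positive integers" >&2
    return 2
  fi

  # Nested integer division; equivalent to exact division for a "< 20" test.
  if (( (n / nb) / npcol / chunk < 20 )); then
    echo "ok"
  else
    echo "invalid"
    return 1
  fi
}
```

For example, check_mxp_input 100000 1024 4 prints ok (the ratio is about 3.05), while check_mxp_input 160 1 1 8 prints invalid because the ratio is exactly 20.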

v24.05

  • Initial release

  • Supported CPU Architectures: x86_64, NVIDIA Grace CPU (Arm SBSA)

  • Supported SM Architectures: NVIDIA Ampere GPU architecture (sm80) and NVIDIA Hopper GPU architecture (sm90)

  • Supported OS: Linux distributions with glibc 2.28 or newer (tested on RHEL 8.8 and SLES 15.5)

  • Supported MPI: Libraries that are ABI-compatible with MPICH (e.g., MPICH, Cray MPICH, MVAPICH)

  • Prerequisites

    • CUDA 12.3 or newer

    • MPICH 3.4 or newer

  • Contains:

    • NVIDIA NCCL 2.21.5

    • AWS OFI NCCL 1.6.0

    • NVIDIA NVSHMEM 2.11

    • NVIDIA GDR Copy 2.4

    • NVIDIA NVPL BLAS 24.03 (Arm SBSA only)

    • NVIDIA NVPL LAPACK 24.03 (Arm SBSA only)

    • LLVM OpenMP 18.1.1 (Arm SBSA only)

    • TCMalloc 4.5.3 (Arm SBSA only)