Release Notes

v25.09

Added support for CUDA 13 on devices with Compute Capability 8.0 (Ampere) and above.
Added FP4 support for the NVIDIA HPL-MxP Benchmark.
The NVIDIA HPC Benchmarks package v25.09 includes microbenchmarks designed to assess system readiness before running large-scale benchmarks:
- NCCL tests
- NVSHMEM performance tests
- OSU MPI benchmark
- GEMM (matrix-matrix multiplication) benchmark
Contains:
- NVIDIA NCCL 2.27.7
- AWS OFI NCCL 1.6.0
- NVIDIA NVSHMEM 3.4.5
- NVIDIA NVPL BLAS 25.1 (Arm SBSA only)
- NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)
- NVIDIA NVPL Sparse 25.1 (Arm SBSA only)
Known issues:
- Performance of the NVIDIA HPL-MxP Benchmark depends heavily on the NVIDIA cuBLAS library. The cuBLAS library from CUDA Toolkit 13 Update 1 or newer substantially improves FP4 GEMM performance.
- Performance of the NVIDIA HPCG Benchmark depends heavily on the NVIDIA cuSPARSE library. The cuSPARSE library from CUDA Toolkit 13 Update 1 or newer improves HPCG Benchmark performance.
- NVSHMEM 3.4.5 has a known issue with MPICH when running HPL, so NVSHMEM is disabled by default in both the MPICH x86 and SBSA releases. To enable it, add export HPL_USE_NVSHMEM=1 to the hpc-benchmarks-gpu-env.sh script and use an older NVSHMEM version (e.g., 3.2.5).

Added support for FP64 Emulation for HPL
Contains:
- NVIDIA NCCL 2.25.1
- AWS OFI NCCL 1.6.0
- NVIDIA NVSHMEM 3.2.5
- NVIDIA GDR Copy 2.4
- NVIDIA NVPL BLAS 25.1 (Arm SBSA only)
- NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)
- NVIDIA NVPL Sparse 25.1 (Arm SBSA only)
- LLVM OpenMP 18.1.1 (Arm SBSA only)
- TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
- If NVSHMEM is used in the HPL Benchmark and is initialized using a unique ID (UID), the benchmark may hang during a multi-node run. To work around this issue, either initialize NVSHMEM using MPI (export HPL_NVSHMEM_INIT=0) or disable NVSHMEM (export HPL_USE_NVSHMEM=0).
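
The two workarounds above could be wrapped in a small job-script helper, as in the sketch below. The function name hpl_nvshmem_workaround and its mode names are hypothetical; only the two environment variables and their values come from these notes.

```shell
# Sketch: pick one of the two documented workarounds for the multi-node hang.
# hpl_nvshmem_workaround and the mode names are illustrative, not part of the package.
hpl_nvshmem_workaround() {
    case "$1" in
        mpi-init)
            # Keep NVSHMEM enabled, but initialize it through MPI instead of a UID.
            export HPL_NVSHMEM_INIT=0
            ;;
        disable)
            # Turn NVSHMEM off for HPL entirely.
            export HPL_USE_NVSHMEM=0
            ;;
        *)
            echo "usage: hpl_nvshmem_workaround mpi-init|disable" >&2
            return 1
            ;;
    esac
}

# Apply the first workaround before launching the benchmark.
hpl_nvshmem_workaround mpi-init
```

Either variable must be set in the environment that each MPI rank sees, so the call belongs in the script that launches the benchmark rather than in an interactive shell on one node.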

Added support for NVIDIA Blackwell GPU architecture (sm100)
Added support for Linux Ubuntu 24.04
Prerequisites
- CUDA 12.8 or newer
- OpenMPI 4.1 or newer, or MPICH 3.4 or newer
Contains:
- NVIDIA NCCL 2.25.1
- AWS OFI NCCL 1.6.0
- NVIDIA NVSHMEM 3.2.5
- NVIDIA GDR Copy 2.4
- NVIDIA NVPL BLAS 24.07 (Arm SBSA only)
- NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)
- NVIDIA NVPL Sparse 24.07 (Arm SBSA only)
- LLVM OpenMP 18.1.1 (Arm SBSA only)
- TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
- HPCX 2.21 is known to have a long startup time on Blackwell. Enabling the compute cache (export CUDA_CACHE_DISABLE=0) can help reduce this delay.
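
As a minimal sketch, a launch script might make sure the compute cache is enabled before the benchmark starts. The helper name ensure_cuda_cache_enabled is hypothetical; CUDA_CACHE_DISABLE=0 is the setting given in the note above.

```shell
# Sketch: enable the CUDA compute cache before launching on Blackwell.
# ensure_cuda_cache_enabled is an illustrative name, not part of the package.
ensure_cuda_cache_enabled() {
    # Warn if the caller's environment had explicitly disabled the cache.
    if [ "${CUDA_CACHE_DISABLE:-0}" != "0" ]; then
        echo "CUDA_CACHE_DISABLE was '${CUDA_CACHE_DISABLE}'; enabling the compute cache" >&2
    fi
    export CUDA_CACHE_DISABLE=0
}

ensure_cuda_cache_enabled
```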

Added support for OpenMPI 4.1 or newer
Added support for Linux Ubuntu 22.04
Prerequisites
- CUDA 12.3 or newer
- OpenMPI 4.1 or newer, or MPICH 3.4 or newer
Contains:
- NVIDIA NCCL 2.22.3
- AWS OFI NCCL 1.6.0
- NVIDIA NVSHMEM 2.11
- NVIDIA GDR Copy 2.4
- NVIDIA NVPL BLAS 24.07 (Arm SBSA only)
- NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)
- NVIDIA NVPL Sparse 24.07 (Arm SBSA only)
- LLVM OpenMP 18.1.1 (Arm SBSA only)
- TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
- HPL out-of-core (OOC): If you encounter GPU out-of-memory errors with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (not used by HPL OOC) by adjusting the HPL_OOC_SAFE_SIZE environment variable. The default value is 2.0 (the buffer size in GB). Depending on the GPU and driver, you may need a larger value to resolve memory issues.
- HPL-MxP: The input task must satisfy the following condition:
  ((N / NB) / npcol) / u-panel-chunk-nbs < 20
  where:
  N - size of the N-by-N matrix
  NB - the blocking constant (panel size)
  npcol - number of columns in the processor grid
  u-panel-chunk-nbs - U panel chunk size given in units of NB (default 8)
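
The HPL-MxP condition above can be checked before a job is submitted. The sketch below assumes the divisions are real-valued, which lets the test be rewritten as N < 20 * NB * npcol * u-panel-chunk-nbs and evaluated in integer shell arithmetic; if the benchmark uses integer division internally, inputs close to the limit may be treated differently. The function name hpl_mxp_input_ok and the sample values are hypothetical.

```shell
# Sketch: pre-flight check of ((N / NB) / npcol) / u-panel-chunk-nbs < 20.
# Assumes real-valued division, so the condition is cross-multiplied to
# N < 20 * NB * npcol * u-panel-chunk-nbs. hpl_mxp_input_ok is an illustrative name.
hpl_mxp_input_ok() {
    # $1 = N, $2 = NB, $3 = npcol, $4 = u-panel-chunk-nbs (defaults to 8)
    [ "$1" -lt $((20 * $2 * $3 * ${4:-8})) ]
}

# Hypothetical input: N=100000, NB=1024, npcol=4, default chunk size.
if hpl_mxp_input_ok 100000 1024 4; then
    echo "HPL-MxP input satisfies the condition"
else
    echo "HPL-MxP input violates the condition" >&2
fi
```

When the check fails, the inequality shows which parameters can bring the input back within the limit: a larger NB, npcol, or u-panel-chunk-nbs, or a smaller N.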

Initial release
Supported CPU Architectures: x86_64, NVIDIA Grace CPU (Arm SBSA)
Supported SM Architectures: NVIDIA Ampere GPU architecture (sm80) and NVIDIA Hopper GPU architecture (sm90)
Supported OS: Linux distributions with glibc >= 2.28 – RHEL 8.8 and SLES 15.5 have been tested.
Supported MPI: Libraries that are ABI-compatible with MPICH (e.g., MPICH, Cray MPICH, MVAPICH, etc.)
Prerequisites
- CUDA 12.3 or newer
- MPICH 3.4 or newer
Contains:
- NVIDIA NCCL 2.21.5
- AWS OFI NCCL 1.6.0
- NVIDIA NVSHMEM 2.11
- NVIDIA GDR Copy 2.4
- NVIDIA NVPL BLAS 24.03 (Arm SBSA only)
- NVIDIA NVPL LAPACK 24.03 (Arm SBSA only)
- LLVM OpenMP 18.1.1 (Arm SBSA only)
- TCMalloc 4.5.3 (Arm SBSA only)