Release Notes
v25.09
Added support for CUDA 13 on devices with Compute Capability 8.0 (Ampere) and above.
FP4 support added for the NVIDIA HPL-MxP Benchmark.
The NVIDIA HPC Benchmarks package v25.09 includes microbenchmarks designed to assess system readiness before running large-scale benchmarks.
GEMM (matrix-matrix multiplication) benchmark
Contains:
NVIDIA NCCL 2.27.7
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 3.4.5
NVIDIA NVPL BLAS 25.1 (Arm SBSA only)
NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)
NVIDIA NVPL Sparse 25.1 (Arm SBSA only)
Known issues:
Performance of the NVIDIA HPL-MxP Benchmark depends heavily on the NVIDIA cuBLAS library. The NVIDIA cuBLAS library from CUDA Toolkit 13 Update 1 or newer substantially improves the performance of FP4 GEMM.
Performance of the NVIDIA HPCG Benchmark depends heavily on the NVIDIA cuSPARSE library. The NVIDIA cuSPARSE library from CUDA Toolkit 13 Update 1 or newer improves the performance of the HPCG Benchmark.
NVSHMEM 3.4.5 has a known issue with MPICH when running HPL, so it is disabled by default in both the MPICH x86 and SBSA releases. To enable it, add export HPL_USE_NVSHMEM=1 inside the hpc-benchmarks-gpu-env.sh script and use an older NVSHMEM version (e.g., 3.2.5).
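The opt-in described in the last known issue can be sketched as follows. This is a minimal sketch, assuming hpc-benchmarks-gpu-env.sh is a shell script that is sourced before the benchmark starts. Where that script lives in a given installation, and how the older NVSHMEM library is supplied, are not covered by these notes.

```shell
# Assumed placement: inside hpc-benchmarks-gpu-env.sh.
# Re-enables NVSHMEM for HPL in the MPICH builds, where it is off by default.
# Pair this with an older NVSHMEM version (e.g., 3.2.5), as the note states.
export HPL_USE_NVSHMEM=1
```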
v25.04
Added support for FP64 Emulation for HPL
Contains:
NVIDIA NCCL 2.25.1
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 3.2.5
NVIDIA GDR Copy 2.4
NVIDIA NVPL BLAS 25.1 (Arm SBSA only)
NVIDIA NVPL LAPACK 25.1 (Arm SBSA only)
NVIDIA NVPL Sparse 25.1 (Arm SBSA only)
LLVM OpenMP 18.1.1 (Arm SBSA only)
TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
If NVSHMEM is used in the HPL Benchmark and is initialized using a unique ID (UID), the benchmark may hang during a multi-node run. To work around this issue, either initialize NVSHMEM using MPI (export HPL_NVSHMEM_INIT=0) or disable NVSHMEM (export HPL_USE_NVSHMEM=0).
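The two workarounds for the multi-node hang are alternatives, so only one of them is needed. A minimal sketch, assuming the variables are exported in the job script or environment script before HPL is launched (the notes do not say where to set them):

```shell
# Option 1: keep NVSHMEM, but initialize it through MPI instead of a unique ID.
export HPL_NVSHMEM_INIT=0

# Option 2 (alternative): turn NVSHMEM off for HPL entirely.
# Uncomment this line instead of using option 1.
# export HPL_USE_NVSHMEM=0
```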
v25.02
Added support for NVIDIA Blackwell GPU architecture (sm100)
Added support for Linux Ubuntu 24.04
Prerequisites
CUDA 12.8 or newer
OpenMPI 4.1 or newer, or MPICH 3.4 or newer
Contains:
NVIDIA NCCL 2.25.1
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 3.2.5
NVIDIA GDR Copy 2.4
NVIDIA NVPL BLAS 24.07 (Arm SBSA only)
NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)
NVIDIA NVPL Sparse 24.07 (Arm SBSA only)
LLVM OpenMP 18.1.1 (Arm SBSA only)
TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
HPCX 2.21 is known to have a long startup time on Blackwell. Enabling the compute cache (export CUDA_CACHE_DISABLE=0) can help reduce this delay.
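A minimal sketch of enabling the compute cache, assuming the setting is made in the job script before the benchmark is launched. The warning message is illustrative only; it flags the case where the environment had explicitly disabled the cache.

```shell
# Warn if the surrounding environment had turned the compute cache off.
if [ "${CUDA_CACHE_DISABLE:-0}" = "1" ]; then
    echo "CUDA_CACHE_DISABLE was set to 1; overriding to 0" >&2
fi

# A value of 0 leaves the CUDA compute cache enabled.
export CUDA_CACHE_DISABLE=0
```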
v24.09
Added support for OpenMPI 4.1 or newer
Added support for Linux Ubuntu 22.04
Prerequisites
CUDA 12.3 or newer
OpenMPI 4.1 or newer, or MPICH 3.4 or newer
Contains:
NVIDIA NCCL 2.22.3
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 2.11
NVIDIA GDR Copy 2.4
NVIDIA NVPL BLAS 24.07 (Arm SBSA only)
NVIDIA NVPL LAPACK 24.07 (Arm SBSA only)
NVIDIA NVPL Sparse 24.07 (Arm SBSA only)
LLVM OpenMP 18.1.1 (Arm SBSA only)
TCMalloc 4.5.3 (Arm SBSA only)
Known issues:
HPL out-of-core (OOC): If you experience GPU out-of-memory issues with HPL OOC, consider increasing the amount of GPU memory reserved for the driver (not used by HPL OOC). This can be achieved by adjusting the HPL_OOC_SAFE_SIZE environment variable. The default value is 2.0 (the buffer size in GB). Depending on the GPU and driver, you may need to increase this further to resolve memory issues.
HPL-MxP: The input task must satisfy the following condition:
((N / NB) / npcol) / u-panel-chunk-nbs < 20
N - size of the N-by-N matrix
NB - the blocking constant (panel size)
npcol - number of columns in the processor grid
u-panel-chunk-nbs - U panel chunk size given in units of NB (default 8)
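The HPL-MxP input condition can be checked before submitting a run. The helper below is a hypothetical sketch and not part of the package; the function name `mxp_input_ok` and the sample values are assumptions made for illustration. For positive integers, the condition is equivalent to N < 20 * NB * npcol * u-panel-chunk-nbs, whether the divisions are read as exact or as integer (floor) divisions, so the sketch avoids division altogether.

```shell
# Hypothetical helper: succeeds when
#   ((N / NB) / npcol) / u-panel-chunk-nbs < 20
# Usage: mxp_input_ok N NB npcol [u_panel_chunk_nbs]
mxp_input_ok() {
    n=$1
    nb=$2
    npcol=$3
    chunk=${4:-8}   # 8 is the documented default for u-panel-chunk-nbs
    # Equivalent multiplication form of the condition for positive integers.
    [ "$n" -lt $((20 * nb * npcol * chunk)) ]
}

# Illustrative values only; they are not taken from the release notes.
if mxp_input_ok 380000 2048 4; then
    echo "input satisfies the HPL-MxP condition"
else
    echo "input violates the HPL-MxP condition"
fi
```

If the check fails, the condition itself suggests the available levers: a smaller N, or a larger NB, npcol, or u-panel-chunk-nbs.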
v24.05
Initial release
Supported CPU Architectures: x86_64, NVIDIA Grace CPU (Arm SBSA)
Supported SM Architectures: NVIDIA Ampere GPU architecture (sm80) and NVIDIA Hopper GPU architecture (sm90)
Supported OS: Linux distributions with glibc >= 2.28 – RHEL 8.8 and SLES 15.5 have been tested.
Supported MPI: Libraries that are ABI-compatible with MPICH (e.g., MPICH, Cray MPICH, MVAPICH, etc.)
Prerequisites
CUDA 12.3 or newer
MPICH 3.4 or newer
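The version floors listed for this release (glibc >= 2.28, CUDA 12.3 or newer) can be checked with a short script. This is a sketch under stated assumptions: `version_ge` is a hypothetical helper, `sort -V` requires GNU coreutils, and `nvcc --version` reports the installed CUDA Toolkit version, which may differ from what the driver supports. The MPICH check is left out.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# glibc: on glibc systems, `getconf GNU_LIBC_VERSION` prints e.g. "glibc 2.28".
glibc_ver=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')
if [ -n "$glibc_ver" ] && version_ge "$glibc_ver" 2.28; then
    echo "glibc $glibc_ver: meets the 2.28 minimum"
else
    echo "glibc ${glibc_ver:-(not detected)}: below 2.28 or not detected"
fi

# CUDA: parse "release X.Y" from `nvcc --version`, if nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    cuda_ver=$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')
    if version_ge "$cuda_ver" 12.3; then
        echo "CUDA $cuda_ver: meets the 12.3 minimum"
    else
        echo "CUDA ${cuda_ver:-(unknown)}: below 12.3"
    fi
else
    echo "nvcc not found on PATH; CUDA version not checked"
fi
```

Later releases raise the CUDA floor (for example, 12.8 for v25.02), so the second argument to `version_ge` would change accordingly.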
Contains:
NVIDIA NCCL 2.21.5
AWS OFI NCCL 1.6.0
NVIDIA NVSHMEM 2.11
NVIDIA GDR Copy 2.4
NVIDIA NVPL BLAS 24.03 (Arm SBSA only)
NVIDIA NVPL LAPACK 24.03 (Arm SBSA only)
LLVM OpenMP 18.1.1 (Arm SBSA only)
TCMalloc 4.5.3 (Arm SBSA only)