cuBLASMp: A High-Performance CUDA Library for Distributed Dense Linear Algebra
NVIDIA cuBLASMp is a high-performance, multi-process, GPU-accelerated library for distributed basic dense linear algebra.
cuBLASMp is compatible with 2D block-cyclic data layout and provides PBLAS-like C APIs.
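The 2D block-cyclic layout mentioned above can be illustrated with a small index-mapping sketch. It applies the standard one-dimensional block-cyclic rule, which a 2D layout uses independently for rows and for columns. The function names (`bc_owner`, `bc_local_index`, `bc_local_count`) are hypothetical and are not part of the cuBLASMp API. The sketch assumes zero-based indices and that the first block lives on process 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not cuBLASMp functions.
 * g      : zero-based global row (or column) index
 * nb     : block size along this dimension
 * nprocs : number of processes along this dimension of the grid
 * Assumption: the first block is stored on process 0. */

/* Process coordinate that owns global index g. */
int64_t bc_owner(int64_t g, int64_t nb, int64_t nprocs) {
    assert(g >= 0 && nb > 0 && nprocs > 0);
    return (g / nb) % nprocs;
}

/* Zero-based position of global index g inside its owner's local storage. */
int64_t bc_local_index(int64_t g, int64_t nb, int64_t nprocs) {
    assert(g >= 0 && nb > 0 && nprocs > 0);
    /* Full cycles of nprocs blocks that precede g, times the block size,
       plus the offset of g within its own block. */
    return (g / (nb * nprocs)) * nb + (g % nb);
}

/* Number of rows (or columns) out of n that process p stores locally. */
int64_t bc_local_count(int64_t n, int64_t nb, int64_t p, int64_t nprocs) {
    assert(n >= 0 && nb > 0 && nprocs > 0 && p >= 0 && p < nprocs);
    int64_t full_blocks = n / nb;                  /* complete blocks overall */
    int64_t count = (full_blocks / nprocs) * nb;   /* evenly shared blocks */
    int64_t extra = full_blocks % nprocs;          /* leftover complete blocks */
    if (p < extra) {
        count += nb;        /* p receives one more complete block */
    } else if (p == extra) {
        count += n % nb;    /* p receives the trailing partial block, if any */
    }
    return count;
}
```

For a matrix distributed over a P x Q process grid, the owner of element (i, j) is the process at grid coordinates (`bc_owner(i, mb, P)`, `bc_owner(j, nb, Q)`), where `mb` and `nb` are the row and column block sizes. In a real application the library's own descriptors should be preferred over hand-written mappings like these.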
Download: cuBLASMp is available through the NVIDIA Developer Zone, the NVIDIA HPC SDK, PyPI (CUDA 12, CUDA 13), Conda, and conda-forge.
Python: NVIDIA nvmath-python provides higher-level Distributed APIs for distributed linear algebra, as well as low-level cuBLASMp bindings through nvmath.bindings.cublasMp.
Key Features
Multi-process, multi-GPU.
One process per GPU.
PBLAS-like C functions and interfaces to facilitate porting.
NCCL communication backend.
Logging and tracing.
Tensor-core accelerated.
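The one-process-per-GPU model listed above is commonly set up by having each process pick a device from its rank among the processes on the same node. The following is a minimal sketch of that selection step only. The helper name `device_for_local_rank` is hypothetical and not part of cuBLASMp, and how the node-local rank is obtained (for example from the process launcher) is left as an assumption.

```c
#include <assert.h>

/* Hypothetical helper, not a cuBLASMp function.
 * local_rank   : rank of this process among the processes on its node
 * device_count : number of GPUs visible on the node
 * Returns the device index this process should use, or -1 when the node
 * hosts more processes than GPUs, which would break the
 * one-process-per-GPU model. */
int device_for_local_rank(int local_rank, int device_count) {
    assert(local_rank >= 0 && device_count >= 0);
    if (local_rank >= device_count) {
        return -1;  /* no dedicated GPU is left for this process */
    }
    return local_rank;
}
```

In a CUDA program the returned index would typically be passed to `cudaSetDevice` before any library handle is created.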