Getting Started#

Introduction#

cuBLASMp provides GPU-accelerated, PBLAS-like basic linear algebra functionality.
cuBLASMp uses a 2D block-cyclic data layout for load balancing and to maximize compatibility with PBLAS routines.
The library assumes the data is available in device memory. It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime APIs, such as cudaMalloc(), cudaMallocAsync(), and cudaMemcpyAsync().

For cublasMpMatmul, allocating the device workspace with NCCL symmetric memory, or registering a compatible allocation with cublasMpBufferRegister, enables the high-performance AllGather+GEMM, GEMM+ReduceScatter, and GEMM+AllReduce algorithms; without it, the library uses no_overlap (see Memory Management).
Python users can access cuBLASMp through nvmath-python, either via the higher-level Distributed APIs or the low-level nvmath.bindings.cublasMp module.
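The allocation pattern described above can be sketched as follows. This is an illustrative fragment, not a complete program: argument lists for the cuBLASMp calls are abbreviated with /* ... */, and the exact signatures should be taken from the API reference.

```c
// Sketch only: assumes a CUDA-capable device and an initialized stream.
cudaStream_t stream;
cudaStreamCreate(&stream);

// Input/output matrices (A, B, C, D): any CUDA allocator works.
double *d_A;
cudaMallocAsync((void **)&d_A, local_bytes_A, stream);
cudaMemcpyAsync(d_A, h_A, local_bytes_A, cudaMemcpyHostToDevice, stream);

// Device workspace for cublasMpMatmul: use cublasMpMalloc(), or register
// an ncclMemAlloc / CUDA VMM allocation with
// cublasMpBufferRegister(/* ... */), to enable the overlapped
// AG+GEMM, GEMM+RS, and GEMM+AR algorithms. A plain cudaMalloc()
// workspace falls back to the no_overlap algorithm.
```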

Hardware and Software Requirements#

GPU Architectures

  • Compute Capability 7.5 (Turing) and above (CUDA 13)

  • Compute Capability 7.0 (Volta) and above (CUDA 12)

CUDA

  • 13.x

  • 12.x (12.9.0 or above recommended)

CPU Architectures

  • x86_64, arm64-sbsa

Operating Systems

  • Linux

Required Packages

Note

cuBLASMp versions older than 0.8.0 also require NVSHMEM v3.3.24 or above. Starting with cuBLASMp 0.8.0, NVSHMEM is no longer required, as the library uses NCCL Symmetric Memory instead.

Recommended Packages

  • Recommended NVIDIA InfiniBand solutions for accelerated inter-node communication

Data Layout of Local Matrices#

cuBLASMp assumes that local matrices are stored in column-major format.

Workflow#

cuBLASMp’s workflow can be broken down as follows:
1. Create an NCCL communicator: NCCL Initialization.
2. Initialize the library handle: cublasMpCreate().
3. Initialize grid descriptors: cublasMpGridCreate(). The NCCL communicator passed to this call must contain exactly nprow * npcol ranks.
4. Initialize matrix descriptors: cublasMpMatrixDescriptorCreate().
5. Query the host and device buffer sizes for a given routine.
6. Allocate workspace buffers. For cublasMpMatmul, use cublasMpMalloc() or allocate with ncclMemAlloc / CUDA VMM and register the allocation with cublasMpBufferRegister() to enable the communication-computation overlap in the AG+GEMM, GEMM+RS, and GEMM+AR algorithms. Standard cudaMalloc is also accepted, but then the library either uses no_overlap or returns CUBLASMP_STATUS_NOT_SUPPORTED for explicit pipelined algorithm requests. For all other routines, use standard cudaMalloc or any other CUDA memory allocator. The host workspace and input/output matrix buffers (A, B, C, D) always use standard allocators.
7. Execute the routine to perform the desired computation.
8. Synchronize the local stream to make sure the result is available, if required: cudaStreamSynchronize().
9. Deallocate host and device workspace.
10. Destroy matrix descriptors: cublasMpMatrixDescriptorDestroy().
11. Destroy grid descriptors: cublasMpGridDestroy().
12. Destroy cuBLASMp library handle: cublasMpDestroy().
13. Destroy the NCCL communicator: NCCL Initialization.
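The steps above can be sketched as the following call sequence. This is a sketch, not a complete program: argument lists are abbreviated with /* ... */, error checking is omitted, and the exact signatures should be taken from the API reference.

```c
// Sketch of the cuBLASMp workflow; not a complete program.
ncclComm_t comm;                       // 1. communicator over nprow * npcol ranks
/* ncclCommInitRank(...); */

cublasMpHandle_t handle;
cublasMpCreate(/* &handle, stream */);                      // 2. library handle

cublasMpGrid_t grid;
cublasMpGridCreate(/* nprow, npcol, ..., comm, &grid */);   // 3. grid descriptor

cublasMpMatrixDescriptor_t descA;
cublasMpMatrixDescriptorCreate(/* global sizes, block sizes,
                                  grid, &descA */);         // 4. matrix descriptor

/* 5.-6. Query host/device workspace sizes, then allocate workspace
         (for cublasMpMatmul: cublasMpMalloc() or a registered buffer). */

cublasMpMatmul(/* handle, descriptors, buffers, workspace */);  // 7. compute
cudaStreamSynchronize(stream);                                  // 8. synchronize

/* 9. Free host and device workspace, then tear down in reverse order: */
cublasMpMatrixDescriptorDestroy(/* descA */);               // 10.
cublasMpGridDestroy(/* grid */);                            // 11.
cublasMpDestroy(/* handle */);                              // 12.
ncclCommDestroy(comm);                                      // 13.
```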

Code Samples#

Code samples can be found in the CUDALibrarySamples repository.