Getting Started#
Introduction#
cuBLASMp interoperates with standard CUDA runtime APIs such as cudaMalloc(), cudaMallocAsync(), and cudaMemcpyAsync(). For cublasMpMatmul, allocating the device workspace with NCCL symmetric memory, or registering a compatible allocation with cublasMpBufferRegister(), enables the high-performance AllGather+GEMM, GEMM+ReduceScatter, and GEMM+AllReduce algorithms; without it, the library falls back to no_overlap (see Memory Management).
Hardware and Software Requirements#
GPU Architectures: Compute Capability 7.5 (Turing) and above (CUDA 13); Compute Capability 7.0 (Volta) and above (CUDA 12)
CUDA: 13.x; 12.9.0 or above recommended, 12.x compatible
CPU Architectures: x86_64, arm64-sbsa
Operating Systems: Linux
NVIDIA InfiniBand solutions are recommended for accelerated inter-node communication.
Required Packages
NCCL v2.29.2 and above
Note
cuBLASMp versions older than 0.8.0 also require NVSHMEM v3.3.24 and above. Starting from cuBLASMp 0.8.0, NVSHMEM is no longer required as the library uses NCCL Symmetric Memory instead.
Recommended Packages
GDRCopy v2.0+ (NVIDIA/gdrcopy) and nv_peer_mem (Mellanox/nv_peer_memory) - Allows underlying communication packages to use GPUDirect RDMA.
Mellanox OFED (https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) - drivers for NVIDIA InfiniBand Adapters.
Data Layout of Local Matrices#
Workflow#
1. Create an NCCL communicator: NCCL Initialization.
2. Initialize the library handle: cublasMpCreate().
3. Initialize grid descriptors: cublasMpGridCreate(). The NCCL communicator passed to this call must contain exactly nprow * npcol ranks.
4. Initialize matrix descriptors: cublasMpMatrixDescriptorCreate().
5. Query the host and device buffer sizes for a given routine.
6. Allocate workspace buffers. For cublasMpMatmul, use cublasMpMalloc() or allocate with ncclMemAlloc / CUDA VMM and register the allocation with cublasMpBufferRegister() to enable communication-computation overlap in the AG+GEMM, GEMM+RS, and GEMM+AR algorithms. Standard cudaMalloc is also accepted, but then the library either uses no_overlap or returns CUBLASMP_STATUS_NOT_SUPPORTED for explicit pipelined algorithm requests. For all other routines, use standard cudaMalloc or any other CUDA memory allocator. The host workspace and input/output matrix buffers (A, B, C, D) always use standard allocators.
7. Execute the routine to perform the desired computation.
8. Synchronize the local stream to make sure the result is available, if required: cudaStreamSynchronize().
9. Deallocate the host and device workspace.
10. Destroy matrix descriptors: cublasMpMatrixDescriptorDestroy().
11. Destroy grid descriptors: cublasMpGridDestroy().
12. Destroy the cuBLASMp library handle: cublasMpDestroy().
13. Destroy the NCCL communicator: NCCL Initialization.
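The steps above can be sketched as the skeleton below. This is an illustrative outline, not a complete program: argument lists are abbreviated with comments, error checking is omitted, and exact signatures should be taken from the cuBLASMp API reference.

```c
/* Illustrative skeleton of the workflow (arguments abbreviated, error
 * handling omitted; consult the API reference for exact signatures). */
ncclComm_t comm;                         /* 1. from NCCL initialization */
cublasMpHandle_t handle;
cublasMpCreate(&handle, /* stream */);                       /* step 2  */
cublasMpGrid_t grid;
cublasMpGridCreate(/* nprow, npcol, ..., */ comm, &grid);    /* step 3:
        comm must contain exactly nprow * npcol ranks                  */
cublasMpMatrixDescriptor_t descA;
cublasMpMatrixDescriptorCreate(/* m, n, mb, nb, ..., */ grid, &descA);
                                                             /* step 4  */
/* step 5: query host and device workspace sizes for the routine       */
/* step 6: allocate device workspace; for cublasMpMatmul, use
 *         cublasMpMalloc(), or ncclMemAlloc + cublasMpBufferRegister()
 *         to enable the AG+GEMM / GEMM+RS / GEMM+AR overlap           */
/* step 7: launch the routine, e.g. cublasMpMatmul(handle, ...)        */
cudaStreamSynchronize(/* stream */);    /* step 8: wait for the result */
/* step 9: free host and device workspace                              */
cublasMpMatrixDescriptorDestroy(descA);                      /* step 10 */
cublasMpGridDestroy(grid);                                   /* step 11 */
cublasMpDestroy(handle);                                     /* step 12 */
ncclCommDestroy(comm);                                       /* step 13 */
```

The teardown in steps 9-13 mirrors the creation order in steps 1-4 in reverse, so each object is destroyed only after everything that references it.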