Getting Started#
Introduction#
Host workspaces and input/output matrix buffers can be allocated and managed with the standard CUDA memory APIs: cudaMalloc(), cudaMallocAsync(), cudaMemcpyAsync(), etc. For cublasMpMatmul, allocating the device workspace with NCCL symmetric memory enables the high-performance AllGather+GEMM, GEMM+ReduceScatter, and GEMM+AllReduce algorithms; without it, the library uses no_overlap (see Memory Management).

Hardware and Software Requirements#
| Component | Requirement |
| --- | --- |
| GPU Architectures | Compute Capability 7.5 (Turing) and above (CUDA 13); Compute Capability 7.0 (Volta) and above (CUDA 12) |
| CUDA | 13.x; 12.9.0 or above recommended, 12.x compatible |
| CPU Architectures | x86_64, arm64-sbsa |
| Operating Systems | Linux |
NVIDIA InfiniBand solutions are recommended for accelerated inter-node communication.
Required Packages:
- NCCL v2.29.2 and above
Note
cuBLASMp versions older than 0.8.0 also require NVSHMEM v3.3.24 and above. Starting from cuBLASMp 0.8.0, NVSHMEM is no longer required as the library uses NCCL Symmetric Memory instead.
Recommended Packages:
- GDRCopy v2.0+ (NVIDIA/gdrcopy) and nv_peer_mem (Mellanox/nv_peer_memory) - allow the underlying communication packages to use GPUDirect RDMA.
- Mellanox OFED (https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) - drivers for NVIDIA InfiniBand adapters.
Data Layout of Local Matrices#
Workflow#
1. Create a NCCL communicator: NCCL Initialization.
2. Initialize the library handle: cublasMpCreate().
3. Initialize grid descriptors: cublasMpGridCreate().
4. Initialize matrix descriptors: cublasMpMatrixDescriptorCreate().
5. Query the host and device buffer sizes for a given routine.
6. Allocate workspace buffers. For cublasMpMatmul, use cublasMpMalloc() for the device workspace to enable communication-computation overlap in the AG+GEMM, GEMM+RS, and GEMM+AR algorithms; standard cudaMalloc() is also accepted, in which case the library either uses no_overlap or returns CUBLASMP_STATUS_NOT_SUPPORTED. For all other routines, use standard cudaMalloc() or any other CUDA memory allocator. The host workspace and input/output matrix buffers (A, B, C, D) always use standard allocators.
7. Execute the routine to perform the desired computation.
8. Synchronize the local stream to make sure the result is available, if required: cudaStreamSynchronize().
9. Deallocate the host and device workspaces.
10. Destroy matrix descriptors: cublasMpMatrixDescriptorDestroy().
11. Destroy grid descriptors: cublasMpGridDestroy().
12. Destroy the cuBLASMp library handle: cublasMpDestroy().
13. Destroy the NCCL communicator: NCCL Initialization.
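The steps above can be outlined as a C-style sketch. This is pseudocode, not a compilable program: argument lists are elided where the exact signatures depend on the routine and API version, the buffer-size query in step 5 varies per routine, and the deallocation call paired with cublasMpMalloc() is assumed here to be named cublasMpFree(); consult the API reference for the exact signatures.

```
// 1. Create a NCCL communicator (one rank per GPU).
ncclComm_t comm;
ncclCommInitRank(&comm, nranks, id, rank);

// 2.-4. Library handle, process grid, and matrix descriptors.
cublasMpHandle_t handle;
cublasMpCreate(&handle, stream);
cublasMpGrid_t grid;
cublasMpGridCreate(/* grid shape, layout, */ comm, &grid);
cublasMpMatrixDescriptor_t descA;
cublasMpMatrixDescriptorCreate(/* global dims, block sizes, ..., */ grid, &descA);
// ... likewise for B, C, D ...

// 5.-6. Query workspace sizes, then allocate. For cublasMpMatmul,
// cublasMpMalloc() provides a device workspace backed by NCCL symmetric
// memory, enabling the overlapped AG+GEMM / GEMM+RS / GEMM+AR algorithms.
size_t devBytes, hostBytes;
/* <routine>_bufferSize(...) -> devBytes, hostBytes */
void* dWork; cublasMpMalloc(&dWork, devBytes);   // device workspace
void* hWork = malloc(hostBytes);                 // host workspace

// 7.-8. Run the routine, then wait for the result on the local stream.
cublasMpMatmul(handle, /* ..., */ dWork, devBytes, hWork, hostBytes);
cudaStreamSynchronize(stream);

// 9.-13. Tear down in reverse order of creation.
cublasMpFree(dWork); free(hWork);
cublasMpMatrixDescriptorDestroy(descA); // ... and the other descriptors ...
cublasMpGridDestroy(grid);
cublasMpDestroy(handle);
ncclCommDestroy(comm);
```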