Getting Started#
Introduction#
cuBLASMp's routines are stream-ordered and interoperate with CUDA's asynchronous APIs such as cudaMallocAsync(), cudaFreeAsync(), cudaMemcpyAsync(), etc.
Hardware and Software Requirements#
GPU Architectures: Compute Capability 7.5 (Turing) and above (CUDA 13); Compute Capability 7.0 (Volta) and above (CUDA 12)
CUDA: 13.x (CUDA 13); 12.9.0 or above recommended, 12.x compatible (CUDA 12)
CPU Architectures: x86_64, arm64-sbsa
Operating Systems: Linux
NVIDIA InfiniBand solutions are recommended for accelerated inter-node communication.
Required Packages
NCCL v2.29.2 and above
Note
cuBLASMp versions older than 0.8.0 also require NVSHMEM v3.3.24 and above. Starting from cuBLASMp 0.8.0, NVSHMEM is no longer required as the library uses NCCL Symmetric Memory instead.
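Where it is useful to confirm that the NCCL library available at runtime meets this requirement, ncclGetVersion() can be queried. The short sketch below assumes NCCL's documented version encoding (major*10000 + minor*100 + patch for releases 2.9 and later), so 2.29.2 corresponds to the code 22902:

    #include <cstdio>
    #include <nccl.h>

    int main()
    {
        // Query the NCCL version linked at runtime.
        int version = 0;
        if (ncclGetVersion(&version) != ncclSuccess)
        {
            std::fprintf(stderr, "ncclGetVersion failed\n");
            return 1;
        }
        // cuBLASMp requires NCCL v2.29.2 and above; 2.29.2 encodes as 22902.
        const int required = 2 * 10000 + 29 * 100 + 2;
        std::printf("NCCL runtime version code: %d (required >= %d)\n", version, required);
        return version >= required ? 0 : 1;
    }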
Recommended Packages
GDRCopy v2.0+ (NVIDIA/gdrcopy) and nv_peer_mem (Mellanox/nv_peer_memory) - allow the underlying communication packages to use GPUDirect RDMA.
Mellanox OFED (https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) - drivers for NVIDIA InfiniBand Adapters.
Data Layout of Local Matrices#
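cuBLASMp, like ScaLAPACK and PBLAS, distributes a global matrix over the process grid in a 2D block-cyclic fashion, so each process stores only its local blocks. As a rough, self-contained illustration (this is the standard ScaLAPACK-style NUMROC computation, shown here for context and not a cuBLASMp API), the number of rows or columns a process owns locally can be computed as follows:

    #include <cstdint>
    #include <cstdio>

    // Number of rows (or columns) of a global dimension n, split into blocks of
    // size nb and dealt out cyclically over nprocs processes, that end up on
    // process proc when srcproc owns the first block.
    static std::int64_t numroc(std::int64_t n, std::int64_t nb, int proc, int srcproc, int nprocs)
    {
        const std::int64_t nblocks = n / nb;                  // complete blocks in this dimension
        const int dist = (proc - srcproc + nprocs) % nprocs;  // distance from the source process
        std::int64_t local = (nblocks / nprocs) * nb;         // full rounds of the cyclic deal
        const std::int64_t rem = nblocks % nprocs;
        if (dist < rem)
            local += nb;                                      // one extra complete block
        else if (dist == rem)
            local += n % nb;                                  // the trailing partial block, if any
        return local;
    }

    int main()
    {
        // Example: a 1000 x 1000 matrix in 128 x 128 blocks on a 2 x 2 process grid.
        for (int pr = 0; pr < 2; ++pr)
            for (int pc = 0; pc < 2; ++pc)
                std::printf("process (%d,%d) holds a %lld x %lld local matrix\n", pr, pc,
                            static_cast<long long>(numroc(1000, 128, pr, 0, 2)),
                            static_cast<long long>(numroc(1000, 128, pc, 0, 2)));
        return 0;
    }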
Workflow#
1. Create a NCCL communicator: NCCL Initialization.
2. Initialize the library handle: cublasMpCreate().
3. Initialize grid descriptors: cublasMpGridCreate().
4. Initialize matrix descriptors: cublasMpMatrixDescriptorCreate().
5. Query the host and device buffer sizes for a given routine.
6. Allocate host and device workspace buffers for a given routine.
7. Execute the routine to perform the desired computation.
8. Synchronize the local stream to make sure the result is available, if required: cudaStreamSynchronize().
9. Deallocate host and device workspace.
10. Destroy matrix descriptors: cublasMpMatrixDescriptorDestroy().
11. Destroy grid descriptors: cublasMpGridDestroy().
12. Destroy the cuBLASMp library handle: cublasMpDestroy().
13. Destroy the NCCL communicator: NCCL Initialization.
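A minimal sketch of this workflow is shown below, under a few stated assumptions: NCCL is bootstrapped through MPI (any out-of-band mechanism works), the header name cublasmp.h and the cublasMpCreate() signature (handle plus local stream) are assumed, and the device count per node is a placeholder. Steps 3 through 7 and 9 through 11 are only marked with comments because their argument lists depend on the routine being called; see the cuBLASMp API reference for the exact signatures. Error checking is omitted for brevity.

    #include <mpi.h>
    #include <nccl.h>
    #include <cuda_runtime.h>
    #include <cublasmp.h>

    int main(int argc, char** argv)
    {
        // Step 1: create a NCCL communicator, bootstrapped here through MPI.
        MPI_Init(&argc, &argv);
        int rank = 0, nranks = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int gpus_per_node = 8; // assumption: adjust to the actual node configuration
        cudaSetDevice(rank % gpus_per_node);

        ncclUniqueId id;
        if (rank == 0) ncclGetUniqueId(&id);
        MPI_Bcast(&id, sizeof(ncclUniqueId), MPI_BYTE, 0, MPI_COMM_WORLD);
        ncclComm_t comm = nullptr;
        ncclCommInitRank(&comm, nranks, id, rank);

        cudaStream_t stream = nullptr;
        cudaStreamCreate(&stream);

        // Step 2: initialize the cuBLASMp handle on the local stream
        // (assumed signature: handle plus stream).
        cublasMpHandle_t handle = nullptr;
        cublasMpCreate(&handle, stream);

        // Steps 3-7: cublasMpGridCreate(), cublasMpMatrixDescriptorCreate(),
        // the buffer-size query for the chosen routine, allocation of host and
        // device workspace (e.g. with cudaMallocAsync()), and the routine call
        // itself. Argument lists are routine-specific and omitted here.

        // Step 8: synchronize the local stream so the result is available.
        cudaStreamSynchronize(stream);

        // Steps 9-11: deallocate workspace, cublasMpMatrixDescriptorDestroy(),
        // cublasMpGridDestroy().

        // Steps 12-13: destroy the handle and the NCCL communicator.
        cublasMpDestroy(handle);
        ncclCommDestroy(comm);

        cudaStreamDestroy(stream);
        MPI_Finalize();
        return 0;
    }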