Getting Started#

Introduction#

cuBLASMp provides GPU-accelerated, PBLAS-like basic linear algebra functionality.
cuBLASMp uses a 2D block-cyclic data layout for load balancing and to maximize compatibility with PBLAS routines.
The library assumes the data is available in device memory. It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime APIs, such as cudaMalloc(), cudaMallocAsync(), and cudaMemcpyAsync().

For cublasMpMatmul, allocating the device workspace with NCCL symmetric memory, or registering a compatible allocation with cublasMpBufferRegister, enables the high-performance AllGather+GEMM, GEMM+ReduceScatter, and GEMM+AllReduce algorithms; without it, the library uses no_overlap (see Memory Management).
Python users can access cuBLASMp through nvmath-python, either via the higher-level Distributed APIs or the low-level nvmath.bindings.cublasMp module.
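The allocation pattern described above can be sketched as follows. This is an illustrative fragment, not a complete program: argument lists for the cuBLASMp calls are abbreviated with /* ... */, and the exact signatures should be taken from the API reference.

```c
// Sketch only: assumes a CUDA-capable device and an initialized stream.
cudaStream_t stream;
cudaStreamCreate(&stream);

// Input/output matrices (A, B, C, D): any CUDA allocator works.
double *d_A;
cudaMallocAsync((void **)&d_A, local_bytes_A, stream);
cudaMemcpyAsync(d_A, h_A, local_bytes_A, cudaMemcpyHostToDevice, stream);

// Device workspace for cublasMpMatmul: use cublasMpMalloc(), or register
// an ncclMemAlloc / CUDA VMM allocation with
// cublasMpBufferRegister(/* ... */), to enable the overlapped
// AG+GEMM, GEMM+RS, and GEMM+AR algorithms. A plain cudaMalloc()
// workspace falls back to the no_overlap algorithm.
```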

Hardware and Software Requirements#

GPU Architectures

  • Compute Capability 7.5 (Turing) and above (CUDA 13)

  • Compute Capability 7.0 (Volta) and above (CUDA 12)

CUDA

  • 13.x

  • 12.x (12.9.0 or above recommended)

CPU Architectures

  • x86_64, arm64-sbsa

Operating Systems

  • Linux

Required Packages

Note

cuBLASMp versions older than 0.8.0 also require NVSHMEM v3.3.24 or above. Starting with cuBLASMp 0.8.0, NVSHMEM is no longer required, as the library uses NCCL Symmetric Memory instead.

Recommended Packages

  • Recommended NVIDIA InfiniBand solutions for accelerated inter-node communication

Data Layout of Local Matrices#

cuBLASMp assumes that local matrices are stored in column-major format.

Workflow#

cuBLASMp’s workflow can be broken down as follows:
1. Create an NCCL communicator: NCCL Initialization.
2. Initialize the library handle: cublasMpCreate().
3. Initialize grid descriptors: cublasMpGridCreate(). The NCCL communicator passed to this call must contain exactly nprow * npcol ranks.
4. Initialize matrix descriptors: cublasMpMatrixDescriptorCreate().
5. Query the host and device buffer sizes for a given routine.
6. Allocate workspace buffers. For cublasMpMatmul, use cublasMpMalloc() or allocate with ncclMemAlloc / CUDA VMM and register the allocation with cublasMpBufferRegister() to enable the communication-computation overlap in the AG+GEMM, GEMM+RS, and GEMM+AR algorithms. Standard cudaMalloc is also accepted, but then the library either uses no_overlap or returns CUBLASMP_STATUS_NOT_SUPPORTED for explicit pipelined algorithm requests. For all other routines, use standard cudaMalloc or any other CUDA memory allocator. The host workspace and input/output matrix buffers (A, B, C, D) always use standard allocators.
7. Execute the routine to perform the desired computation.
8. Synchronize the local stream to make sure the result is available, if required: cudaStreamSynchronize().
9. Deallocate host and device workspace.
10. Destroy matrix descriptors: cublasMpMatrixDescriptorDestroy().
11. Destroy grid descriptors: cublasMpGridDestroy().
12. Destroy cuBLASMp library handle: cublasMpDestroy().
13. Destroy the NCCL communicator: NCCL Initialization.
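The steps above can be sketched as the following call sequence. This is a sketch, not a complete program: argument lists are abbreviated with /* ... */, error checking is omitted, and the exact signatures should be taken from the API reference.

```c
// Sketch of the cuBLASMp workflow; not a complete program.
ncclComm_t comm;                       // 1. communicator over nprow * npcol ranks
/* ncclCommInitRank(...); */

cublasMpHandle_t handle;
cublasMpCreate(/* &handle, stream */);                      // 2. library handle

cublasMpGrid_t grid;
cublasMpGridCreate(/* nprow, npcol, ..., comm, &grid */);   // 3. grid descriptor

cublasMpMatrixDescriptor_t descA;
cublasMpMatrixDescriptorCreate(/* global sizes, block sizes,
                                  grid, &descA */);         // 4. matrix descriptor

/* 5.-6. Query host/device workspace sizes, then allocate workspace
         (for cublasMpMatmul: cublasMpMalloc() or a registered buffer). */

cublasMpMatmul(/* handle, descriptors, buffers, workspace */);  // 7. compute
cudaStreamSynchronize(stream);                                  // 8. synchronize

/* 9. Free host and device workspace, then tear down in reverse order: */
cublasMpMatrixDescriptorDestroy(/* descA */);               // 10.
cublasMpGridDestroy(/* grid */);                            // 11.
cublasMpDestroy(/* handle */);                              // 12.
ncclCommDestroy(comm);                                      // 13.
```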

Code Samples#

Code samples can be found in the CUDALibrarySamples repository.