Getting Started

Introduction

cuBLASMp aims to provide GPU-accelerated PBLAS-like basic linear algebra functionality.
cuBLASMp uses the 2D block-cyclic data layout for load balancing and to maximize compatibility with PBLAS routines.
The library assumes that the data already resides in device memory. The developer is responsible for allocating memory and for copying data between GPU memory and CPU memory using standard CUDA runtime APIs such as cudaMallocAsync(), cudaFreeAsync(), and cudaMemcpyAsync().
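To make the 2D block-cyclic layout concrete, the following is a minimal sketch of the index mapping along one dimension, assuming the usual ScaLAPACK-style convention (0-based indices, blocks dealt round-robin starting at a source process). The names `BlockCyclicIndex` and `global_to_local` are illustrative and not part of the cuBLASMp API.

```cpp
#include <cassert>
#include <cstdint>

// Location of one global index along a single dimension (rows or columns).
struct BlockCyclicIndex {
    std::int64_t proc;   // process coordinate along this dimension
    std::int64_t local;  // 0-based index into that process's local storage
};

// Maps a 0-based global index to its owner and local index for a 1D
// block-cyclic distribution with block size `nb` over `nprocs` processes.
// `src_proc` is the process that holds the first block (assumed convention).
inline BlockCyclicIndex global_to_local(std::int64_t global, std::int64_t nb,
                                        std::int64_t nprocs,
                                        std::int64_t src_proc = 0) {
    assert(global >= 0 && nb > 0 && nprocs > 0);
    assert(src_proc >= 0 && src_proc < nprocs);
    const std::int64_t block = global / nb;                 // global block number
    const std::int64_t proc = (src_proc + block) % nprocs;  // blocks are dealt round-robin
    const std::int64_t local_block = block / nprocs;        // owner's blocks that precede it
    return {proc, local_block * nb + global % nb};
}
```

For a 2D distribution the same function is applied independently to the row index (with the row block size and the number of process rows) and to the column index (with the column block size and the number of process columns).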

Hardware and Software requirements

GPU Architectures: Volta (SM 7.0), Ampere (SM 8.0), Hopper (SM 9.0)

CUDA: 11.8.0, 12.2.2

CPU Architectures: x86_64, arm64-sbsa

Operating System: Linux

  • Recommended NVIDIA InfiniBand solutions for accelerated inter-node communication

Required packages

Recommended packages

Synchronous Execution

Currently, cuBLASMp computational routines are blocking with respect to the host. A routine returns control to the caller only after it has finished, and the result is then available on the device without further synchronization. This constraint will be relaxed in future releases.

Data Layout of Local Matrices

cuBLASMp assumes that local matrices are stored in column-major format.
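As an illustration of column-major storage, the sketch below computes the offset of element (row, col) in a local buffer and fills a small host matrix. The helper names are hypothetical, and `lld` denotes the leading dimension of the local matrix (the stride between consecutive columns), assumed here to be at least the number of local rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a column-major matrix whose columns are
// `lld` elements apart in memory.
inline std::size_t col_major_offset(std::size_t row, std::size_t col,
                                    std::size_t lld) {
    assert(row < lld);
    return row + col * lld;
}

// Builds a rows x cols column-major host buffer (lld == rows) in which
// element (i, j) holds the value 10*i + j.
inline std::vector<double> make_example_matrix(std::size_t rows,
                                               std::size_t cols) {
    std::vector<double> a(rows * cols);
    for (std::size_t j = 0; j < cols; ++j)        // walk column by column:
        for (std::size_t i = 0; i < rows; ++i)    // consecutive i are adjacent in memory
            a[col_major_offset(i, j, rows)] = 10.0 * i + j;
    return a;
}
```

A buffer filled this way on the host can then be copied to the device unchanged, since the element order already matches what the library expects.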

Workflow

cuBLASMp’s workflow can be broken down as follows:
1. Bootstrap CAL communicator: cal_comm_create().
2. Initialize the library handle: cublasMpCreate().
3. Initialize grid descriptors: cublasMpGridCreate().
4. Initialize matrix descriptors: cublasMpMatrixDescriptorCreate().
5. Query the host and device buffer sizes for a given routine.
6. Allocate host and device workspace buffers for a given routine.
7. Execute the routine to perform the desired computation.
8. Synchronize the local stream to make sure the result is available, if required: cal_stream_sync().
9. Deallocate host and device workspace.
10. Destroy matrix descriptors: cublasMpMatrixDescriptorDestroy().
11. Destroy grid descriptors: cublasMpGridDestroy().
12. Destroy cuBLASMp library handle: cublasMpDestroy().
13. Destroy CAL communicator: cal_comm_destroy().
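Creating matrix descriptors and allocating device buffers requires each process to know how many rows and columns of the global matrix it stores locally. The sketch below computes that count along one dimension, assuming the ScaLAPACK NUMROC convention. `local_extent` is an illustrative helper written for this guide; the library may offer its own equivalent, so consult the API reference before relying on this one.

```cpp
#include <cassert>
#include <cstdint>

// Number of rows (or columns) of an n-long global dimension stored on process
// `iproc`, for block size `nb` distributed block-cyclically over `nprocs`
// processes, where `src_proc` owns the first block.
inline std::int64_t local_extent(std::int64_t n, std::int64_t nb,
                                 std::int64_t iproc, std::int64_t src_proc,
                                 std::int64_t nprocs) {
    assert(n >= 0 && nb > 0 && nprocs > 0);
    assert(iproc >= 0 && iproc < nprocs);
    assert(src_proc >= 0 && src_proc < nprocs);
    // Distance of this process from the owner of the first block.
    const std::int64_t dist = (nprocs + iproc - src_proc) % nprocs;
    const std::int64_t nblocks = n / nb;             // number of full blocks
    std::int64_t extent = (nblocks / nprocs) * nb;   // full rounds every process receives
    const std::int64_t extra = nblocks % nprocs;     // leftover full blocks
    if (dist < extra) {
        extent += nb;                                // one of the leftover full blocks
    } else if (dist == extra) {
        extent += n % nb;                            // the trailing partial block, if any
    }
    return extent;
}
```

Calling it once for rows and once for columns gives the local matrix shape. With column-major storage and a leading dimension equal to the local row count, the local buffer then needs at least rows times columns elements.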

Code Samples

Code samples can be found in the CUDALibrarySamples repository.