Getting Started#

Introduction#

cuSOLVERMp aims to provide GPU-accelerated ScaLAPACK-like tools for solving systems of linear equations and eigenvalue and singular value problems.
cuSOLVERMp leverages the 2D block-cyclic data layout for load balancing and to maximize compatibility with ScaLAPACK routines.
The library assumes that data is available in device memory. It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime API routines, such as cudaMalloc(), cudaFree(), cudaMemcpy(), and cudaMemcpyAsync().
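For example, a local matrix block can be staged in device memory with the CUDA runtime as in the following sketch; the helper, its variable names, and the dimensions are illustrative and not part of the cuSOLVERMp API.

```c
#include <cuda_runtime.h>
#include <stdint.h>

/* Illustrative helper: allocate a local (column-major) matrix block on the
 * GPU and copy it from host memory. llda is the local leading dimension and
 * localCols the number of local columns; both are hypothetical names. */
static int stage_local_matrix(const double *h_A, int64_t llda, int64_t localCols,
                              cudaStream_t stream, double **d_A_out)
{
    double *d_A = NULL;
    size_t bytes = (size_t)llda * (size_t)localCols * sizeof(double);

    if (cudaMalloc((void **)&d_A, bytes) != cudaSuccess)
        return -1;

    /* Asynchronous host-to-device copy issued on the same stream that will
     * later be handed to the library. */
    if (cudaMemcpyAsync(d_A, h_A, bytes, cudaMemcpyHostToDevice, stream) != cudaSuccess) {
        cudaFree(d_A);
        return -1;
    }
    *d_A_out = d_A;
    return 0;
}
```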

Hardware and Software Requirements#

GPU Architectures: Compute Capability 7.0 (Volta) and above
CUDA: 12.1.1 and above
CPU Architectures: x86_64, arm64-sbsa
Operating System: Linux
Required Packages:
Recommended Packages:

Data Layout of Local Matrices#

cuSOLVERMp assumes that local matrices are stored in column-major format.
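In column-major storage, the local entry in (0-based) row i and column j of a block with local leading dimension lld sits at linear offset i + j * lld. The sketch below shows that addressing and a NUMROC-style helper for computing how much of one global dimension lands on a given process under the 2D block-cyclic distribution; it mirrors ScaLAPACK's NUMROC and is not a cuSOLVERMp API.

```c
#include <stdint.h>

/* Column-major offset of local entry (i, j), 0-based, for a local block
 * with leading dimension lld. */
static inline int64_t local_offset(int64_t i, int64_t j, int64_t lld)
{
    return i + j * lld;
}

/* Sketch of a NUMROC-style helper: number of rows (or columns) of a global
 * dimension of size n, distributed in blocks of size nb over nprocs
 * processes, owned by process iproc, where isrcproc owns the first block. */
static int64_t local_extent(int64_t n, int64_t nb, int iproc, int isrcproc, int nprocs)
{
    int mydist = (nprocs + iproc - isrcproc) % nprocs; /* distance from the owner of the first block */
    int64_t nblocks = n / nb;                          /* number of complete blocks */
    int64_t extent = (nblocks / nprocs) * nb;          /* blocks every process receives */
    int64_t extra = nblocks % nprocs;                  /* leftover complete blocks */

    if (mydist < extra)
        extent += nb;        /* this process gets one more complete block */
    else if (mydist == extra)
        extent += n % nb;    /* this process gets the trailing partial block */
    return extent;
}
```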

Workflow#

cuSOLVERMp’s workflow can be broken down as follows (a code sketch of these steps follows the list):
1. Bootstrap CAL communicator: cal_comm_create().
2. Initialize the library handle: cusolverMpCreate().
3. Initialize grid descriptors: cusolverMpCreateDeviceGrid().
4. Initialize matrix descriptors: cusolverMpCreateMatrixDesc().
5. Query the host and device buffer sizes for a given routine.
6. Allocate host and device workspace buffers for a given routine.
7. Execute the routine to perform the desired computation.
8. Synchronize the local stream to make sure the result is available, if required: cal_stream_sync().
9. Deallocate host and device workspace.
10. Destroy matrix descriptors: cusolverMpDestroyMatrixDesc().
11. Destroy grid descriptors: cusolverMpDestroyGrid().
12. Destroy cuSOLVERMp library handle: cusolverMpDestroy().
13. Destroy CAL library handle: cal_comm_destroy().
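Put together, the steps above map onto a skeleton like the one below. This is a sketch, not a complete program: error checking and the MPI/CAL bootstrap are omitted, cusolverMpGetrf is only one possible choice of routine, and the argument lists shown are assumptions that should be checked against the routine reference and the CUDALibrarySamples code.

```c
#include <cusolverMp.h>   /* cuSOLVERMp API */
#include <cal.h>          /* CAL communicator API (header name assumed) */
#include <cuda_runtime.h>
#include <stdlib.h>

/* Sketch of steps 2-12 for a distributed LU factorization. The CAL
 * communicator is assumed to be created beforehand with cal_comm_create()
 * (step 1) and destroyed by the caller with cal_comm_destroy() (step 13). */
static void run_getrf_sketch(cal_comm_t comm, int localDevice, cudaStream_t stream,
                             int nprow, int npcol,
                             int64_t M, int64_t N, int64_t MB, int64_t NB, int64_t llda,
                             double *d_A, int64_t *d_ipiv, int *d_info)
{
    cusolverMpHandle_t handle = NULL;
    cusolverMpGrid_t grid = NULL;
    cusolverMpMatrixDescriptor_t descA = NULL;
    size_t devBytes = 0, hostBytes = 0;
    void *d_work = NULL, *h_work = NULL;

    /* 2-4: library handle, device grid, and matrix descriptor */
    cusolverMpCreate(&handle, localDevice, stream);
    cusolverMpCreateDeviceGrid(handle, &grid, comm, nprow, npcol,
                               CUSOLVERMP_GRID_MAPPING_COL_MAJOR);
    cusolverMpCreateMatrixDesc(&descA, grid, CUDA_R_64F, M, N, MB, NB,
                               0, 0 /* first row/column owners */, llda);

    /* 5-6: query and allocate host and device workspaces */
    cusolverMpGetrf_bufferSize(handle, M, N, d_A, 1, 1 /* 1-based global offsets */,
                               descA, d_ipiv, CUDA_R_64F, &devBytes, &hostBytes);
    cudaMalloc(&d_work, devBytes);
    h_work = malloc(hostBytes);

    /* 7-8: execute the routine and wait until the result is available */
    cusolverMpGetrf(handle, M, N, d_A, 1, 1, descA, d_ipiv, CUDA_R_64F,
                    d_work, devBytes, h_work, hostBytes, d_info);
    cal_stream_sync(comm, stream);

    /* 9-12: release workspaces, descriptor, grid, and handle */
    cudaFree(d_work);
    free(h_work);
    cusolverMpDestroyMatrixDesc(descA);
    cusolverMpDestroyGrid(grid);
    cusolverMpDestroy(handle);
}
```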

Code Samples#

Code samples can be found in the CUDALibrarySamples repository (https://github.com/NVIDIA/CUDALibrarySamples).