Getting Started#
Introduction#
cudaMalloc(), cudaFree(), cudaMemcpy(), and cudaMemcpyAsync().
Hardware and Software Requirements#
GPU Architectures | Compute Capability 7.0 (Volta) and above
CUDA | 12.1.1 and above
CPU Architectures | x86_64, arm64-sbsa
Operating System | Linux
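The GPU and CUDA minimums listed above can be checked programmatically. The sketch below is a hypothetical helper, not part of cuSOLVERMp. It assumes the caller has already obtained the CUDA runtime version (for example from cudaRuntimeGetVersion(), which reports 1000 * major + 10 * minor) and the device compute capability (for example from cudaGetDeviceProperties()). The names encode_cuda_version and meets_minimum_requirements are illustrative. Because the encoded version carries no patch level, the check can confirm 12.1 but not 12.1.1.

```cpp
#include <cassert>

// CUDA encodes versions as 1000 * major + 10 * minor (e.g. 12.1 -> 12010).
// The patch level (the final ".1" in 12.1.1) is not part of this encoding.
constexpr int encode_cuda_version(int major, int minor) {
    return 1000 * major + 10 * minor;
}

// Hypothetical helper: returns true when the given runtime version and
// compute capability meet the documented minimums (CUDA 12.1, CC 7.0).
bool meets_minimum_requirements(int runtime_version, int cc_major, int cc_minor) {
    assert(cc_major >= 0 && cc_minor >= 0);  // capabilities are never negative
    const bool cuda_ok = runtime_version >= encode_cuda_version(12, 1);
    const bool gpu_ok = cc_major >= 7;       // 7.0 (Volta) and above
    return cuda_ok && gpu_ok;
}
```

In a real program the two inputs would come from the CUDA runtime, and a failed check would typically abort before any library handle is created.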
Required Packages
CUDA Toolkit 12.1.1 and above (https://developer.nvidia.com/cuda-downloads)
HPC-X v2.16 and above (https://developer.nvidia.com/networking/hpc-x) - contains OpenUCC and OpenUCX that satisfy cuSOLVERMp requirements.
CAL (https://developer.download.nvidia.com/compute/cublasmp/redist/libcal) - a companion library used for communication.
Recommended Packages
NCCL v2.16 and above (https://developer.nvidia.com/nccl) - required to achieve good performance.
OpenUCX v1.10+ (openucx/ucx) and OpenUCC v1.1+ (openucx/ucc) - instead of HPC-X, you can install OpenUCX and OpenUCC manually. Both need to be configured with CUDA support.
GDRCopy v2.0+ (NVIDIA/gdrcopy) and nv_peer_mem (Mellanox/nv_peer_memory) - allow the underlying communication packages to use GPUDirect RDMA. If you install OpenUCX yourself, configure it with GDRCopy support.
Mellanox OFED (https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) - drivers for NVIDIA InfiniBand Adapters (https://www.nvidia.com/en-us/networking/products/infiniband). If you install OpenUCX yourself, configure it with InfiniBand communication support.
Data Layout of Local Matrices#
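A minimal sketch, assuming local matrices follow a ScaLAPACK-style 2D block-cyclic distribution and are stored column-major on each process. Under that assumption, the number of rows (or columns) a process owns follows the logic of ScaLAPACK's NUMROC. The function names local_extent and column_major_index are hypothetical and are not part of the library API.

```cpp
#include <cassert>
#include <cstdint>

// Number of rows (or columns) of a dimension of size n, split into blocks of
// size nb, that land on process iproc in a 1D block-cyclic distribution over
// nprocs processes, with the first block owned by process isrcproc.
// Mirrors the logic of ScaLAPACK's NUMROC.
std::int64_t local_extent(std::int64_t n, std::int64_t nb, std::int64_t iproc,
                          std::int64_t isrcproc, std::int64_t nprocs) {
    assert(nb > 0 && nprocs > 0);
    const std::int64_t mydist = (nprocs + iproc - isrcproc) % nprocs;
    const std::int64_t nblocks = n / nb;               // number of full blocks
    std::int64_t result = (nblocks / nprocs) * nb;     // full rounds of blocks
    const std::int64_t extrablks = nblocks % nprocs;   // leftover full blocks
    if (mydist < extrablks) {
        result += nb;          // this process gets one extra full block
    } else if (mydist == extrablks) {
        result += n % nb;      // this process gets the trailing partial block
    }
    return result;
}

// Offset of local element (i, j) in a column-major buffer whose leading
// dimension (allocated rows per column) is lld.
std::int64_t column_major_index(std::int64_t i, std::int64_t j, std::int64_t lld) {
    return i + j * lld;
}
```

Applying local_extent once along the rows and once along the columns gives the local matrix shape on each process of the grid, and the local row count is a lower bound for the leading dimension.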
Workflow#
1. Bootstrap CAL communicator: cal_comm_create().
2. Initialize the library handle: cusolverMpCreate().
3. Initialize grid descriptors: cusolverMpCreateDeviceGrid().
4. Initialize matrix descriptors: cusolverMpCreateMatrixDesc().
5. Query the host and device buffer sizes for a given routine.
6. Allocate host and device workspace buffers for a given routine.
7. Execute the routine to perform the desired computation.
8. Synchronize local stream to make sure the result is available, if required: cal_stream_sync().
9. Deallocate host and device workspace.
10. Destroy matrix descriptors: cusolverMpDestroyMatrixDesc().
11. Destroy grid descriptors: cusolverMpDestroyGrid().
12. Destroy cuSOLVERMp library handle: cusolverMpDestroy().
13. Destroy CAL library handle: cal_comm_destroy().
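The workflow tears resources down in the reverse order of their creation. The sketch below illustrates that ordering with a scope guard. It does not call the real library: each lambda is a placeholder that only records the name of the step it stands in for, and ScopedStep, CallLog, and run_workflow_sketch are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Records the order of setup and teardown steps so it can be inspected.
using CallLog = std::vector<std::string>;

// Hypothetical scope guard: runs `create` on construction and `destroy`
// when the guard goes out of scope.
class ScopedStep {
public:
    ScopedStep(std::function<void()> create, std::function<void()> destroy)
        : destroy_(std::move(destroy)) {
        assert(create && destroy_);  // both callables must be non-empty
        create();
    }
    ~ScopedStep() { destroy_(); }
    ScopedStep(const ScopedStep&) = delete;
    ScopedStep& operator=(const ScopedStep&) = delete;

private:
    std::function<void()> destroy_;
};

// Placeholder workflow: no library call is made, only step names are logged.
CallLog run_workflow_sketch() {
    CallLog log;
    auto step = [&log](const char* name) {
        return [&log, name] { log.push_back(name); };
    };
    {
        ScopedStep comm(step("cal_comm_create"), step("cal_comm_destroy"));
        ScopedStep handle(step("cusolverMpCreate"), step("cusolverMpDestroy"));
        ScopedStep grid(step("cusolverMpCreateDeviceGrid"), step("cusolverMpDestroyGrid"));
        ScopedStep desc(step("cusolverMpCreateMatrixDesc"), step("cusolverMpDestroyMatrixDesc"));
        log.push_back("query workspace sizes");
        ScopedStep work(step("allocate workspace"), step("free workspace"));
        log.push_back("execute routine");
        log.push_back("cal_stream_sync");
    }  // guards unwind here in reverse order: workspace, matrix, grid, handle, comm
    return log;
}
```

Because C++ destroys local objects in reverse order of construction, the recorded log follows steps 1 through 13 without any explicit cleanup code. In a real program each placeholder would wrap the corresponding library call and check its return status.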