Getting Started
Introduction
cudaMallocAsync(), cudaFreeAsync(), cudaMemcpyAsync(), etc.
Hardware and Software requirements
GPU Architectures | Volta (
CUDA | 11.8.0, 12.2.2
CPU architectures | x86_64, arm64-sbsa
Operating System | Linux
NVIDIA InfiniBand solutions are recommended for accelerated inter-node communication.
Required packages
- CUDA 11.8.0
  - CUDA Toolkit 11.8.0 (https://developer.nvidia.com/cuda-11-8-0-download-archive)
  - HPC-X v2.14 (https://developer.nvidia.com/networking/hpc-x) - contains OpenUCC and OpenUCX that satisfy cuBLASMp requirements.
  - NCCL v2.16.5 (https://developer.nvidia.com/nccl) - required to achieve better performance
- CUDA 12.2.2
  - CUDA Toolkit 12.2.2 (https://developer.nvidia.com/cuda-12-2-2-download-archive)
  - HPC-X v2.17.1 (https://developer.nvidia.com/networking/hpc-x) - contains OpenUCC and OpenUCX that satisfy cuBLASMp requirements.
  - NCCL v2.18.5 (https://developer.nvidia.com/nccl) - required to achieve better performance
Recommended packages
- GDRCopy v2.0+ (https://github.com/NVIDIA/gdrcopy) and nv_peer_mem (https://github.com/Mellanox/nv_peer_memory) - allow the underlying communication packages to use GPUDirect RDMA. If you install OpenUCX yourself, configure it with GDRCopy support.
- Mellanox OFED (https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) - drivers for NVIDIA InfiniBand adapters (https://www.nvidia.com/en-us/networking/products/infiniband). If you install OpenUCX yourself, configure it with InfiniBand (IB) communication support.
Synchronous Execution
Data Layout of Local Matrices
Workflow
1. Bootstrap the CAL communicator: cal_comm_create().
2. Initialize the library handle: cublasMpCreate().
3. Initialize grid descriptors: cublasMpGridCreate().
4. Initialize matrix descriptors: cublasMpMatrixDescriptorCreate().
5. Query the host and device buffer sizes for a given routine.
6. Allocate host and device workspace buffers for a given routine.
7. Execute the routine to perform the desired computation.
8. If required, synchronize the local stream to make sure the result is available: cal_stream_sync().
9. Deallocate the host and device workspace.
10. Destroy matrix descriptors: cublasMpMatrixDescriptorDestroy().
11. Destroy grid descriptors: cublasMpGridDestroy().
12. Destroy the cuBLASMp library handle: cublasMpDestroy().
13. Destroy the CAL communicator: cal_comm_destroy().
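The ordering of the steps above can be sketched as a small, library-free C mock. It does not call cuBLASMp or CAL: each step is only recorded by name, so the real function signatures are deliberately not shown. The labels `query_workspace_sizes`, `allocate_workspace`, `execute_routine`, and `free_workspace`, as well as `call_log_t`, `log_call`, and `run_workflow`, are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STEPS 16

/* Hypothetical call log: stores step names instead of invoking the library. */
typedef struct {
    const char *names[MAX_STEPS];
    size_t count;
} call_log_t;

/* A resource is described by the step that creates it and the step that
 * releases it. The first four pairs use the function names from the
 * workflow; the workspace labels are illustrative placeholders. */
typedef struct {
    const char *create;
    const char *destroy;
} resource_t;

static const resource_t RESOURCES[] = {
    {"cal_comm_create", "cal_comm_destroy"},
    {"cublasMpCreate", "cublasMpDestroy"},
    {"cublasMpGridCreate", "cublasMpGridDestroy"},
    {"cublasMpMatrixDescriptorCreate", "cublasMpMatrixDescriptorDestroy"},
    {"allocate_workspace", "free_workspace"},
};

/* Append one step name to the log. */
void log_call(call_log_t *log, const char *name) {
    assert(log->count < MAX_STEPS);
    log->names[log->count++] = name;
}

/* Record the whole workflow: set up in order, compute, synchronize,
 * then tear down in exactly the reverse order of setup. */
void run_workflow(call_log_t *log) {
    const size_t n = sizeof(RESOURCES) / sizeof(RESOURCES[0]);
    log->count = 0;

    for (size_t i = 0; i < n; ++i) {
        if (i == n - 1) {
            /* Buffer sizes must be known before the workspace is allocated. */
            log_call(log, "query_workspace_sizes");
        }
        log_call(log, RESOURCES[i].create);
    }

    log_call(log, "execute_routine");
    log_call(log, "cal_stream_sync"); /* only if the result is needed right away */

    for (size_t i = n; i-- > 0;) {
        log_call(log, RESOURCES[i].destroy);
    }
}
```

Calling `run_workflow` on a `call_log_t` records 13 entries, one per workflow step, beginning with `cal_comm_create` and ending with `cal_comm_destroy`. Keeping each create step paired with its destroy step in one table is one way to make sure teardown mirrors setup, so no object is destroyed while something created after it still exists.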