cuBLASMp: A High-Performance CUDA Library for Distributed Dense Linear Algebra

NVIDIA cuBLASMp is a high-performance, multi-process, GPU-accelerated library for distributed basic dense linear algebra.

cuBLASMp works with the 2D block-cyclic data layout and provides PBLAS-like C APIs.
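As a rough illustration of what a block-cyclic layout means along one matrix dimension, the sketch below computes which process owns a global index, where that index lands locally, and how many entries each process stores. It follows the usual ScaLAPACK convention and is not cuBLASMp code; the function names `owner_of`, `local_index`, and `local_count` are hypothetical helpers introduced only for this example. A 2D layout applies the same mapping independently to rows (over process-grid rows) and to columns (over process-grid columns).

```c
#include <assert.h>

/* Illustrative helpers only; these names are NOT part of the cuBLASMp API. */

/* Process coordinate that owns global index g, given block size nb,
   nprocs processes along this dimension, and first block held by src. */
static int owner_of(int g, int nb, int src, int nprocs) {
    assert(g >= 0 && nb > 0 && nprocs > 0);
    return (g / nb + src) % nprocs;
}

/* Position of global index g inside its owner's local storage. */
static int local_index(int g, int nb, int nprocs) {
    assert(g >= 0 && nb > 0 && nprocs > 0);
    return (g / (nb * nprocs)) * nb + g % nb;
}

/* Number of rows (or columns) of an n-long dimension stored by process
   iproc; same convention as ScaLAPACK's NUMROC. */
static int local_count(int n, int nb, int iproc, int src, int nprocs) {
    assert(n >= 0 && nb > 0 && nprocs > 0);
    int dist = (nprocs + iproc - src) % nprocs; /* distance from source process */
    int nblocks = n / nb;                       /* number of full blocks */
    int count = (nblocks / nprocs) * nb;        /* full rounds every process gets */
    int extra = nblocks % nprocs;               /* leftover full blocks */
    if (dist < extra) {
        count += nb;                            /* one more full block */
    } else if (dist == extra) {
        count += n % nb;                        /* the trailing partial block */
    }
    return count;
}
```

For example, with 10 rows, a block size of 2, and 2 process rows, process row 0 holds blocks 0, 2, and 4 (6 rows) and process row 1 holds blocks 1 and 3 (4 rows); global row 5 is owned by process row 0 at local position 3.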

A companion library, CAL, contains utilities to manage communicators and to synchronize processes in a safe way.

Download: The cuBLASMp library is available as early access through the NVIDIA Developer Zone and the NVIDIA HPC SDK.

Key Features

  • Multi-process, multi-GPU.

  • One process per GPU.

  • PBLAS-like C functionalities and interfaces to facilitate porting.

  • Configurable communication backends (UCC, NCCL, UCX, etc.).

  • Logging and tracing.

  • Tensor-core accelerated.

Index