Communication abstraction library API and data types

The communication abstraction library (CAL) is a helper module for the cuSolverMp library that sets up communication between different GPUs. The cuSolverMp API accepts a cal_comm_t communicator object and requires it to be created prior to any cuSolverMp call. At this moment the cal_comm_t communicator object can only be initialized from an MPI communicator, and some of the communication routines use the underlying MPI library, so MPI must remain initialized for the duration of library usage. The library's communications currently support only the case where each participating process uses a single GPU and each participating GPU is used by only one process.
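
A minimal lifecycle sketch in C is shown below. The header name cal.h and the rank-to-device mapping are assumptions for illustration; adapt them to your installation and node layout.

#include <mpi.h>
#include <cuda_runtime.h>
#include <cal.h> /* assumed header name for the communication abstraction library */

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0, num_devices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&num_devices);

    /* One GPU per process, one process per GPU: a simple illustrative
       mapping that assumes ranks are placed round-robin per node. */
    int local_device = rank % num_devices;
    cudaSetDevice(local_device);

    MPI_Comm mpi_comm = MPI_COMM_WORLD;
    cal_comm_t cal_comm;
    if (cal_comm_create_distr(&mpi_comm, local_device, &cal_comm) != CAL_OK) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... cuSolverMp calls using cal_comm ... */

    cal_comm_destroy(cal_comm);
    MPI_Finalize();
    return 0;
}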


Communication abstraction library setup

Communications are currently based on system facilities (shared memory), the CUDA runtime (interprocess memory copies and CUDA events), NCCL (for collective calls), and MPI (as a fallback when no other option is available). This module initializes the underlying structures required for communication, and setup returns an error if there are any issues. You can use MPI diagnostics (refer to your MPI distribution's documentation on setting the MPI verbosity level), NCCL diagnostics (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug), and this module's diagnostics (see CAL_LOG_LEVEL below) to understand possible failure reasons. The cuSolverMp library tries to use the most efficient way to communicate between GPUs; however, user-provided environment variables and settings (e.g., MCA MPI parameters or NCCL environment variables) can affect the functionality and performance of the communication module.


Using communication module

A few environment variables can change the behavior of the communication module (in addition to the underlying MPI and NCCL environment variables); a sketch of setting them programmatically follows the table below.

Variable                   Description
CAL_LOG_LEVEL              Verbosity level of the communication module: 0 means no output, 6 means the maximum amount of detail. Default: 0.
CAL_ALLOW_SET_PEER_ACCESS  If set to 1, allows enabling peer-to-peer (p2p) access between pairs of GPUs on the same node. Default: 0.
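
These variables are normally set in the launch environment (for example, CAL_LOG_LEVEL=6 mpirun -np 4 ./app). The C sketch below sets them programmatically instead; it assumes the variables are read when the communicator is created, so they must be set before cal_comm_create_distr() is called.

#include <stdlib.h>

/* Call before creating the communicator; assumes the variables are read
   at communicator creation time. */
void enable_cal_diagnostics(void)
{
    setenv("CAL_LOG_LEVEL", "6", 1);             /* 0 = silent, 6 = maximum detail */
    setenv("CAL_ALLOW_SET_PEER_ACCESS", "1", 1); /* allow p2p between GPUs on a node */
}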

Communication abstraction library data types

calError_t

Return values of the communication abstraction library APIs. The values are described in the table below; a sketch of an error-checking macro follows the table.

Value                        Description
CAL_OK                       Success.
CAL_ERROR                    Generic error.
CAL_ERROR_INVALID_PARAMETER  Invalid parameter passed to the interface function.
CAL_ERROR_INTERNAL           Internal error.
CAL_ERROR_CUDA               Error in a CUDA runtime or driver API call.
CAL_ERROR_MPI                Error in an MPI call.
CAL_ERROR_IPC                Error in a system IPC communication call.
CAL_ERROR_NOT_SUPPORTED      Requested configuration or parameters are not supported.
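
Since every API in this module returns calError_t, a checking macro is a convenient pattern. The sketch below is illustrative; the macro name CAL_CHECK is not part of the library.

#include <stdio.h>
#include <stdlib.h>

#define CAL_CHECK(call)                                          \
    do {                                                         \
        calError_t _status = (call);                             \
        if (_status != CAL_OK) {                                 \
            fprintf(stderr, "CAL error %d at %s:%d\n",           \
                    (int)_status, __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

/* Usage: CAL_CHECK(cal_get_rank(comm, &rank)); */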

cal_comm_t

The cal_comm_t type stores the device endpoint and the resources related to communication.
It must be created and destroyed using the cal_comm_create_distr() and cal_comm_destroy() functions, respectively.

Communication abstraction library API

cal_comm_create_distr

calError_t cal_comm_create_distr(
    void* mpi_comm,
    int local_device,
    cal_comm_t* new_comm)
Initializes a single communicator, using MPI to create communication channels between GPUs. Only one GPU may be used per MPI rank, and different MPI ranks cannot use the same GPU (see the device-selection sketch after this function's description). This API is a collective call with respect to the host: all participating processes synchronize in this call. The device with ID local_device from the CUDA Runtime enumeration is assigned to the new communicator and used for all subsequent operations on this communicator. The total number of processes in the communicator equals the number of ranks in mpi_comm.
Parameter     Description
mpi_comm      Pointer to the MPI communicator that will be used for communicator setup.
local_device  Local device ID to assign to the new communicator. Must be the same as the device of the currently active context.
new_comm      Pointer where the new communicator handle is stored.

See calError_t for the description of the return value.
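
Because different ranks cannot share a GPU, a node-local rank makes a convenient local_device value. The sketch below derives it with MPI_Comm_split_type; the function name is illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

/* Returns a node-local rank so that co-located ranks pick distinct GPUs,
   and makes that device current before cal_comm_create_distr() is called. */
int select_local_device(MPI_Comm comm)
{
    MPI_Comm node_comm;
    int local_rank = 0;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_free(&node_comm);
    cudaSetDevice(local_rank);
    return local_rank;
}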


cal_comm_destroy

calError_t cal_comm_destroy(
    cal_comm_t comm)
Releases the resources associated with the provided communicator handle.
Parameter  Description
comm       Communicator handle to release.

See calError_t for the description of the return value.


cal_stream_sync

calError_t cal_stream_sync(
    cal_comm_t comm,
    cudaStream_t stream)
Blocks the calling thread until all outstanding device operations in stream have finished, including outstanding communication operations submitted to this stream. Use this function in place of cudaStreamSynchronize() to progress possible outstanding communication operations, as sketched below.
Parameter  Description
comm       Communicator handle.
stream     CUDA stream to synchronize.

See calError_t for the description of the return value.
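
A usage sketch is shown below; the enqueued cuSolverMp work is left as a placeholder comment, and the header name cal.h is an assumption.

#include <cuda_runtime.h>
#include <cal.h> /* assumed header name */

calError_t run_and_sync(cal_comm_t comm)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* ... enqueue cuSolverMp work on `stream` using `comm` ... */

    /* Synchronize through CAL rather than cudaStreamSynchronize() so that
       outstanding communication on this stream can make progress. */
    calError_t err = cal_stream_sync(comm, stream);

    cudaStreamDestroy(stream);
    return err;
}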


cal_get_comm_size

calError_t cal_get_comm_size(
    cal_comm_t comm,
    int* size)
Retrieves the number of processing elements (processes) in the provided communicator.
Parameter  Description
comm       Communicator handle.
size       Number of processing elements in the communicator.

See calError_t for the description of the return value.


cal_get_rank

calError_t cal_get_rank(
    cal_comm_t comm,
    int* rank)
Retrieves the zero-based rank of the calling processing element in the communicator; a combined usage sketch for cal_get_comm_size() and cal_get_rank() follows below.
Parameter  Description
comm       Communicator handle.
rank       Zero-based rank of the calling process.

See calError_t for the description of the return value.
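
A short sketch combining both queries (the header name cal.h and the surrounding function are illustrative):

#include <stdio.h>
#include <cal.h> /* assumed header name */

void print_identity(cal_comm_t comm)
{
    int size = 0, rank = 0;
    cal_get_comm_size(comm, &size); /* number of processes in the communicator */
    cal_get_rank(comm, &rank);      /* zero-based rank of the calling process */
    printf("Process %d of %d\n", rank, size);
}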