Communication abstraction library API and data types¶
The communication abstraction library is a helper module for the cuSolverMP library that sets up communication between different GPUs. The cuSolverMP API accepts a cal_comm_t communicator object and requires it to be created prior to any cuSolverMP call. At this moment the cal_comm_t communicator object can be initialized only from an MPI communicator, with some of the communication routines using the underlying MPI library; MPI therefore has to be initialized for the entire duration of library usage. Currently the library's communication supports only the case where each participating process uses a single GPU and each participating GPU is used by exactly one process.
Communication abstraction library setup¶
Currently, communication is based on the system (shared memory), the CUDA runtime (interprocess memory copies and CUDA events), NCCL (for collective calls), and MPI (as a fallback when no other transport is available). This module initializes the underlying structures required for communication, and setup returns an error if any issues occur. You can use MPI diagnostics (refer to your MPI distribution's documentation on setting the MPI verbosity level), NCCL diagnostics (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug), and this module's diagnostics to understand the possible failure reason. The cuSolverMp library tries to use the most efficient way to communicate between GPUs; however, user-provided environment variables and settings (e.g. MCA MPI parameters or NCCL environment variables) can affect the functionality and performance of the communication module.
Using communication module¶
There are a few environment variables that can change the communication module's behaviour (in addition to the underlying MPI and NCCL environment variables).
| Variable | Description |
|---|---|
| CAL_LOG_LEVEL | Verbosity level of the communication module; 0 means no output and 6 means the maximum amount of detail. Default: 0. |
| CAL_ALLOW_SET_PEER_ACCESS | If 1, allows enabling peer-to-peer (P2P) access between pairs of GPUs on the same node. Default: 0. |
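For example, these variables can be exported before launching an application. This is a sketch; `my_cusolvermp_app` is a hypothetical binary and the launcher invocation is shown commented out:

```shell
# Enable verbose CAL diagnostics and peer-to-peer access for this run.
export CAL_LOG_LEVEL=4             # 0 (silent) .. 6 (most verbose)
export CAL_ALLOW_SET_PEER_ACCESS=1 # allow enabling P2P between GPUs on a node

# Hypothetical launch, one process per GPU:
# mpirun -np 2 ./my_cusolvermp_app
```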
Communication abstraction library data types¶
calError_t¶
Return values from communication abstraction library APIs. The values are described in the table below:
| Value | Description |
|---|---|
| CAL_OK | Success. |
| CAL_ERROR | Generic error. |
| CAL_ERROR_INVALID_PARAMETER | Invalid parameter passed to the interface function. |
| CAL_ERROR_INTERNAL | Internal error. |
| CAL_ERROR_CUDA | Error in a CUDA runtime or driver API call. |
| CAL_ERROR_MPI | Error in an MPI call. |
| CAL_ERROR_IPC | Error in a system IPC communication call. |
| CAL_ERROR_NOT_SUPPORTED | Requested configuration or parameters are not supported. |
cal_comm_t¶
The cal_comm_t handle stores the device endpoint and resources related to communication. It must be created and destroyed using the cal_comm_create_distr() and cal_comm_destroy() functions, respectively.
Communication abstraction library API¶
cal_comm_create_distr¶
calError_t cal_comm_create_distr(
void* mpi_comm,
int local_device,
cal_comm_t* new_comm)
Creates a new communicator object from an MPI communicator. The device with ID local_device from the CUDA Runtime enumeration will be assigned to the new communicator and used for all following operations on this communicator. The total number of processes in the communicator will be equal to the number of ranks in mpi_comm.

| Parameter | Description |
|---|---|
| mpi_comm | Pointer to the MPI communicator that will be used for communicator setup. |
| local_device | Local device id that will be assigned to the new communicator. Should be the same as the device of the active context. |
| new_comm | Pointer where the new communicator handle will be stored. |
See calError_t for the description of the return value.
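A minimal sketch of the setup flow described above, assuming one process per GPU with ranks packed per node; the `<cal.h>` header name is an assumption (check your cuSolverMp package), and error handling is abbreviated:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <cal.h>  /* header name is an assumption; check your cuSolverMp distribution */

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Assumption: one process per GPU with ranks packed per node,
       so the local device id is the rank modulo the GPU count. */
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    int local_device = rank % num_devices;
    cudaSetDevice(local_device);

    /* The API takes a pointer to the MPI communicator. */
    MPI_Comm mpi_comm = MPI_COMM_WORLD;
    cal_comm_t comm = NULL;
    calError_t st = cal_comm_create_distr(&mpi_comm, local_device, &comm);
    if (st != CAL_OK) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... cuSolverMP calls using comm ... */

    cal_comm_destroy(comm);
    MPI_Finalize();
    return 0;
}
```

Note that the communicator is destroyed with cal_comm_destroy() before MPI_Finalize(), since the communicator may use the underlying MPI library for the duration of its lifetime.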
cal_comm_destroy¶
calError_t cal_comm_destroy(
cal_comm_t comm)
| Parameter | Description |
|---|---|
| comm | Communicator handle to release. |
See calError_t for the description of the return value.
cal_stream_sync¶
calError_t cal_stream_sync(
cal_comm_t comm,
cudaStream_t stream)
Blocks until all work enqueued into stream has completed. This includes outstanding communication operations submitted to this stream. Use this function in place of cudaStreamSynchronize to progress possible outstanding communication operations.

| Parameter | Description |
|---|---|
| comm | Communicator handle. |
| stream | CUDA stream to synchronize. |
See calError_t for the description of the return value.
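The intended usage pattern can be sketched as follows (a fragment, not a complete program; assumes a valid `comm` created earlier):

```c
/* Sketch: synchronize a stream through CAL rather than the CUDA runtime,
   so that outstanding communication on the stream is progressed. */
cudaStream_t stream;
cudaStreamCreate(&stream);

/* ... enqueue cuSolverMP work on `stream` using `comm` ... */

calError_t st = cal_stream_sync(comm, stream);  /* instead of cudaStreamSynchronize(stream) */
if (st != CAL_OK) {
    /* handle error */
}

cudaStreamDestroy(stream);
```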
cal_get_comm_size¶
calError_t cal_get_comm_size(
cal_comm_t comm,
int* size )
| Parameter | Description |
|---|---|
| comm | Communicator handle. |
| size | Number of processing elements (ranks) in the communicator. |
See calError_t for the description of the return value.
cal_get_rank¶
calError_t cal_get_rank(
cal_comm_t comm,
int* rank )
| Parameter | Description |
|---|---|
| comm | Communicator handle. |
| rank | Rank id of the calling process. |
See calError_t for the description of the return value.
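A short sketch querying both the communicator size and the caller's rank (a fragment assuming a valid `comm`; this is a common pattern when distributing work across processes):

```c
/* Fragment: query process-grid information from an existing CAL communicator. */
int nranks = 0, rank = 0;
if (cal_get_comm_size(comm, &nranks) != CAL_OK ||
    cal_get_rank(comm, &rank) != CAL_OK) {
    /* handle error */
}
printf("process %d of %d\n", rank, nranks);
```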