Communication Abstraction Library usage

Communication Abstraction Library (CAL) is a helper module for the cuBLASMp library that lets it communicate efficiently between GPUs. The cuBLASMp grid creation API accepts a cal_comm_t communicator object, which must be created before any cuBLASMp call. Currently, CAL supports only the case where each participating process uses a single GPU and each participating GPU is used by a single process.


Communication abstraction library

Communications description

To initialize a cal_comm_t communicator handle, follow the bootstrapping process: see the cal_comm_create() function, or the example of creating a communicator handle with MPI.

The main communication backend used by cuBLASMp is the modular OpenUCC library. The OpenUCC transport modules used at runtime (such as OpenUCX, NCCL, and others) and their configuration can be controlled in various ways, for example through the environment variables that affect NCCL (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) and OpenUCX (https://openucx.readthedocs.io/en/master/faq.html). Ask your platform provider whether there are known optimized settings for these dependencies. Otherwise, cuBLASMp and UCC use default values.

Because of the nature of the underlying communications, there are a few restrictions to keep in mind when using cuBLASMp:

  • Only one cuBLASMp routine can be in flight at any point in time in a process. It is possible, however, to create and keep multiple communicator and cuBLASMp handles, but it is the user's responsibility to ensure that executions of cuBLASMp routines do not overlap.

  • If you are using the NCCL communication library, NCCL collective calls should not overlap with cuBLASMp calls, to avoid possible deadlocks.
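
One way an application might enforce the first restriction is a process-wide "in flight" flag. The following is a minimal sketch, not part of cuBLASMp or CAL: the names routine_in_flight, try_begin_routine and end_routine are hypothetical, and the sketch assumes the caller only calls end_routine() once the routine has really finished (for example, after synchronizing its stream).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical process-wide flag: set while a cuBLASMp routine is in flight. */
static atomic_flag routine_in_flight = ATOMIC_FLAG_INIT;

/* Returns true if the caller may start a routine now,
 * false if another routine is still in flight. */
bool try_begin_routine(void)
{
    /* test_and_set returns the previous value: false means the flag was free. */
    return !atomic_flag_test_and_set(&routine_in_flight);
}

/* Call once the routine has completed, so the next one may start. */
void end_routine(void)
{
    atomic_flag_clear(&routine_in_flight);
}
```

A caller would wrap each cuBLASMp routine between try_begin_routine() and end_routine(), and retry or report an error when try_begin_routine() returns false. The same guard could also cover the application's own NCCL collectives.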

Configuring communications submodule

A few environment variables can change the behaviour of the communication module.

Variable

Description

CAL_LOG_LEVEL

Verbosity level of the communication module: 0 means no output and 6 means the maximum amount of detail. Default: 0.

UCC_CONFIG_FILE

Custom config file for the UCC library. It can control the underlying transports and collective parameters. Refer to the UCC documentation for details on the configuration syntax. Default: the built-in cuBLASMp UCC configuration.

By default, cuBLASMp uses the UCC configuration file provided in the release package, share/ucc.conf. The library loads it using a path relative to the location of the cublasMp shared object.
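
An application may want to check the CAL_LOG_LEVEL value itself before startup, for instance to warn about a typo in a job script. The sketch below is an application-side check only. It does not describe how CAL parses the variable; it just encodes the documented range (0 to 6) and default (0). The helper name parse_cal_log_level and the APP_* constants are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Documented range for CAL_LOG_LEVEL: 0 (no output) to 6 (maximum detail). */
enum { APP_CAL_LOG_MIN = 0, APP_CAL_LOG_MAX = 6 };

/* Hypothetical helper: interprets a CAL_LOG_LEVEL string.
 * NULL (variable not set) maps to the documented default 0.
 * Anything that is not an integer in [0, 6] yields -1. */
int parse_cal_log_level(const char* value)
{
    if (value == NULL) return APP_CAL_LOG_MIN;

    char* end = NULL;
    errno = 0;
    long level = strtol(value, &end, 10);

    /* Reject overflow, empty input, and trailing characters. */
    if (errno != 0 || end == value || *end != '\0') return -1;
    if (level < APP_CAL_LOG_MIN || level > APP_CAL_LOG_MAX) return -1;
    return (int)level;
}
```

Typical use would be parse_cal_log_level(getenv("CAL_LOG_LEVEL")), treating -1 as a configuration error to report to the user.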


Creating communicator handle with MPI

If your application uses the Message Passing Interface (MPI) for distributed communication, here is how it can be used to bootstrap a Communication Abstraction Library communicator:

#include "cal.h"
#include "cublasMp.h"
#include "mpi.h"

calError_t allgather(void* src_buf, void* recv_buf, size_t size, void* data, void** request)
{
    MPI_Request req;
    int err = MPI_Iallgather(src_buf, size, MPI_BYTE, recv_buf, size, MPI_BYTE, (MPI_Comm)data, &req);
    if (err != MPI_SUCCESS)
    {
        return CAL_ERROR;
    }
    *request = (void*)req;
    return CAL_OK;
}

calError_t request_test(void* request)
{
    MPI_Request req = (MPI_Request)request;
    int         completed;
    int         err = MPI_Test(&req, &completed, MPI_STATUS_IGNORE);
    if (err != MPI_SUCCESS)
    {
        return CAL_ERROR;
    }
    return completed ? CAL_OK : CAL_ERROR_INPROGRESS;
}

calError_t request_free(void* request)
{
    return CAL_OK;
}

calError_t cal_comm_create_mpi(MPI_Comm mpi_comm, int rank, int nranks, int local_device, cal_comm_t* comm)
{
    cal_comm_create_params_t params;
    params.allgather = allgather;
    params.req_test = request_test;
    params.req_free = request_free;
    params.data = (void*)mpi_comm;
    params.rank = rank;
    params.nranks = nranks;
    params.local_device = local_device;
    return cal_comm_create(params, comm);
}

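int main(int argc, char** argv)
{
    // Initialize MPI, create some distributed data
    MPI_Init(...);
    int rank, size;
    MPI_Comm mpi_comm = MPI_COMM_WORLD;
    MPI_Comm_rank(mpi_comm, &rank);
    MPI_Comm_size(mpi_comm, &size);

    // GPU used by this process. 0 is a placeholder: pick the device that fits
    // your launch configuration, and create its device context
    // (for example, cudaSetDevice(local_device); cudaFree(0);) before using CAL.
    int local_device = 0;

    cal_comm_t cal_comm;

    // creating communicator handle with MPI communicator
    cal_comm_create_mpi(mpi_comm, rank, size, local_device, &cal_comm);

    // using cuBLASMp with cal_comm handle
    cublasMpCreateDeviceGrid(..., cal_comm, ...);

    // destroying communicator handle
    cal_comm_destroy(cal_comm);

    MPI_Finalize();
    return 0;
}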

For convenience, these MPI helper functions are also provided in source form in the src folder of the release package.

Communication abstraction library data types

calError_t

Return values from communication abstraction library APIs. The values are described in the table below:

Value

Description

CAL_OK

Success.

CAL_ERROR

Generic error.

CAL_ERROR_INVALID_PARAMETER

Invalid parameter to the interface function.

CAL_ERROR_INTERNAL

Internal error.

CAL_ERROR_CUDA

Error in CUDA runtime or driver API.

CAL_ERROR_IPC

Error in system IPC communication call.

CAL_ERROR_UCC

Error in UCC call.

CAL_ERROR_NOT_SUPPORTED

Requested configuration or parameters are not supported.

CAL_ERROR_BACKEND

Error in a general backend dependency. Run with a verbose log level to see the detailed error message.

CAL_ERROR_INPROGRESS

Operation is still in progress.
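
When reporting failures it is often handy to print the name of a status code rather than its number. The following sketch shows one way to do that. The helper cal_error_name is hypothetical and not part of CAL. The enum is a stand-in so the sketch compiles on its own: real code should include "cal.h" instead, and the numeric values here are placeholders, not the library's.

```c
/* Stand-in declaration so this sketch compiles on its own. Real code should
 * include "cal.h" instead; the numeric values below are placeholders. */
typedef enum
{
    CAL_OK,
    CAL_ERROR,
    CAL_ERROR_INVALID_PARAMETER,
    CAL_ERROR_INTERNAL,
    CAL_ERROR_CUDA,
    CAL_ERROR_IPC,
    CAL_ERROR_UCC,
    CAL_ERROR_NOT_SUPPORTED,
    CAL_ERROR_BACKEND,
    CAL_ERROR_INPROGRESS
} calError_t;

/* Hypothetical helper (not part of CAL): maps a status code to its name. */
const char* cal_error_name(calError_t err)
{
    switch (err)
    {
        case CAL_OK:                      return "CAL_OK";
        case CAL_ERROR:                   return "CAL_ERROR";
        case CAL_ERROR_INVALID_PARAMETER: return "CAL_ERROR_INVALID_PARAMETER";
        case CAL_ERROR_INTERNAL:          return "CAL_ERROR_INTERNAL";
        case CAL_ERROR_CUDA:              return "CAL_ERROR_CUDA";
        case CAL_ERROR_IPC:               return "CAL_ERROR_IPC";
        case CAL_ERROR_UCC:               return "CAL_ERROR_UCC";
        case CAL_ERROR_NOT_SUPPORTED:     return "CAL_ERROR_NOT_SUPPORTED";
        case CAL_ERROR_BACKEND:           return "CAL_ERROR_BACKEND";
        case CAL_ERROR_INPROGRESS:        return "CAL_ERROR_INPROGRESS";
    }
    /* Reached only for values outside the enumeration. */
    return "unknown calError_t";
}
```

A caller might then log a failed call with something like fprintf(stderr, "CAL call failed: %s\n", cal_error_name(status)).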


cal_comm_t

A cal_comm_t stores the device endpoint and resources related to communication. It must be created and destroyed using the cal_comm_create() and cal_comm_destroy() functions, respectively.

cal_comm_create_params_t

typedef struct cal_comm_create_params
{
    calError_t (*allgather)(void* src_buf, void* recv_buf, size_t size, void* data, void** request);
    calError_t (*req_test)(void* request);
    calError_t (*req_free)(void* request);
    void* data;
    int   nranks;
    int   rank;
    int   local_device;
} cal_comm_create_params_t;
The cal_comm_create_params_t structure is the parameter to the communicator creation function. The user must fill this structure before calling cal_comm_create(). The fields of this structure are described below:

Field

Description

allgather

Pointer to a function that implements allgather on host memory. This function can be asynchronous with respect to the host. In that case it should create a request handle that the respective req_test and req_free functions can operate on, and a pointer to this handle should be written to the request parameter.

req_test

If the allgather function is asynchronous, this function is used to query whether the data has been exchanged and can be used by the communicator. It should return CAL_OK (0) if the exchange has finished and CAL_ERROR_INPROGRESS otherwise.

req_free

If the allgather function is asynchronous, this function is used after the data exchange has finished to free the resources used by the request handle.

data

Pointer to additional data that is passed to the allgather function at the time of the call.

nranks

Number of ranks participating in the communicator that will be created.

rank

Rank that will be assigned to the calling process in the new communicator. It should be a number in the range 0 to nranks - 1.

local_device

Local device that cuBLASMp will use with this communicator. Note that the user should create a device context before using this device in CAL or cuBLASMp calls.
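
The contract between the asynchronous allgather, req_test and req_free can be pictured as a polling loop: test until the request is no longer in progress, then free it. The sketch below illustrates that contract only. CAL drives these callbacks internally, and how it actually does so is not documented here. The names wait_for_request, countdown_test and countdown_free are hypothetical, the request is a toy counter rather than a real MPI request, and the enum is a stand-in for the two calError_t values used (real code should include "cal.h"; the value of CAL_ERROR_INPROGRESS is a placeholder).

```c
/* Stand-in for the two status codes used here; real code includes "cal.h".
 * The value of CAL_ERROR_INPROGRESS is a placeholder. */
typedef enum { CAL_OK = 0, CAL_ERROR_INPROGRESS = 1 } calError_t;

typedef calError_t (*req_fn)(void* request);

/* Hypothetical driver: polls req_test until the request completes or fails,
 * then releases the request with req_free on success. */
calError_t wait_for_request(req_fn req_test, req_fn req_free, void* request)
{
    calError_t status;
    do
    {
        status = req_test(request);
    } while (status == CAL_ERROR_INPROGRESS);

    if (status != CAL_OK) return status;
    return req_free(request);
}

/* Toy request for illustration: an int holding the number of polls
 * that still report "in progress". */
calError_t countdown_test(void* request)
{
    int* remaining = (int*)request;
    if (*remaining > 0)
    {
        --*remaining;
        return CAL_ERROR_INPROGRESS;
    }
    return CAL_OK;
}

calError_t countdown_free(void* request)
{
    (void)request; /* nothing was allocated for the toy request */
    return CAL_OK;
}
```

With a counter initialized to 3, wait_for_request(countdown_test, countdown_free, &counter) polls four times and then returns CAL_OK. A synchronous allgather would simply complete before returning, so its req_test can always return CAL_OK.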


Communication abstraction library API

cal_comm_create

calError_t cal_comm_create(
    cal_comm_create_params_t params,
    cal_comm_t* new_comm)
The handle created by this function is required for using the cuBLASMp API. Note that the user should create a device context for the device specified in the create parameters before using it in CAL or cuBLASMp calls. The easiest way to ensure that the device context is created is to call cudaSetDevice(device_id); cudaFree(0). See the cal_comm_create_params_t documentation for instructions on how to fill this structure. The MPI example and the cuBLASMp samples show how a CAL communicator can be created.

Parameter

Description

params

A struct with parameters required for initializing the communicator. See cal_comm_create_params_t.

new_comm

Pointer where to store new communicator handle.

See calError_t for the description of the return value.
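
Because each process must use its own GPU, a common way to choose local_device is from the node-local rank of the process. The sketch below assumes the application already knows that rank (for example, from an MPI communicator split with MPI_COMM_TYPE_SHARED) and the number of visible GPUs (for example, from cudaGetDeviceCount). The helper name pick_local_device is hypothetical and not part of CAL.

```c
/* Hypothetical helper: chooses the GPU for a process from its node-local rank.
 * Returns the device index, or -1 when the inputs are invalid or there are
 * more processes on the node than GPUs (a GPU cannot be shared by processes). */
int pick_local_device(int local_rank, int device_count)
{
    if (local_rank < 0 || device_count <= 0) return -1;
    if (local_rank >= device_count) return -1;
    return local_rank;
}
```

The returned index would then be passed to cudaSetDevice() to create the device context and stored in the local_device field of the create parameters.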


cal_comm_destroy

calError_t cal_comm_destroy(
    cal_comm_t comm)
Releases the resources associated with the provided communicator handle.

Parameter

Description

comm

Communicator handle to release.

See calError_t for the description of the return value.


cal_stream_sync

calError_t cal_stream_sync(
    cal_comm_t comm,
    cudaStream_t stream)
Blocks the calling thread until all outstanding device operations in stream have finished. Use this function in place of cudaStreamSynchronize so that possible outstanding communication operations for the communicator can make progress.

Parameter

Description

comm

Communicator handle.

stream

CUDA stream to synchronize.

See calError_t for the description of the return value.


cal_get_comm_size

calError_t cal_get_comm_size(
    cal_comm_t comm,
    int* size )
Retrieves the number of processing elements in the provided communicator.

Parameter

Description

comm

Communicator handle.

size

Pointer that receives the number of processing elements.

See calError_t for the description of the return value.


cal_get_rank

calError_t cal_get_rank(
    cal_comm_t comm,
    int* rank )
Retrieves the rank of the calling processing element in the communicator (0-based).

Parameter

Description

comm

Communicator handle.

rank

Pointer that receives the rank ID of the calling process.

See calError_t for the description of the return value.