Migrating from CAL to NCCL#

This guide describes how to migrate cuBLASMp applications from CAL (Communication Abstraction Layer) to NCCL (NVIDIA Collective Communications Library).

Overview of Changes#

The migration from CAL to NCCL changes how the communicator is initialized, passed to cuBLASMp, synchronized, and destroyed:

  • Simplified initialization - NCCL requires fewer setup steps.

  • Different communicator management - Direct NCCL communicator instead of CAL wrapper.

  • Change in grid and redistribution API calls - The type of the communicator parameter has been changed from cal_comm_t to ncclComm_t.

  • Modified synchronization calls - Using CUDA streams instead of CAL-specific synchronization.

Key Differences#

Header Changes#

CAL Version:

#include <cal.h>

NCCL Version:

// No separate NCCL include is needed; <cublasmp.h> brings in the NCCL headers
#include <cublasmp.h>

Variable Declarations#

CAL Version:

cal_comm_t cal_comm = nullptr;

NCCL Version:

ncclUniqueId id;
ncclComm_t nccl_comm;

Initialization Process#

The CAL initialization process involves creating a communicator with explicit parameters:

cal_comm_t cal_comm = nullptr;
#ifdef USE_CAL_MPI
    CAL_CHECK(cal_comm_create_mpi(MPI_COMM_WORLD, rank, nranks, local_device, &cal_comm));
#else
    cal_comm_create_params_t params;
    params.allgather = allgather;
    params.req_test = request_test;
    params.req_free = request_free;
    params.data = (void*)(MPI_COMM_WORLD);
    params.rank = rank;
    params.nranks = nranks;
    params.local_device = local_device;
    CAL_CHECK(cal_comm_create(params, &cal_comm));
#endif

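NCCL initialization needs only a unique ID shared by all ranks: rank 0 generates the ID, MPI broadcasts it, and each rank then calls ncclCommInitRank. Unlike cal_comm_create, ncclCommInitRank takes no device argument; the communicator is bound to the current CUDA device, so select it with cudaSetDevice(local_device) beforehand: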

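ncclUniqueId id;

// Rank 0 generates the unique ID that identifies the new communicator.
if (rank == 0)
{
    NCCL_CHECK(ncclGetUniqueId(&id));
}

// Every rank needs the same ID, so broadcast it from rank 0 over MPI.
MPI_CHECK(MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));

// Collective call: all nranks ranks must participate.
ncclComm_t nccl_comm;
NCCL_CHECK(ncclCommInitRank(&nccl_comm, nranks, id, rank));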
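
The snippets in this guide rely on NCCL_CHECK, MPI_CHECK, CUDA_CHECK and CUBLASMP_CHECK without showing their definitions. A minimal sketch of one way to define such macros is given below. The names check_status and CHECK_STATUS are hypothetical and not part of NCCL, MPI or cuBLASMp, and throwing an exception is only one possible error policy.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws std::runtime_error when `status` differs
// from the library's success value, reporting the failing expression.
template <typename Status>
void check_status(Status status, Status success, const char* lib,
                  const char* expr, const char* file, int line)
{
    if (status != success) {
        std::ostringstream msg;
        msg << lib << " error " << static_cast<long long>(status)
            << " in `" << expr << "` at " << file << ":" << line;
        throw std::runtime_error(msg.str());
    }
}

// Generic wrapper that captures the call text and source location.
#define CHECK_STATUS(call, success, lib) \
    check_status((call), (success), (lib), #call, __FILE__, __LINE__)

// With <nccl.h>, <mpi.h> and <cuda_runtime.h> included, the macros used
// in this guide could then be defined along these lines:
//   #define NCCL_CHECK(call) CHECK_STATUS(call, ncclSuccess, "NCCL")
//   #define CUDA_CHECK(call) CHECK_STATUS(call, cudaSuccess, "CUDA")
//   #define MPI_CHECK(call)  CHECK_STATUS(call, MPI_SUCCESS, "MPI")
```

An application that prefers to stop all ranks on failure could print the message and call MPI_Abort instead of throwing.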

Other API Changes#

cublasMpGridCreate and the matrix redistribution functions (cublasMpGemr2D, cublasMpTrmr2D, and the corresponding cublasMpGemr2D_bufferSize and cublasMpTrmr2D_bufferSize) take the communicator as one of their parameters. The type of that parameter has changed from cal_comm_t to ncclComm_t.

CAL Version:

CUBLASMP_CHECK(cublasMpGridCreate(nprow, npcol, CUBLASMP_GRID_LAYOUT_COL_MAJOR, cal_comm, &grid));

NCCL Version:

CUBLASMP_CHECK(cublasMpGridCreate(nprow, npcol, CUBLASMP_GRID_LAYOUT_COL_MAJOR, nccl_comm, &grid));
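
Only the communicator argument changes in this call; nprow, npcol and the layout keep their meaning. As a reminder of what a column-major layout implies, the sketch below maps a rank to its grid position, assuming the conventional ordering in which consecutive ranks fill a column first. GridCoord and col_major_coord are hypothetical helpers for illustration, not cuBLASMp functions.

```cpp
#include <stdexcept>

// Hypothetical helper type: position of one rank in the process grid.
struct GridCoord {
    int row;
    int col;
};

// Assumes column-major ordering: ranks 0..nprow-1 form the first column,
// the next nprow ranks form the second column, and so on.
GridCoord col_major_coord(int rank, int nprow, int npcol)
{
    if (nprow <= 0 || npcol <= 0 || rank < 0 || rank >= nprow * npcol) {
        throw std::invalid_argument("rank does not fit in an nprow x npcol grid");
    }
    return GridCoord{rank % nprow, rank / nprow};
}
```

The range check also catches a common setup mistake: launching with a number of ranks that differs from nprow * npcol.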

Synchronization Changes#

CAL uses specialized synchronization functions:

CAL_CHECK(cal_stream_sync(cal_comm, stream));
CAL_CHECK(cal_comm_barrier(cal_comm, stream));

NCCL relies on standard CUDA stream synchronization:

CUDA_CHECK(cudaStreamSynchronize(stream));

Cleanup Changes#

CAL:

CAL_CHECK(cal_comm_barrier(cal_comm, stream));
CAL_CHECK(cal_comm_destroy(cal_comm));

NCCL:

NCCL_CHECK(ncclCommDestroy(nccl_comm));
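
Because the NCCL communicator is now owned directly by the application, it can be easy to skip ncclCommDestroy on an early return. One way to guard against that is a small scope guard, sketched below. ScopeGuard and make_scope_guard are hypothetical helpers, not part of NCCL or cuBLASMp.

```cpp
#include <utility>

// Hypothetical helper: runs a cleanup callable once when the guard
// leaves scope, unless ownership was moved to another guard.
template <typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F cleanup)
        : cleanup_(std::move(cleanup)), active_(true) {}

    ScopeGuard(ScopeGuard&& other)
        : cleanup_(std::move(other.cleanup_)), active_(other.active_)
    {
        other.active_ = false;  // the moved-from guard must not fire
    }

    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

    ~ScopeGuard()
    {
        if (active_) {
            cleanup_();
        }
    }

private:
    F cleanup_;
    bool active_;
};

// Factory function so the callable type can be deduced.
template <typename F>
ScopeGuard<F> make_scope_guard(F cleanup)
{
    return ScopeGuard<F>(std::move(cleanup));
}
```

In a migrated application the guard could be created right after ncclCommInitRank with a lambda that calls ncclCommDestroy(nccl_comm). The lambda should call the function directly rather than through a throwing check macro, since it runs inside a destructor.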

Migration Checklist#

  1. Remove CAL includes - Remove CAL-related includes such as #include <cal.h> and #include <cal_mpi.h>.

  2. Update variable declarations - Replace the cal_comm_t variable with an ncclComm_t communicator and the ncclUniqueId used to initialize it.

  3. Modify initialization - Replace CAL communicator creation with NCCL initialization pattern.

  4. Update grid and redistribution API calls - Pass NCCL communicator instead of CAL communicator.

  5. Change synchronization - Replace CAL-specific synchronization calls (cal_stream_sync, cal_comm_barrier) with cudaStreamSynchronize.

  6. Update cleanup - Replace cal_comm_destroy and the cal_comm_barrier that precedes it with ncclCommDestroy.