Migrating from CAL to NCCL#
This guide describes how to migrate cuBLASMp applications from CAL (Communication Abstraction Layer) to NCCL (NVIDIA Collective Communications Library).
Overview of Changes#
The migration from CAL to NCCL involves several key changes in the initialization and synchronization patterns:
Simplified initialization - NCCL requires fewer setup steps.
Different communicator management - Direct NCCL communicator instead of CAL wrapper.
Change in grid and redistribution API calls - The type of the communicator parameter has been changed from cal_comm_t to ncclComm_t.
Modified synchronization calls - Using CUDA streams instead of CAL-specific synchronization.
Key Differences#
Header Changes#
CAL Version:
#include <cal.h>
NCCL Version:
// NCCL headers are included automatically with cuBLASMp
Variable Declarations#
CAL Version:
cal_comm_t cal_comm = nullptr;
NCCL Version:
ncclUniqueId id;
ncclComm_t nccl_comm;
Initialization Process#
The CAL initialization process involves creating a communicator with explicit parameters:
cal_comm_t cal_comm = nullptr;
#ifdef USE_CAL_MPI
CAL_CHECK(cal_comm_create_mpi(MPI_COMM_WORLD, rank, nranks, local_device, &cal_comm));
#else
cal_comm_create_params_t params;
params.allgather = allgather;
params.req_test = request_test;
params.req_free = request_free;
params.data = (void*)(MPI_COMM_WORLD);
params.rank = rank;
params.nranks = nranks;
params.local_device = local_device;
CAL_CHECK(cal_comm_create(params, &cal_comm));
#endif
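The NCCL initialization is more streamlined: rank 0 creates a unique ID, broadcasts it to all ranks over MPI, and each rank then joins the communicator with ncclCommInitRank. Unlike cal_comm_create, ncclCommInitRank takes no device argument and uses the current CUDA device, so select the device (for example with cudaSetDevice(local_device)) before calling it: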
ncclUniqueId id;
if (rank == 0)
{
    NCCL_CHECK(ncclGetUniqueId(&id));
}
MPI_CHECK(MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));
ncclComm_t nccl_comm;
NCCL_CHECK(ncclCommInitRank(&nccl_comm, nranks, id, rank));
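The snippets in this guide use NCCL_CHECK, MPI_CHECK, and CUDA_CHECK without defining them. The following is a minimal sketch of how such macros could be written, not the definitions used by the cuBLASMp samples. The helper name check_status and the choice to throw std::runtime_error are assumptions made for illustration; ncclSuccess, MPI_SUCCESS, and cudaSuccess are the success codes defined by the respective libraries.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws std::runtime_error when `status` differs from
// `success`. Both values are compared as integers, so it accepts plain enums
// such as ncclResult_t and cudaError_t as well as the int returned by MPI.
template <typename Status, typename Expected>
void check_status(Status status, Expected success, const char* what,
                  const char* file, int line)
{
    const long long got = static_cast<long long>(status);
    const long long want = static_cast<long long>(success);
    if (got == want)
    {
        return;
    }
    std::ostringstream msg;
    msg << what << " failed at " << file << ":" << line
        << " with status " << got;
    throw std::runtime_error(msg.str());
}

// Assumed macro definitions. A macro body is only compiled where the macro is
// used, so the matching header (<nccl.h>, <mpi.h>, or <cuda_runtime.h>) must
// be included in any file that calls the corresponding macro.
#define NCCL_CHECK(call) check_status((call), ncclSuccess, #call, __FILE__, __LINE__)
#define MPI_CHECK(call) check_status((call), MPI_SUCCESS, #call, __FILE__, __LINE__)
#define CUDA_CHECK(call) check_status((call), cudaSuccess, #call, __FILE__, __LINE__)
```

Because these macros throw, they should not be used inside destructors. An application that prefers to abort on failure could print the message and call std::abort instead.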
Other API changes#
cublasMpGridCreate and the matrix redistribution functions (cublasMpGemr2D, cublasMpTrmr2D, and the corresponding cublasMpGemr2D_bufferSize and cublasMpTrmr2D_bufferSize) accept the communicator as one of their parameters. The type of that parameter has changed from cal_comm_t to ncclComm_t.
CAL Version:
CUBLASMP_CHECK(cublasMpGridCreate(nprow, npcol, CUBLASMP_GRID_LAYOUT_COL_MAJOR, cal_comm, &grid));
NCCL Version:
CUBLASMP_CHECK(cublasMpGridCreate(nprow, npcol, CUBLASMP_GRID_LAYOUT_COL_MAJOR, nccl_comm, &grid));
Synchronization Changes#
CAL uses specialized synchronization functions:
CAL_CHECK(cal_stream_sync(cal_comm, stream));
CAL_CHECK(cal_comm_barrier(cal_comm, stream));
NCCL relies on standard CUDA stream synchronization:
CUDA_CHECK(cudaStreamSynchronize(stream));
Cleanup Changes#
CAL:
CAL_CHECK(cal_comm_barrier(cal_comm, stream));
CAL_CHECK(cal_comm_destroy(cal_comm));
NCCL:
NCCL_CHECK(ncclCommDestroy(nccl_comm));
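With the communicator now owned directly by the application, one way to make sure ncclCommDestroy runs on every exit path is a small scope guard. The sketch below is generic C++ and is not part of NCCL or cuBLASMp. The names ScopeExit and make_scope_exit are hypothetical, and the NCCL usage appears only as a comment because it assumes the nccl_comm variable from the initialization step.

```cpp
#include <utility>

// Hypothetical helper: runs a cleanup callable when the guard leaves scope.
template <typename F>
class ScopeExit
{
public:
    explicit ScopeExit(F f) : f_(std::move(f)), active_(true) {}

    // Moving transfers responsibility for the cleanup to the new guard.
    ScopeExit(ScopeExit&& other) : f_(std::move(other.f_)), active_(other.active_)
    {
        other.active_ = false;
    }

    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;
    ScopeExit& operator=(ScopeExit&&) = delete;

    ~ScopeExit()
    {
        if (active_)
        {
            f_();
        }
    }

private:
    F f_;
    bool active_;
};

template <typename F>
ScopeExit<F> make_scope_exit(F f)
{
    return ScopeExit<F>(std::move(f));
}

// Assumed usage, placed right after ncclCommInitRank succeeds:
//
//     auto comm_guard = make_scope_exit([&] { ncclCommDestroy(nccl_comm); });
//
// The return value is deliberately not passed to a throwing check macro,
// since the cleanup runs inside a destructor.
```

Guards are destroyed in reverse order of declaration, so a guard declared later for an object that uses the communicator, such as the grid, runs before the communicator guard.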
Migration Checklist#
Remove CAL includes - Remove CAL-related includes such as #include <cal.h> and #include <cal_mpi.h>.
Update variable declarations - Replace cal_comm_t with ncclComm_t and ncclUniqueId.
Modify initialization - Replace CAL communicator creation with NCCL initialization pattern.
Update grid and redistribution API calls - Pass NCCL communicator instead of CAL communicator.
Change synchronization - Replace CAL-specific synchronization calls with cudaStreamSynchronize.
Update cleanup - Use NCCL finalization and destruction functions.