NCCL Initialization#
This guide covers the initialization of cuBLASMp using NCCL (NVIDIA Collective Communications Library) as the communication backend.
For detailed information about NCCL and its usage, please refer to the NCCL User Guide (Creating a Communicator).
Data Types#
The following NCCL-specific types are used in initialization:
ncclUniqueId id; // Unique identifier for NCCL communicator
ncclComm_t comm; // NCCL communicator handle
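The snippets in this guide also rely on CUDA_CHECK, NCCL_CHECK, MPI_CHECK, and CUBLASMP_CHECK error-checking macros, which none of these libraries provide. One possible minimal definition is sketched below; the CHECK_IMPL helper name is illustrative, and the cuBLASMp success constant is assumed to be CUBLASMP_STATUS_SUCCESS:

```cpp
#include <cstdio>
#include <cstdlib>

// Generic error-checking pattern: evaluate the call, compare against the
// library's success constant, and abort with a diagnostic on failure.
#define CHECK_IMPL(call, success, name)                                    \
    do {                                                                   \
        auto status_ = (call);                                             \
        if (status_ != (success)) {                                        \
            std::fprintf(stderr, "%s call failed at %s:%d (status %d)\n",  \
                         (name), __FILE__, __LINE__, (int)(status_));      \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Per-library wrappers, assuming the usual success constants.
#define CUDA_CHECK(call)     CHECK_IMPL((call), cudaSuccess, "CUDA")
#define NCCL_CHECK(call)     CHECK_IMPL((call), ncclSuccess, "NCCL")
#define MPI_CHECK(call)      CHECK_IMPL((call), MPI_SUCCESS, "MPI")
#define CUBLASMP_CHECK(call) CHECK_IMPL((call), CUBLASMP_STATUS_SUCCESS, "cuBLASMp")
```

Aborting on the first failed call keeps multi-process examples simple; a production application may prefer to propagate errors instead.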
Basic Initialization Pattern#
Step 1: MPI Setup#
Initialize MPI and get rank information:
MPI_Init(nullptr, nullptr);
int rank, nranks;
MPI_Comm_size(MPI_COMM_WORLD, &nranks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
Step 2: CUDA Device Setup#
Set up the local CUDA device:
const int local_device = getLocalDevice();
CUDA_CHECK(cudaSetDevice(local_device));
CUDA_CHECK(cudaFree(nullptr)); // Initialize CUDA context
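The getLocalDevice() helper is not defined by this guide. A hedged sketch of one common implementation is shown below: it derives a node-local rank with the standard MPI-3 shared-memory split and maps that rank onto the node's visible GPUs.

```
#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative sketch (not part of any library): map each rank's node-local
// index onto the GPUs visible on that node.
int getLocalDevice()
{
    // Communicator containing only the ranks running on this node.
    MPI_Comm local_comm = MPI_COMM_NULL;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local_comm);

    int local_rank = 0;
    MPI_Comm_rank(local_comm, &local_rank);
    MPI_Comm_free(&local_comm);

    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    // Round-robin node-local ranks onto the available devices.
    return device_count > 0 ? local_rank % device_count : 0;
}
```

Launchers often export an equivalent node-local rank directly (for example via an environment variable), which can replace the shared-memory split.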
Step 3: NCCL Communicator Creation#
Create the NCCL communicator using a unique ID:
ncclUniqueId id;
// Rank 0 generates the unique ID
if (rank == 0)
{
NCCL_CHECK(ncclGetUniqueId(&id));
}
// Broadcast the ID to all ranks
MPI_CHECK(MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));
// Initialize NCCL communicator
ncclComm_t comm;
NCCL_CHECK(ncclCommInitRank(&comm, nranks, id, rank));
Step 4: CUDA Stream Creation#
Create a CUDA stream for cuBLASMp operations:
cudaStream_t stream = nullptr;
CUDA_CHECK(cudaStreamCreate(&stream));
Step 5: cuBLASMp Handle Creation#
Create the cuBLASMp handle:
cublasMpHandle_t handle = nullptr;
CUBLASMP_CHECK(cublasMpCreate(&handle, stream));
Step 6: Creating cuBLASMp Process Grids#
Create one or more process grids for cuBLASMp:
cublasMpGrid_t grid = nullptr;
CUBLASMP_CHECK(cublasMpGridCreate(nprow, npcol, CUBLASMP_GRID_LAYOUT_COL_MAJOR, comm, &grid));
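The nprow and npcol values above are assumed to have been chosen beforehand so that nprow × npcol matches the number of ranks in the communicator. One common heuristic (illustrative, not mandated by cuBLASMp; the chooseGridDims helper is hypothetical) is the most nearly square factorization:

```cpp
#include <cmath>
#include <cstdint>

// Illustrative helper: choose nprow x npcol == nranks as close to square
// as possible, by taking the largest divisor of nranks not exceeding
// sqrt(nranks).
void chooseGridDims(int nranks, int64_t* nprow, int64_t* npcol)
{
    int64_t p = static_cast<int64_t>(std::sqrt(static_cast<double>(nranks)));
    while (nranks % p != 0) --p;  // largest divisor <= sqrt(nranks)
    *nprow = p;
    *npcol = nranks / p;
}
```

For example, 4 ranks yield a 2 × 2 grid and 6 ranks a 2 × 3 grid; squarer grids generally balance row and column communication.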
Step 7: cuBLASMp API Calls#
Use the cuBLASMp API to perform the desired distributed matrix operations on the handle created above.
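For illustration, the sketch below outlines a distributed double-precision GEMM following the pattern of the cuBLASMp samples. Treat the details as assumptions to be verified against the cuBLASMp API reference: the descriptor and GEMM entry-point signatures, the ScaLAPACK-style 1-based submatrix offsets, and the workspace-query pattern. Global sizes m, n, k, block sizes mb and nb, local leading dimensions llda/lldb/lldc, and the local device buffers d_A, d_B, d_C are presumed to have been set up already.

```
// Assumed API shapes -- verify against the cuBLASMp API reference.
// Describe the global matrices with 2D block-cyclic descriptors over the grid.
cublasMpMatrixDescriptor_t descA = nullptr, descB = nullptr, descC = nullptr;
CUBLASMP_CHECK(cublasMpMatrixDescriptorCreate(m, k, mb, nb, 0, 0, llda, CUDA_R_64F, grid, &descA));
CUBLASMP_CHECK(cublasMpMatrixDescriptorCreate(k, n, mb, nb, 0, 0, lldb, CUDA_R_64F, grid, &descB));
CUBLASMP_CHECK(cublasMpMatrixDescriptorCreate(m, n, mb, nb, 0, 0, lldc, CUDA_R_64F, grid, &descC));

const double alpha = 1.0;
const double beta = 0.0;

// Query workspace sizes, allocate, then launch the GEMM on the handle's stream.
size_t workspaceInBytesOnDevice = 0;
size_t workspaceInBytesOnHost = 0;
CUBLASMP_CHECK(cublasMpGemm_bufferSize(
    handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
    &alpha, d_A, 1, 1, descA, d_B, 1, 1, descB,
    &beta, d_C, 1, 1, descC, CUBLAS_COMPUTE_64F,
    &workspaceInBytesOnDevice, &workspaceInBytesOnHost));

void* d_work = nullptr;
CUDA_CHECK(cudaMallocAsync(&d_work, workspaceInBytesOnDevice, stream));
void* h_work = malloc(workspaceInBytesOnHost);

CUBLASMP_CHECK(cublasMpGemm(
    handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
    &alpha, d_A, 1, 1, descA, d_B, 1, 1, descB,
    &beta, d_C, 1, 1, descC, CUBLAS_COMPUTE_64F,
    d_work, workspaceInBytesOnDevice, h_work, workspaceInBytesOnHost));

// Wait for completion before releasing the workspaces and descriptors.
CUDA_CHECK(cudaStreamSynchronize(stream));
CUDA_CHECK(cudaFreeAsync(d_work, stream));
free(h_work);
CUBLASMP_CHECK(cublasMpMatrixDescriptorDestroy(descA));
CUBLASMP_CHECK(cublasMpMatrixDescriptorDestroy(descB));
CUBLASMP_CHECK(cublasMpMatrixDescriptorDestroy(descC));
```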
Synchronization#
cuBLASMp operations are enqueued on the stream supplied at handle creation; use standard CUDA stream synchronization to wait for their completion:
CUDA_CHECK(cudaStreamSynchronize(stream));
Cleanup#
Proper cleanup is required for NCCL-based applications:
// Destroy cuBLASMp objects first
CUBLASMP_CHECK(cublasMpGridDestroy(grid));
CUBLASMP_CHECK(cublasMpDestroy(handle));
// Destroy NCCL communicator
NCCL_CHECK(ncclCommDestroy(comm));
// Clean up CUDA resources
CUDA_CHECK(cudaStreamDestroy(stream));
// Finalize MPI
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();