NCCL Initialization

This guide covers the initialization of cuSOLVERMp using NCCL (NVIDIA Collective Communications Library) as the communication backend.

For details on NCCL and its usage, see the "Creating a Communicator" section of the NCCL User Guide.

Data Types

The following NCCL-specific types are used in initialization:

ncclUniqueId id;        // Unique identifier for NCCL communicator
ncclComm_t comm;        // NCCL communicator handle

Basic Initialization Pattern

Step 1: MPI Setup

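Initialize MPI, then query the total number of ranks and the rank of the calling process. In this pattern, MPI launches the processes and later broadcasts the NCCL unique ID:

MPI_CHECK(MPI_Init(nullptr, nullptr));

// Total number of ranks and this process's rank in MPI_COMM_WORLD
int rank, nranks;
MPI_CHECK(MPI_Comm_size(MPI_COMM_WORLD, &nranks));
MPI_CHECK(MPI_Comm_rank(MPI_COMM_WORLD, &rank));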

Step 2: CUDA Device Setup

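Select the GPU this process will use and initialize its CUDA context:

// getLocalDevice() is an application-provided helper, not a CUDA or cuSOLVERMp API;
// it returns the index of the GPU assigned to this rank
const int local_device = getLocalDevice();
CUDA_CHECK(cudaSetDevice(local_device));
CUDA_CHECK(cudaFree(nullptr));  // No-op free that forces CUDA context creation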
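One possible shape for a helper like getLocalDevice() is sketched below. It is an assumption, not the implementation used by this guide: it reads the node-local rank from environment variables set by common launchers (Open MPI, MVAPICH2, Slurm) and maps it onto the available GPUs. The names localRankFromEnv and pickLocalDevice are hypothetical, and the device count is passed in as a parameter so the sketch compiles without CUDA.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: returns the node-local rank reported by the launcher,
// or -1 if none of the known variables holds a non-negative integer.
inline int localRankFromEnv()
{
    const char* vars[] = {"OMPI_COMM_WORLD_LOCAL_RANK",  // Open MPI
                          "MV2_COMM_WORLD_LOCAL_RANK",   // MVAPICH2
                          "SLURM_LOCALID"};              // Slurm (srun)
    for (const char* name : vars)
    {
        const char* value = std::getenv(name);
        if (value == nullptr)
        {
            continue;
        }
        char* end = nullptr;
        const long parsed = std::strtol(value, &end, 10);
        if (end != value && *end == '\0' && parsed >= 0)
        {
            return static_cast<int>(parsed);
        }
    }
    return -1;
}

// Hypothetical helper: maps a node-local rank onto one of device_count GPUs.
// Falls back to device 0 when the local rank is unknown (negative).
inline int pickLocalDevice(int local_rank, int device_count)
{
    assert(device_count > 0);
    if (local_rank < 0)
    {
        return 0;
    }
    return local_rank % device_count;
}
```

In a real application, device_count would typically come from cudaGetDeviceCount, and the result of pickLocalDevice(localRankFromEnv(), device_count) would be passed to cudaSetDevice. The modulo wrap lets several ranks share a GPU when there are more ranks per node than devices; whether that is acceptable depends on the application.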

Step 3: NCCL Communicator Creation

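Create the NCCL communicator using a unique ID. Rank 0 generates the ID, MPI broadcasts it to every rank, and each rank then calls ncclCommInitRank with the same ID: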

ncclUniqueId id;

// Rank 0 generates the unique ID
if (rank == 0)
{
    NCCL_CHECK(ncclGetUniqueId(&id));
}

// Broadcast the ID to all ranks
MPI_CHECK(MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD));

// Initialize NCCL communicator
ncclComm_t comm;
NCCL_CHECK(ncclCommInitRank(&comm, nranks, id, rank));
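The snippets in this guide wrap calls in CUDA_CHECK, NCCL_CHECK, MPI_CHECK, and CUSOLVER_CHECK without defining them. The following is a minimal sketch of one way such macros could be built, assuming each one simply compares the returned status with the library's success value and throws on mismatch. The names checkStatus and CHECK_STATUS are hypothetical; only the success constants listed in the trailing comment come from the respective libraries.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of CUDA, NCCL, MPI, or cuSOLVERMp):
// throws std::runtime_error if `status` differs from the success value.
template <typename Status>
void checkStatus(Status status, Status success, const char* expr, const char* file, int line)
{
    assert(expr != nullptr && file != nullptr);
    if (status != success)
    {
        std::ostringstream msg;
        msg << file << ":" << line << ": '" << expr << "' failed with status "
            << static_cast<long long>(status);
        throw std::runtime_error(msg.str());
    }
}

// Evaluates `call` exactly once and records the call text and source location.
#define CHECK_STATUS(call, success) checkStatus((call), (success), #call, __FILE__, __LINE__)

// With the library headers included, the guide's macros could then be defined as:
//   #define CUDA_CHECK(call)     CHECK_STATUS(call, cudaSuccess)
//   #define NCCL_CHECK(call)     CHECK_STATUS(call, ncclSuccess)
//   #define MPI_CHECK(call)      CHECK_STATUS(call, MPI_SUCCESS)
//   #define CUSOLVER_CHECK(call) CHECK_STATUS(call, CUSOLVER_STATUS_SUCCESS)
```

Throwing an exception is only one choice of failure policy. In a multi-process job it may be preferable to print the message and call MPI_Abort so that the remaining ranks do not hang in a collective.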

Step 4: CUDA Stream Creation

Create a CUDA stream for cuSOLVERMp operations:

cudaStream_t stream = nullptr;
CUDA_CHECK(cudaStreamCreate(&stream));

Step 5: cuSOLVERMp Handle Creation

Create the cuSOLVERMp handle, passing the device selected in Step 2 and the stream created in Step 4:

cusolverMpHandle_t handle = nullptr;
CUSOLVER_CHECK(cusolverMpCreate(&handle, local_device, stream));

Step 6: Creating cuSOLVERMp Process Grids

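Create one or more process grids for cuSOLVERMp. In the call below, nprow and npcol are the numbers of process rows and process columns in the grid; they are chosen by the application and are not defined in this snippet: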

cusolverMpGrid_t grid = nullptr;

CUSOLVER_CHECK(cusolverMpCreateDeviceGrid(handle, &grid, comm, nprow, npcol, CUSOLVERMP_GRID_MAPPING_COL_MAJOR));
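The grid dimensions passed to the call above can be derived from the number of ranks. The sketch below assumes the grid is meant to cover all nranks processes (nprow * npcol == nranks) and picks the most nearly square factorization. It also shows the conventional meaning of a column-major mapping, in which consecutive ranks run down a column first; that this matches the library's own mapping exactly is an assumption. The names GridShape, GridCoords, chooseGridShape, and colMajorCoords are hypothetical.

```cpp
#include <cassert>

struct GridShape
{
    int nprow;  // number of process rows
    int npcol;  // number of process columns
};

// Hypothetical helper: factors nranks as nprow * npcol with nprow <= npcol
// and nprow as large as possible (most nearly square grid).
inline GridShape chooseGridShape(int nranks)
{
    assert(nranks > 0);
    int nprow = 1;
    for (int d = 1; d * d <= nranks; ++d)
    {
        if (nranks % d == 0)
        {
            nprow = d;  // largest divisor not exceeding sqrt(nranks) so far
        }
    }
    return GridShape{nprow, nranks / nprow};
}

struct GridCoords
{
    int row;
    int col;
};

// Hypothetical helper: conventional column-major placement, where ranks
// 0..nprow-1 fill the first column, the next nprow ranks fill the second, and so on.
inline GridCoords colMajorCoords(int rank, GridShape shape)
{
    assert(rank >= 0 && rank < shape.nprow * shape.npcol);
    return GridCoords{rank % shape.nprow, rank / shape.nprow};
}
```

For example, six ranks give a 2 x 3 grid, and a prime number of ranks degenerates to a single row. A near-square grid is a common default for dense linear algebra, but the best shape depends on the routine and the matrix dimensions.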

Step 7: cuSOLVERMp API Calls

With the handle and grid in place, call the cuSOLVERMp functions that perform the required linear algebra operations.

Synchronization

To wait for operations submitted on the stream to complete, use standard CUDA stream synchronization:

CUDA_CHECK(cudaStreamSynchronize(stream));

Cleanup

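NCCL-based applications must release their resources explicitly. Destroy the cuSOLVERMp objects first, then the NCCL communicator, then the CUDA stream, and finalize MPI last: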

// Destroy cuSOLVERMp objects first
CUSOLVER_CHECK(cusolverMpDestroyGrid(grid));
CUSOLVER_CHECK(cusolverMpDestroy(handle));

// Destroy NCCL communicator
NCCL_CHECK(ncclCommDestroy(comm));

// Clean up CUDA resources
CUDA_CHECK(cudaStreamDestroy(stream));

// Wait for all ranks to finish cleanup, then finalize MPI
MPI_CHECK(MPI_Barrier(MPI_COMM_WORLD));
MPI_CHECK(MPI_Finalize());
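When initialization can fail partway through, it is easy to skip part of this cleanup sequence. One way to keep the teardown order fixed is to register each cleanup step as a callback and run the callbacks last-in, first-out. The following is a minimal sketch of that idea; CleanupStack is a hypothetical class, not part of any of the libraries used here.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: runs registered cleanup steps in reverse order of
// registration, either when run() is called or when the object is destroyed.
class CleanupStack
{
public:
    CleanupStack() = default;
    CleanupStack(const CleanupStack&) = delete;
    CleanupStack& operator=(const CleanupStack&) = delete;

    ~CleanupStack() { run(); }

    // Registers a step; the most recently pushed step runs first.
    void push(std::function<void()> step)
    {
        assert(step);  // reject empty callbacks
        steps_.push_back(std::move(step));
    }

    // Runs and removes all pending steps, last-in first-out.
    void run()
    {
        while (!steps_.empty())
        {
            std::function<void()> step = std::move(steps_.back());
            steps_.pop_back();
            step();
        }
    }

private:
    std::vector<std::function<void()>> steps_;
};
```

To reproduce the order used in the listing above, the steps would be pushed in the opposite order: MPI finalization first, then stream destruction, communicator destruction, handle destruction, and grid destruction last. Note that the callbacks here are not expected to throw, since run() may be invoked from the destructor.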