User Buffer Registration

User Buffer Registration is a feature that allows NCCL to send, receive, and operate on data directly in the user buffer, without an extra internal copy (zero-copy). It can accelerate collectives and greatly reduce resource usage (e.g., the number of channels used). NCCL provides two ways to register user buffers: CUDA Graph registration and local registration. For every NCCL communication call (e.g., allreduce, sendrecv, and so on), if any rank in a communicator passes registered buffers to the call, all other ranks in the same communicator must pass registered buffers as well; mixing registered and non-registered buffers can result in undefined behavior.

IB Sharp Buffer Registration

NCCL 2.21.x supports IB Sharp buffer registration. Any NCCL collective that supports the IB Sharp algorithm, such as allreduce, reducescatter, and allgather, can benefit from this feature. Currently, NCCL supports IB Sharp buffer registration only for communicators with 1 rank per node, and the registration can reduce NCCL's SM usage down to 1.

To enable IB Sharp buffer registration by CUDA graph:

  • Allocate the send and recv buffers with any CUDA allocator (e.g., cudaMalloc/ncclMemAlloc)
  • Launch NCCL collectives with CUDA graph
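The two steps above can be sketched as follows. This is a minimal illustration, assuming an already initialized communicator `comm` and stream `stream`; the buffer size is arbitrary and error checking is omitted for brevity.

```cuda
// Sketch: enable buffer registration via CUDA graph capture.
// Assumes an initialized ncclComm_t comm and cudaStream_t stream.
float *sendbuff, *recvbuff;
size_t count = 1 << 20;
cudaMalloc(&sendbuff, count * sizeof(float));   // any CUDA allocator works
cudaMalloc(&recvbuff, count * sizeof(float));

cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);

// The buffers are registered as part of graph instantiation/launch and
// stay registered for the lifetime of the graph.
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```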

To enable IB Sharp buffer registration by local registration:

  • Allocate the send and recv buffers with any CUDA allocator (e.g., cudaMalloc/ncclMemAlloc)
  • Register the send and recv buffers for each rank in the communicator with ncclCommRegister
  • Launch NCCL collectives
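A minimal sketch of the local registration path, again assuming an initialized `comm` and `stream`. Note that every rank in the communicator must register its buffers, since mixing registered and non-registered buffers is undefined behavior.

```cuda
// Sketch: local registration with ncclCommRegister/ncclCommDeregister.
float *sendbuff, *recvbuff;
size_t count = 1 << 20;
cudaMalloc(&sendbuff, count * sizeof(float));
cudaMalloc(&recvbuff, count * sizeof(float));

// Register both buffers with the communicator; the returned handles
// are needed later to deregister.
void *sendHandle, *recvHandle;
ncclCommRegister(comm, sendbuff, count * sizeof(float), &sendHandle);
ncclCommRegister(comm, recvbuff, count * sizeof(float), &recvHandle);

ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
cudaStreamSynchronize(stream);

// Deregister before freeing the buffers.
ncclCommDeregister(comm, sendHandle);
ncclCommDeregister(comm, recvHandle);
cudaFree(sendbuff);
cudaFree(recvbuff);
```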

General Buffer Registration

Since 2.23.x, NCCL supports intra-node buffer registration, which targets all peer-to-peer intra-node communication and brings fewer memory accesses, lower SM usage, and better performance. Intra-node buffer registration for NCCL collectives and sendrecv can be enabled either by registering buffers with ncclCommRegister up front or by using CUDA graphs. Registered buffers can be allocated through the legacy CUDA API (e.g., cudaMalloc) as well as the VMM API (e.g., cuMem* or ncclMemAlloc). However, VMM-allocated buffers are highly recommended, since they are safer than legacy buffers during failure and abort.
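As a sketch of the recommended VMM path, the following uses ncclMemAlloc-backed buffers with local registration for an allgather. `comm`, `stream`, and `nranks` (the communicator size) are assumed to exist; sizes are illustrative.

```cuda
// Sketch: intra-node registration with VMM-backed buffers (recommended).
float *sendbuff, *recvbuff;
size_t count = 1 << 20;
ncclMemAlloc((void**)&sendbuff, count * sizeof(float));
ncclMemAlloc((void**)&recvbuff, nranks * count * sizeof(float));

void *sendHandle, *recvHandle;
ncclCommRegister(comm, sendbuff, count * sizeof(float), &sendHandle);
ncclCommRegister(comm, recvbuff, nranks * count * sizeof(float), &recvHandle);

ncclAllGather(sendbuff, recvbuff, count, ncclFloat, comm, stream);
cudaStreamSynchronize(stream);

ncclCommDeregister(comm, sendHandle);
ncclCommDeregister(comm, recvHandle);
ncclMemFree(sendbuff);
ncclMemFree(recvbuff);
```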

Memory Allocator

For convenience, NCCL provides the ncclMemAlloc function to help users allocate buffers through the VMM API, which can later be registered with NCCL. It is designed solely for NCCL, so it is not recommended to use ncclMemAlloc-allocated buffers everywhere in an application. For advanced users who want to create their own memory allocator for NVLS buffer registration, the allocator needs to satisfy the following requirements:

  • Allocate the buffer with the shareable flag CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, and also CU_MEM_HANDLE_TYPE_FABRIC on GPUs where it is supported.
  • The buffer size is a multiple of the recommended multicast granularity (i.e., cuMulticastGetGranularity(…, CU_MULTICAST_GRANULARITY_RECOMMENDED)).
  • The buffer head address is aligned to at least the minimum multicast granularity (i.e., cuMulticastGetGranularity(…, CU_MULTICAST_GRANULARITY_MINIMUM)).
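The requirements above can be sketched with the CUDA driver API as follows. This is not NCCL's internal allocator; `nranks`, `dev`, and `size` are assumed inputs, and the handle-export, mapping-cleanup, and CU_MEM_HANDLE_TYPE_FABRIC details are omitted.

```cuda
// Sketch: sizing and aligning a custom VMM allocation for NVLS registration.
CUmulticastObjectProp mcProp = {0};
mcProp.numDevices = nranks;                               // assumption
mcProp.size = size;
mcProp.handleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

size_t recGran, minGran;
cuMulticastGetGranularity(&recGran, &mcProp, CU_MULTICAST_GRANULARITY_RECOMMENDED);
cuMulticastGetGranularity(&minGran, &mcProp, CU_MULTICAST_GRANULARITY_MINIMUM);

// Requirement 2: round the size up to the recommended granularity.
size_t allocSize = ((size + recGran - 1) / recGran) * recGran;

CUmemAllocationProp prop = {0};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
prop.location.id = dev;                                   // assumption
// Requirement 1: shareable handle type.
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;

CUmemGenericAllocationHandle handle;
CUdeviceptr ptr;
cuMemCreate(&handle, allocSize, &prop, 0);
// Requirement 3: reserving at minGran alignment keeps the head address
// aligned to at least the minimum multicast granularity.
cuMemAddressReserve(&ptr, allocSize, minGran, 0, 0);
cuMemMap(ptr, allocSize, 0, handle, 0);

CUmemAccessDesc access = {0};
access.location = prop.location;
access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
cuMemSetAccess(ptr, allocSize, &access, 1);
```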