User Buffer Registration¶

User Buffer Registration is a feature that allows NCCL to directly send/receive/operate data through the user buffer without extra internal copy. It can accelerate collectives and reduce the resource usage (e.g. #channel usage).

NCCL 2.19.x now supports user buffer registration for NVLink Switch; any NCCL collectives that support NVLS algorithm can utilize this feature. To enable the CUDA Graph based buffer registration, users have to comply with several requirements:

The buffer is allocated through ncclMemAlloc().

Offset to the head address of the buffer is same for each rank in collectives.

The NCCL operation is launched on a stream captured by a CUDA graph.

Registered buffers will be deregistered when cuda graph is destroyed.

On the other hand, to enable the Local based buffer registration, users have to comply with the following requirements:

The buffer is allocated through ncclMemAlloc().

Register buffer with ncclCommRegister() before calling collectives.

Call NCCL collectives as usual but similarly keep the offset to the head address of the buffer same for each rank.

Registered buffers will be deregistered when users explicitly call ncclCommDeregister().