.. _group-calls:

***********
Group Calls
***********

Group functions (ncclGroupStart/ncclGroupEnd) can be used to merge multiple calls into one. This is needed for three purposes: managing multiple GPUs from one thread (to avoid deadlocks), aggregating communication operations to improve performance, and merging multiple send/receive point-to-point operations (see the :ref:`point-to-point` section). All three usages can be combined, with one exception: calls to :c:func:`ncclCommInitRank` cannot be merged with others.

Management Of Multiple GPUs From One Thread
-------------------------------------------

When a single thread is managing multiple devices, group semantics must be used. This is because every NCCL call may have to block, waiting for other threads/ranks to arrive, before effectively posting the NCCL operation on the given stream. Hence, a simple loop over multiple devices like the one shown below could block on the first call, waiting for the other ones:

.. warning::
  We do not recommend using CUDA graph capture when managing multiple GPUs from one thread. In some cases ``cudaGraphLaunch`` may block, preventing the launch across all GPUs. See :ref:`using-nccl-with-cuda-graphs` for details.

.. code:: C

  for (int i=0; i