**************************************
Point To Point Communication Functions
**************************************

NCCL provides two types of point-to-point communication primitives: two-sided operations and one-sided operations.

Two-Sided Point-to-Point Operations
====================================

(Since NCCL 2.7) Two-sided point-to-point communication primitives need to be used when ranks need to send and
receive arbitrary data from each other, which cannot be expressed as a broadcast or allgather, i.e.
when all data sent and received is different. Both sender and receiver must explicitly participate.

ncclSend
--------

.. c:function:: ncclResult_t ncclSend(const void* sendbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, cudaStream_t stream)

 Send data from ``sendbuff`` to rank ``peer``.

 Rank ``peer`` needs to call ncclRecv with the same ``datatype`` and the same ``count`` as this rank.

 This operation is blocking for the GPU. If multiple :c:func:`ncclSend` and :c:func:`ncclRecv` operations
 need to progress concurrently to complete, they must be fused within a :c:func:`ncclGroupStart`/
 :c:func:`ncclGroupEnd` section.

Related links: :ref:`point-to-point`.

ncclRecv
--------

.. c:function:: ncclResult_t ncclRecv(void* recvbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, cudaStream_t stream)

 Receive data from rank ``peer`` into ``recvbuff``.

 Rank ``peer`` needs to call ncclSend with the same ``datatype`` and the same ``count`` as this rank.

 This operation is blocking for the GPU. If multiple :c:func:`ncclSend` and :c:func:`ncclRecv` operations
 need to progress concurrently to complete, they must be fused within a :c:func:`ncclGroupStart`/
 :c:func:`ncclGroupEnd` section.

Related links: :ref:`point-to-point`.

One-Sided Point-to-Point Operations (RMA)
==========================================

One-sided Remote Memory Access (RMA) operations enable ranks to directly access remote memory without
explicit participation from the target process. These operations require the target memory to be
pre-registered within a symmetric memory window using :c:func:`ncclCommWindowRegister`.

ncclPutSignal
-------------

.. c:function:: ncclResult_t ncclPutSignal(const void* localbuff, size_t count, ncclDataType_t datatype, int peer, ncclWindow_t peerWin, size_t peerWinOffset, int sigIdx, int ctx, unsigned int flags, ncclComm_t comm, cudaStream_t stream)

 Write data from ``localbuff`` to rank ``peer``'s registered memory window ``peerWin`` at offset ``peerWinOffset``
 and subsequently updating a remote signal.

 The target memory window ``peerWin`` must be registered using :c:func:`ncclCommWindowRegister`.

 The ``sigIdx`` is the signal index identifier for the operation. It must be set to 0 for now.

 The ``ctx`` is the context identifier for the operation. It must be set to 0 for now.

 The ``flags`` parameter is reserved for future use. It must be set to 0 for now.

 The return of :c:func:`ncclPutSignal` to the CPU thread indicates that the operation has been successfully enqueued to the CUDA stream.
 At the completion of :c:func:`ncclPutSignal` on the CUDA stream, the ``localbuff`` is safe to reuse or modify.
 When a signal is updated on the remote peer, it guarantees that the data from the corresponding :c:func:`ncclPutSignal` operation has been delivered to the remote memory.
 All prior :c:func:`ncclPutSignal` and :c:func:`ncclSignal` operations to the same peer and context have also completed their signal updates.

Related links: :ref:`point-to-point`.

ncclSignal
----------

.. c:function:: ncclResult_t ncclSignal(int peer, int sigIdx, int ctx, unsigned int flags, ncclComm_t comm, cudaStream_t stream)

 Send a signal to rank ``peer`` without transferring data.

 The ``sigIdx`` is the signal index identifier for the operation. It must be set to 0 for now.

 The ``ctx`` is the context identifier for the operation. It must be set to 0 for now.

 The ``flags`` parameter is reserved for future use. It must be set to 0 for now.

 When a signal is updated on the remote peer, all prior :c:func:`ncclPutSignal` and :c:func:`ncclSignal` operations
 to the same peer and context have also completed their signal updates.

Related links: :ref:`point-to-point`.

ncclWaitSignal
--------------

.. c:type:: ncclWaitSignalDesc_t

 Descriptor that specifies how many signal operations to wait for
 from a particular rank on a given signal index and context.

 .. c:member:: int opCnt

  Number of signal operations to wait for.

 .. c:member:: int peer

  Target peer to wait for signals from.

 .. c:member:: int sigIdx

  Signal index identifier. Must be set to 0 for now.

 .. c:member:: int ctx

  Context identifier. Must be set to 0 for now.

.. c:function:: ncclResult_t ncclWaitSignal(int nDesc, ncclWaitSignalDesc_t* signalDescs, ncclComm_t comm, cudaStream_t stream)

 Wait for signals as described in the signal descriptor array.

 The ``nDesc`` parameter specifies the number of signal descriptors in the ``signalDescs`` array.
 Each descriptor indicates how many signals (``opCnt``) to expect from a specific ``peer``
 on a particular signal index (``sigIdx``) and context (``ctx``).

 The return of :c:func:`ncclWaitSignal` to the CPU thread indicates that the operation has been successfully enqueued to the CUDA stream.
 At the completion of :c:func:`ncclWaitSignal` on the CUDA stream, all specified signal operations have been received and the corresponding data is visible in local memory.

Related links: :ref:`point-to-point`.