.. Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
   NVIDIA CORPORATION and its licensors retain all intellectual property
   and proprietary rights in and to this software, related documentation
   and any modifications thereto. Any use, reproduction, disclosure or
   distribution of this software and related documentation without an
   express license agreement from NVIDIA CORPORATION is strictly prohibited.

GXF Stream Sync
###############

GXF Stream Sync provides synchronization between two CUDA codelets without involving a CPU wait. The first CUDA codelet, which generates the data or launches the CUDA kernel, is called the signaler. The second CUDA codelet, which waits for the data or for the CUDA work submitted by the upstream codelet, is called the waiter. Signaling and waiting are based on a single synchronization object that is shared by the signaler and the waiter. A CUDA stream is associated with both the signaler and the waiter, and the synchronization object provides APIs for the signaling and waiting mechanisms.

Signaler
********

After submitting all of its work on a specific CUDA stream, the signaler codelet calls the signalSemaphore API of the synchronization object. Internally, GXF Stream Sync uses a fence to track the completion of the tasks submitted on the CUDA stream. Signaling happens asynchronously on the GPU, and the signalSemaphore API returns immediately. signalSemaphore uses the same CUDA stream on which the work was submitted. The signaler is also responsible for allocating the synchronization object and passing it to the waiter as part of the message entity.

Waiter
******

The waiter codelet calls waitSemaphore and then submits its own work, either on the same CUDA stream on which the signaler codelet submitted its work or on a different CUDA stream. GXF Stream Sync waits until the fence is signaled, which guarantees that the work submitted by the signaler codelet is complete. Waiting happens asynchronously on the GPU, and the waitSemaphore API returns immediately.

The figure below depicts the signaler and waiter concept.

.. image:: /content/cuda_cuda_stream_sync.svg
   :align: center
   :alt: GXF Stream sync

**Figure: Synchronization across two CUDA codelets**

.. _gxfStreamSyncExtension:

GxfStreamExtension
******************

Extension for synchronization across two CUDA modules without a CPU wait.

* UUID: :code:`918e6ad7-8e1a-43aa-9b49-251d4b6072b0`
* Version: :code:`0.0.1`
* Author: :code:`NVIDIA`
* License: :code:`LICENSE`

Components
==========

nvidia::gxf::GxfStreamSync
--------------------------

Component which helps achieve synchronization across two CUDA codelets without involving a CPU wait. It holds a synchronization object that can be used by the signaler and the waiter.

* Component ID: 0011bee7-5d53-43ee-aafa-61485a436bc4
* Base Type: nvidia::gxf::Component
* Defined in: gxf/stream/stream_nvscisync.hpp

Parameters
^^^^^^^^^^

**signaler**

Parameter indicating the type of signaler.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

|

**waiter**

Parameter indicating the type of waiter.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

|

**signaler_device_id**

Device id on which the signaler is running.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

|

**waiter_device_id**

Device id on which the waiter is running.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

GXF Stream Sync Workflow
************************

CUDA-to-CUDA codelet communication happens with the help of a message.

At the Signaler codelet
=======================

* Add the StreamSync handle to the message.
* Get the StreamSync handle.
* Initialize the StreamSync object.
* Allocate the sync object based on the signaler and waiter types.
* Set the CUDA stream for the signaler and the waiter.
* Submit the work of the signaler codelet on the CUDA stream.
* Signal the semaphore (asynchronous call).
* Publish the message.
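These steps are illustrated by the minimal C++ sketch below. It is a sketch only: the ``nvidia::gxf::StreamSync`` type name follows the YAML example later in this section, the configuration and signaling calls (``allocateSyncObject``, ``setCudaStream``, ``signalSemaphore``) are written with assumed names and signatures derived from the workflow above, and ``launchMyKernel`` is a hypothetical helper. Consult ``gxf/stream/stream_nvscisync.hpp`` for the actual API. For simplicity the sync object is created directly in the message entity, whereas the YAML example shares a single StreamSync component defined in a global entity.

.. code-block:: cpp

   // Minimal signaler codelet sketch (parameter registration omitted).
   #include <cuda_runtime.h>

   #include "gxf/core/entity.hpp"
   #include "gxf/std/codelet.hpp"
   #include "gxf/std/transmitter.hpp"
   #include "gxf/stream/stream_nvscisync.hpp"

   class MySignalerCodelet : public nvidia::gxf::Codelet {
    public:
     gxf_result_t tick() override {
       // Create the outgoing message entity.
       auto message = nvidia::gxf::Entity::New(context());
       if (!message) { return message.error(); }

       // Add a StreamSync handle to the message so that the waiter can find it.
       auto sync = message.value().add<nvidia::gxf::StreamSync>("stream_sync");
       if (!sync) { return sync.error(); }

       // Allocate the sync object and bind the signaler's CUDA stream.
       // NOTE: assumed method names/signatures; see stream_nvscisync.hpp.
       sync.value()->allocateSyncObject();
       sync.value()->setCudaStream(stream_);

       // Submit the signaler's work on the CUDA stream.
       launchMyKernel(stream_);  // hypothetical kernel launch helper

       // Signal the semaphore: the call returns immediately; the fence is
       // signaled on the GPU once all work queued on stream_ has completed.
       sync.value()->signalSemaphore();

       // Publish the message without any CPU wait.
       auto result = cuda_tx_.get()->publish(message.value());
       return result ? GXF_SUCCESS : result.error();
     }

    private:
     nvidia::gxf::Parameter<nvidia::gxf::Handle<nvidia::gxf::Transmitter>> cuda_tx_;
     cudaStream_t stream_ = 0;
   };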
At the Waiter Codelet
=====================

* Receive the message.
* Find the StreamSync handle in the message.
* Wait on the semaphore (asynchronous call).
* Submit the work of the waiter codelet on the CUDA stream.
* The wait now happens asynchronously on the GPU.

Example
=======

The example below describes how to make use of GXF Stream Sync in an application.

Yaml file
^^^^^^^^^

.. code-block:: yaml
   :linenos:

   ---
   name: global
   components:
   - name: cuda_dot_pool
     type: nvidia::gxf::BlockMemoryPool
     parameters:
       storage_type: 1 # cuda
       block_size: 16384
       num_blocks: 10
   - name: stream_sync_cuda_to_cuda
     type: nvidia::gxf::StreamSync
     parameters:
       signaler: 1 # Cuda signaler
       waiter: 3 # Cuda waiter
   ---
   name: stream_tensor_generator
   components:
   - name: cuda_out
     type: nvidia::gxf::DoubleBufferTransmitter
   - name: generator
     type: nvidia::gxf::stream::test::StreamTensorGeneratorNew
     parameters:
       cuda_tx: cuda_out
       cuda_tensor_pool: global/cuda_pool
       stream_sync: global/stream_sync_cuda_to_cuda
   - type: nvidia::gxf::DownstreamReceptiveSchedulingTerm
     parameters:
       transmitter: cuda_out
       min_size: 1
   - type: nvidia::gxf::CountSchedulingTerm
     parameters:
       count: 50
   ---
   components:
   - type: nvidia::gxf::Connection
     parameters:
       source: stream_tensor_generator/cuda_out
       target: cuda_dotproduct/rx
   ---
   name: cuda_dotproduct
   components:
   - name: rx
     type: nvidia::gxf::DoubleBufferReceiver
     parameters:
       capacity: 2
   - name: tx
     type: nvidia::gxf::DoubleBufferTransmitter
   - type: nvidia::gxf::MessageAvailableSchedulingTerm
     parameters:
       receiver: rx
       min_size: 1
   - type: nvidia::gxf::DownstreamReceptiveSchedulingTerm
     parameters:
       transmitter: tx
       min_size: 1
   - type: nvidia::gxf::stream::test::CublasDotProductNew
     parameters:
       rx: rx
       tx: tx
       tensor_pool: global/cuda_dot_pool
   ---
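Waiter codelet sketch
^^^^^^^^^^^^^^^^^^^^^

For reference, a minimal sketch of a waiter codelet corresponding to the ``cuda_dotproduct`` entity above is shown here. It is illustrative only: ``waitSemaphore`` is called with an assumed signature, ``launchDotProductKernel`` is a hypothetical helper, and the actual waiter used in the example is ``nvidia::gxf::stream::test::CublasDotProductNew``. Consult ``gxf/stream/stream_nvscisync.hpp`` for the exact API.

.. code-block:: cpp

   // Minimal waiter codelet sketch (parameter registration omitted).
   #include <cuda_runtime.h>

   #include "gxf/core/entity.hpp"
   #include "gxf/std/codelet.hpp"
   #include "gxf/std/receiver.hpp"
   #include "gxf/stream/stream_nvscisync.hpp"

   class MyWaiterCodelet : public nvidia::gxf::Codelet {
    public:
     gxf_result_t tick() override {
       // Receive the message published by the signaler.
       auto message = rx_.get()->receive();
       if (!message) { return message.error(); }

       // Find the StreamSync handle carried by the message.
       auto sync = message.value().get<nvidia::gxf::StreamSync>();
       if (!sync) { return sync.error(); }

       // Enqueue the wait: the call returns immediately, and the GPU holds
       // back work on the waiter's stream until the signaler's fence fires.
       // NOTE: assumed signature; see stream_nvscisync.hpp.
       sync.value()->waitSemaphore();

       // Submit the waiter's work; it starts executing only after the work
       // submitted by the signaler codelet has completed.
       launchDotProductKernel(stream_);  // hypothetical kernel launch helper

       return GXF_SUCCESS;
     }

    private:
     nvidia::gxf::Parameter<nvidia::gxf::Handle<nvidia::gxf::Receiver>> rx_;
     cudaStream_t stream_ = 0;
   };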