GXF Stream Sync#

GXF Stream Sync synchronizes two CUDA codelets without involving a CPU wait. The first CUDA codelet, which generates the data or launches the CUDA kernel, is called the signaler. The second CUDA codelet, which waits for that data or for the CUDA work submitted by the upstream codelet, is called the waiter. Signaling and waiting both go through a single synchronization object shared by the signaler and the waiter, and that object provides the APIs for both mechanisms. A CUDA stream is associated with the signaler and with the waiter.

Signaler#

After submitting all of its work on a specific CUDA stream, the signaler codelet calls the signalSemaphore API of the synchronization object. Internally, GXF Stream Sync uses a fence to track completion of the tasks submitted on that CUDA stream. Signaling happens asynchronously on the GPU, and signalSemaphore returns immediately. signalSemaphore uses the same CUDA stream on which the work was submitted. The signaler is also responsible for allocating the synchronization object and passing it to the waiter in a message entity.

Waiter#

The waiter codelet calls waitSemaphore and then submits its own work, either to the same CUDA stream that the signaler codelet used or to a different CUDA stream. GXF Stream Sync waits until the fence is signaled, which ensures that the work submitted by the signaler codelet is complete. Waiting happens asynchronously on the GPU, and waitSemaphore returns immediately.

The figure below depicts the signaler and waiter concept.

Figure: Synchronization across two CUDA codelets
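The signal and wait ordering described above can be illustrated with a host-only model. This is a sketch under stated assumptions, not GXF or CUDA code: `ModelStream` stands in for a CUDA stream as an in-order queue served by a worker thread, and `ModelSyncObject` stands in for the synchronization object using a `std::shared_future` as the fence. Both classes, `runModel`, and the method signatures are hypothetical; only the names signalSemaphore and waitSemaphore are taken from the text.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA stream: tasks run in submission order on a
// dedicated worker thread, and enqueue() never blocks the caller.
class ModelStream {
 public:
  ModelStream() : worker_([this] { run(); }) {}
  ~ModelStream() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closing_ = true;
    }
    cv_.notify_one();
    worker_.join();  // the worker drains any remaining tasks before exiting
  }
  void enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return closing_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // closing and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      task();
    }
  }
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> tasks_;
  bool closing_ = false;
  std::thread worker_;  // declared last so the other members exist before it starts
};

// Hypothetical single-shot synchronization object; the shared future is the fence.
class ModelSyncObject {
 public:
  ModelSyncObject() : fence_(signal_.get_future().share()) {}
  // Queue the fence signal behind the work already submitted on `stream`.
  void signalSemaphore(ModelStream& stream) {
    stream.enqueue([this] { signal_.set_value(); });
  }
  // Queue a wait on `stream`; tasks queued after it run only once the fence is signaled.
  void waitSemaphore(ModelStream& stream) {
    std::shared_future<void> fence = fence_;
    stream.enqueue([fence] { fence.wait(); });
  }

 private:
  std::promise<void> signal_;
  std::shared_future<void> fence_;
};

std::vector<std::string> runModel() {
  std::vector<std::string> log;
  std::mutex log_mutex;
  auto record = [&log, &log_mutex](const std::string& entry) {
    std::lock_guard<std::mutex> lock(log_mutex);
    log.push_back(entry);
  };
  ModelSyncObject sync;  // must outlive both streams
  {
    ModelStream signaler_stream;
    ModelStream waiter_stream;
    // Signaler: submit work, then signal. Neither call blocks the host thread.
    signaler_stream.enqueue([&record] {
      std::this_thread::sleep_for(std::chrono::milliseconds(20));
      record("signaler work done");
    });
    sync.signalSemaphore(signaler_stream);
    // Waiter: wait, then submit work on a different stream.
    sync.waitSemaphore(waiter_stream);
    waiter_stream.enqueue([&record] { record("waiter work started"); });
  }  // the stream destructors drain both queues
  return log;
}
```

In this model, calling `runModel()` returns the signaler entry followed by the waiter entry no matter how long the signaler work takes, because the waiter's queue is held at the fence rather than on the calling thread.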

GxfStreamExtension#

Extension for synchronization across two CUDA modules without a CPU wait.

UUID: 918e6ad7-8e1a-43aa-9b49-251d4b6072b0
Version: 0.5.0
Author: NVIDIA
License: LICENSE

Components#

nvidia::gxf::GxfStreamSync#

Component which helps to achieve synchronization across two CUDA codelets without involving CPU wait. Holds a synchronization object that can be used by the signaler and the waiter.

Component ID: 0011bee7-5d53-43ee-aafa-61485a436bc4
Base Type: nvidia::gxf::Component
Defined in: gxf/stream/stream_nvscisync.hpp

Parameters#

signaler

Parameter indicating the type of signaler.

Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_INT32

waiter

Parameter indicating the type of waiter.

Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_INT32

signaler_device_id

Device id on which signaler is running.

Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_INT32

waiter_device_id

Device id on which waiter is running.

Flags: GXF_PARAMETER_FLAGS_NONE
Type: GXF_PARAMETER_TYPE_INT32

GXF Stream Sync Workflow#

CUDA-to-CUDA codelet communication happens through a message.

At the Signaler codelet#

Add the StreamSync handle to the message.
Get the StreamSync handle.
Initialize StreamSync.
Allocate the sync object based on the signaler and waiter.
Set the CUDA stream for the signaler and the waiter.
Submit the signaler codelet's work on the CUDA stream.
Signal the semaphore (asynchronous call).
Publish the message.

At the Waiter codelet#

Receive the message.
Find the StreamSync handle.
Wait on the semaphore (asynchronous call).
Submit the waiter codelet's work on the CUDA stream.
The wait then happens asynchronously on the GPU.
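The signaler and waiter step lists above can be sketched as a pair of functions that hand a sync handle across in a message. This is a minimal mock meant only to show the call order, assuming a hypothetical `MockStreamSync` class, a hypothetical `Message` struct, and invented method names (`initialize`, `allocateSyncObject`, `setCudaStreams`); only signalSemaphore and waitSemaphore are named in the text, and their real signatures may differ. The values 1 and 3 mirror the signaler and waiter settings used in the YAML example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical mock of the stream sync component. It does no GPU work; it only
// checks that the steps are called in the documented order and records them.
class MockStreamSync {
 public:
  bool initialize() { return advance(0, "initialize"); }
  bool allocateSyncObject(int signaler, int waiter) {
    signaler_ = signaler;
    waiter_ = waiter;
    return advance(1, "allocate sync object");
  }
  bool setCudaStreams() { return advance(2, "set CUDA streams"); }
  bool signalSemaphore() { return advance(3, "signal semaphore"); }
  bool waitSemaphore() { return advance(4, "wait semaphore"); }
  // Records a step that belongs to the codelet rather than to the sync object.
  void note(const std::string& step) { trace_.push_back(step); }
  const std::vector<std::string>& trace() const { return trace_; }

 private:
  // Accept a step only if it is the next one in the sequence.
  bool advance(int expected_step, const std::string& name) {
    if (next_step_ != expected_step) return false;
    ++next_step_;
    trace_.push_back(name);
    return true;
  }
  int next_step_ = 0;
  int signaler_ = 0;
  int waiter_ = 0;
  std::vector<std::string> trace_;
};

// Hypothetical message: carries the sync handle from signaler to waiter.
struct Message {
  std::shared_ptr<MockStreamSync> sync;
};

// Signaler side. Returns a message with a null handle if any step fails.
Message signalerTick() {
  Message message;
  message.sync = std::make_shared<MockStreamSync>();  // add handle to the message
  std::shared_ptr<MockStreamSync> sync = message.sync;  // get the handle
  const bool ready = sync->initialize() &&
                     sync->allocateSyncObject(1, 3) &&  // 1: CUDA signaler, 3: CUDA waiter
                     sync->setCudaStreams();
  if (!ready) return Message{};
  sync->note("submit signaler work");
  if (!sync->signalSemaphore()) return Message{};
  return message;  // publish the message
}

// Waiter side. Takes the received message and follows the waiter steps.
bool waiterTick(const Message& message) {
  std::shared_ptr<MockStreamSync> sync = message.sync;  // find the handle
  if (!sync) return false;
  if (!sync->waitSemaphore()) return false;
  sync->note("submit waiter work");
  return true;
}
```

Passing the result of `signalerTick()` to `waiterTick()` leaves a trace on the shared handle that lists the steps in the order given above, while calling a step out of sequence on the mock (for example `signalSemaphore()` on a fresh object) is rejected.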

Example#

The example below shows how to use GXF Stream Sync in an application.

YAML file#

---
name: global
components:
- name: cuda_dot_pool
  type: nvidia::gxf::BlockMemoryPool
  parameters:
    storage_type: 1 # cuda
    block_size: 16384
    num_blocks: 10
- name: stream_sync_cuda_to_cuda
  type: nvidia::gxf::StreamSync
  parameters:
    signaler: 1 # Cuda signaler
    waiter: 3   # Cuda waiter
---
name: stream_tensor_generator
components:
- name: cuda_out
  type: nvidia::gxf::DoubleBufferTransmitter
- name: generator
  type: nvidia::gxf::stream::test::StreamTensorGeneratorNew
  parameters:
    cuda_tx: cuda_out
    cuda_tensor_pool: global/cuda_pool
    stream_sync: global/stream_sync_cuda_to_cuda
- type: nvidia::gxf::DownstreamReceptiveSchedulingTerm
  parameters:
    transmitter: cuda_out
    min_size: 1
- type: nvidia::gxf::CountSchedulingTerm
  parameters:
    count: 50
---
components:
- type: nvidia::gxf::Connection
  parameters:
    source: stream_tensor_generator/cuda_out
    target: cuda_dotproduct/rx
---
name: cuda_dotproduct
components:
- name: rx
  type: nvidia::gxf::DoubleBufferReceiver
  parameters:
    capacity: 2
- name: tx
  type: nvidia::gxf::DoubleBufferTransmitter
- type: nvidia::gxf::MessageAvailableSchedulingTerm
  parameters:
    receiver: rx
    min_size: 1
- type: nvidia::gxf::DownstreamReceptiveSchedulingTerm
  parameters:
    transmitter: tx
    min_size: 1
- type: nvidia::gxf::stream::test::CublasDotProductNew
  parameters:
    rx: rx
    tx: tx
    tensor_pool: global/cuda_dot_pool
---