.. Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
   NVIDIA CORPORATION and its licensors retain all intellectual property
   and proprietary rights in and to this software, related documentation
   and any modifications thereto. Any use, reproduction, disclosure or
   distribution of this software and related documentation without an
   express license agreement from NVIDIA CORPORATION is strictly prohibited.

GXF Stream Sync
###############

GXF Stream Sync provides synchronization between two CUDA codelets without involving a CPU wait. The first CUDA codelet, which generates the data or launches the CUDA kernel, is called the signaler. The second CUDA codelet, which waits for the data or for the CUDA work submitted by the upstream codelet, is called the waiter. Signaling and waiting are based on a single synchronization object that is shared by the signaler and the waiter. A CUDA stream is associated with both the signaler and the waiter, and the synchronization object provides APIs for the signaling and waiting mechanisms.

Signaler
********

After submitting all of its work on a specific CUDA stream, the signaler codelet calls the signalSemaphore API of the synchronization object. Internally, GXF Stream Sync uses a fence to track the completion of the tasks submitted on the CUDA stream. Signaling happens asynchronously on the GPU, and the signalSemaphore API returns immediately. signalSemaphore uses the same CUDA stream on which the work was submitted. The signaler is also responsible for allocating the synchronization object and passing it to the waiter as part of the message entity.

Waiter
******

The waiter codelet calls waitSemaphore and then submits its own work, either on the same CUDA stream on which the signaler codelet submitted its work or on a different CUDA stream. GXF Stream Sync waits until the fence is signaled, which guarantees that the work submitted by the signaler codelet is complete. Waiting happens asynchronously on the GPU, and the waitSemaphore API returns immediately.

The figure below depicts the signaler and waiter concept.

.. image:: /content/cuda_cuda_stream_sync.svg
   :align: center
   :alt: GXF Stream sync

**Figure: Synchronization across two CUDA codelets**

.. _gxfStreamSyncExtension:

GxfStreamExtension
******************

Extension for synchronization across two CUDA modules without a CPU wait.

* UUID: :code:`918e6ad7-8e1a-43aa-9b49-251d4b6072b0`
* Version: :code:`0.0.1`
* Author: :code:`NVIDIA`
* License: :code:`LICENSE`

Components
==========

nvidia::gxf::GxfStreamSync
--------------------------

Component which helps achieve synchronization across two CUDA codelets without involving a CPU wait. It holds a synchronization object that can be used by the signaler and the waiter.

* Component ID: 0011bee7-5d53-43ee-aafa-61485a436bc4
* Base Type: nvidia::gxf::Component
* Defined in: gxf/stream/stream_nvscisync.hpp

Parameters
^^^^^^^^^^

**signaler**

Parameter indicating the type of signaler.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

|

**waiter**

Parameter indicating the type of waiter.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

|

**signaler_device_id**

Device id on which the signaler is running.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

|

**waiter_device_id**

Device id on which the waiter is running.

* Flags: GXF_PARAMETER_FLAGS_NONE
* Type: GXF_PARAMETER_TYPE_INT32

GXF Stream Sync Workflow
************************

CUDA-to-CUDA codelet communication happens with the help of a message.

At the Signaler codelet
=======================

* Add the StreamSync handle to the message.
* Get the StreamSync handle.
* Initialize the StreamSync object.
* Allocate the sync object based on the signaler and waiter types.
* Set the CUDA stream for the signaler and the waiter.
* Submit the work of the signaler codelet on the CUDA stream.
* Signal the semaphore (asynchronous call).
* Publish the message.
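These steps are illustrated by the minimal C++ sketch below. It is a sketch only: the ``nvidia::gxf::StreamSync`` type name follows the YAML example later in this section, the configuration and signaling calls (``allocateSyncObject``, ``setCudaStream``, ``signalSemaphore``) are written with assumed names and signatures derived from the workflow above, and ``launchMyKernel`` is a hypothetical helper. Consult ``gxf/stream/stream_nvscisync.hpp`` for the actual API. For simplicity the sync object is created directly in the message entity, whereas the YAML example shares a single StreamSync component defined in a global entity.

.. code-block:: cpp

   // Minimal signaler codelet sketch (parameter registration omitted).
   #include <cuda_runtime.h>

   #include "gxf/core/entity.hpp"
   #include "gxf/std/codelet.hpp"
   #include "gxf/std/transmitter.hpp"
   #include "gxf/stream/stream_nvscisync.hpp"

   class MySignalerCodelet : public nvidia::gxf::Codelet {
    public:
     gxf_result_t tick() override {
       // Create the outgoing message entity.
       auto message = nvidia::gxf::Entity::New(context());
       if (!message) { return message.error(); }

       // Add a StreamSync handle to the message so that the waiter can find it.
       auto sync = message.value().add<nvidia::gxf::StreamSync>("stream_sync");
       if (!sync) { return sync.error(); }

       // Allocate the sync object and bind the signaler's CUDA stream.
       // NOTE: assumed method names/signatures; see stream_nvscisync.hpp.
       sync.value()->allocateSyncObject();
       sync.value()->setCudaStream(stream_);

       // Submit the signaler's work on the CUDA stream.
       launchMyKernel(stream_);  // hypothetical kernel launch helper

       // Signal the semaphore: the call returns immediately; the fence is
       // signaled on the GPU once all work queued on stream_ has completed.
       sync.value()->signalSemaphore();

       // Publish the message without any CPU wait.
       auto result = cuda_tx_.get()->publish(message.value());
       return result ? GXF_SUCCESS : result.error();
     }

    private:
     nvidia::gxf::Parameter<nvidia::gxf::Handle<nvidia::gxf::Transmitter>> cuda_tx_;
     cudaStream_t stream_ = 0;
   };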
At the Waiter Codelet
=====================

* Receive the message.
* Find the StreamSync handle in the message.
* Wait on the semaphore (asynchronous call).
* Submit the work of the waiter codelet on the CUDA stream.
* The wait now happens asynchronously on the GPU.

Example
=======

The example below describes how to make use of GXF Stream Sync in an application.

Yaml file
^^^^^^^^^

.. code-block:: yaml
   :linenos:

   ---
   name: global
   components:
   - name: cuda_dot_pool
     type: nvidia::gxf::BlockMemoryPool
     parameters:
       storage_type: 1 # cuda
       block_size: 16384
       num_blocks: 10
   - name: stream_sync_cuda_to_cuda
     type: nvidia::gxf::StreamSync
     parameters:
       signaler: 1 # Cuda signaler
       waiter: 3 # Cuda waiter
   ---
   name: stream_tensor_generator
   components:
   - name: cuda_out
     type: nvidia::gxf::DoubleBufferTransmitter
   - name: generator
     type: nvidia::gxf::stream::test::StreamTensorGeneratorNew
     parameters:
       cuda_tx: cuda_out
       cuda_tensor_pool: global/cuda_pool
       stream_sync: global/stream_sync_cuda_to_cuda
   - type: nvidia::gxf::DownstreamReceptiveSchedulingTerm
     parameters:
       transmitter: cuda_out
       min_size: 1
   - type: nvidia::gxf::CountSchedulingTerm
     parameters:
       count: 50
   ---
   components:
   - type: nvidia::gxf::Connection
     parameters:
       source: stream_tensor_generator/cuda_out
       target: cuda_dotproduct/rx
   ---
   name: cuda_dotproduct
   components:
   - name: rx
     type: nvidia::gxf::DoubleBufferReceiver
     parameters:
       capacity: 2
   - name: tx
     type: nvidia::gxf::DoubleBufferTransmitter
   - type: nvidia::gxf::MessageAvailableSchedulingTerm
     parameters:
       receiver: rx
       min_size: 1
   - type: nvidia::gxf::DownstreamReceptiveSchedulingTerm
     parameters:
       transmitter: tx
       min_size: 1
   - type: nvidia::gxf::stream::test::CublasDotProductNew
     parameters:
       rx: rx
       tx: tx
       tensor_pool: global/cuda_dot_pool
   ---
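Waiter codelet sketch
^^^^^^^^^^^^^^^^^^^^^

For reference, a minimal sketch of a waiter codelet corresponding to the ``cuda_dotproduct`` entity above is shown here. It is illustrative only: ``waitSemaphore`` is called with an assumed signature, ``launchDotProductKernel`` is a hypothetical helper, and the actual waiter used in the example is ``nvidia::gxf::stream::test::CublasDotProductNew``. Consult ``gxf/stream/stream_nvscisync.hpp`` for the exact API.

.. code-block:: cpp

   // Minimal waiter codelet sketch (parameter registration omitted).
   #include <cuda_runtime.h>

   #include "gxf/core/entity.hpp"
   #include "gxf/std/codelet.hpp"
   #include "gxf/std/receiver.hpp"
   #include "gxf/stream/stream_nvscisync.hpp"

   class MyWaiterCodelet : public nvidia::gxf::Codelet {
    public:
     gxf_result_t tick() override {
       // Receive the message published by the signaler.
       auto message = rx_.get()->receive();
       if (!message) { return message.error(); }

       // Find the StreamSync handle carried by the message.
       auto sync = message.value().get<nvidia::gxf::StreamSync>();
       if (!sync) { return sync.error(); }

       // Enqueue the wait: the call returns immediately, and the GPU holds
       // back work on the waiter's stream until the signaler's fence fires.
       // NOTE: assumed signature; see stream_nvscisync.hpp.
       sync.value()->waitSemaphore();

       // Submit the waiter's work; it starts executing only after the work
       // submitted by the signaler codelet has completed.
       launchDotProductKernel(stream_);  // hypothetical kernel launch helper

       return GXF_SUCCESS;
     }

    private:
     nvidia::gxf::Parameter<nvidia::gxf::Handle<nvidia::gxf::Receiver>> rx_;
     cudaStream_t stream_ = 0;
   };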