NVIDIA Tegra
NVIDIA DRIVE OS 5.1 Linux SDK

Developer Guide
5.1.9.0 Release


 
Synchronization
 
Terminology
Synchronization Basics
NvSciSync Module
Inter-Application
NvSciSync Attributes
NvSciSync Attributes List
Inter-Application
Sync Management
NvSciSync Objects
Inter-Application
CPU Wait Contexts
NvSciSyncFence Operations
Inter-Application
Timestamps
UMD Access
NvMedia
NvMedia2D-NvSciSync
CUDA
CUDA APIs
Sample Application
This chapter describes how to set up synchronization objects required by an application or a set of applications, and how to use them to control the order in which operations are performed by NVIDIA hardware. The basic setup process is similar to the process used for allocating buffers described in the previous chapter:
Specify the restrictions imposed by the hardware components that signal the sync objects.
Specify the restrictions imposed by the hardware components that wait on the sync fences.
Gather the information for each set of sync objects into one application, which allocates them.
Share the allocated sync objects with other applications.
Map the sync objects into UMD-specific interfaces.
The streaming process issues commands to update the sync objects and commands to wait for those updates, so that different sets of operations remain synchronized with each other.
Terminology
Agent: An entity in the system that executes instructions. An agent may be a CPU thread or a hardware engine. An agent is the entity that interacts with the actual hardware primitives abstracted by NvSciSync. The goal of NvSciSync is to synchronize agents.
Synchronization Basics
In an NVIDIA hardware system, multiple execution agents typically run simultaneously, for example the CPU, the GPU, and other engines. Task interfaces to engines are asynchronous: the CPU prepares a job and queues it in the interface. Multiple jobs might be pending on an engine, and the hardware schedules them automatically; neither the CPU nor applications know upfront the order in which jobs execute.
This creates a need to synchronize jobs that have dependencies between them, such as a pipeline of several engines working on the same data, preferably without expensive CPU intervention.
NVIDIA hardware supports multiple synchronization mechanisms that solve this problem. These mechanisms are context sensitive, and not every engine supports all of them.
NvSciSync provides an abstraction layer that hides the details of the synchronization primitive used in a concrete situation. One of the most basic concepts of NvSciSync is the sync object. It abstracts a single instance of a specific synchronization primitive. A sync object has a current state and can be signaled; signaling a sync object moves it to the next state. Normally, an application developer associates a sync object with a chain of events that must occur in the same order. For example, a video input engine may always signal the same sync object after producing a camera frame. This way, the sync object can be inspected at any time to check which frames have already been written. A sync object must only be signaled by a single agent. The agent that signals (at the request of an application) is called the signaler.
Another basic concept of NvSciSync is a sync fence. A sync fence is associated with a specific sync object and contains a snapshot of that object's state. A fence is considered expired if its snapshot is behind or equal to the current state of the object. A fence whose state has not yet been reached by the object is said to be pending. Usually, multiple fences are associated with a single sync object and might correspond to different states of that object. A sync fence is generated by the signaler application and shared with others. An application can make an agent wait on a fence. An agent waiting on a sync fence is called a waiter in the context of the given sync object.
NvSciSync Module
To use NvSciSync you must first open an NvSciSyncModule. This module represents the library's instance created for that application and acts as a container for other NvSciSync resources. Typically, there is only a single NvSciSyncModule in an application but having all resources contained in NvSciSyncModule allows multiple threads or other libraries to use NvSciSync in an isolated manner. All other NvSciSync resources are associated with an NvSciSyncModule on creation.
NvSciSyncModule
NvSciSyncModule module = NULL;
NvSciError err;
err = NvSciSyncModuleOpen(&module);
if (err != NvSciError_Success) {
goto fail;
}
/* ... */
NvSciSyncModuleClose(module);
Inter-Application
If multiple processes are involved, all communication of NvSciSync structures should go via NvSciIpc channels. Each application needs to open its own NvSciIpc endpoints.
NvSciIpc init
NvSciIpcEndpoint ipcEndpoint = 0;
NvSciError err;
err = NvSciIpcInit();
if (err != NvSciError_Success) {
goto fail;
}
err = NvSciIpcOpenEndpoint("ipc_endpoint", &ipcEndpoint);
if (err != NvSciError_Success) {
goto fail;
}
/* ... */
NvSciIpcCloseEndpoint(ipcEndpoint);
NvSciIpcDeinit();
NvSciSync Attributes
NvSciSync clients must supply the properties and constraints of an NvSciSync object to NvSciSync before allocating the object. This is expressed with attributes. An attribute is a key-value pair. The header files list all supported keys, together with the value types that can be used with them.
Each application wanting to use a sync object indicates its needs in the form of various attributes before the sync object is created. Those attributes are then communicated to the signaler, who gathers all applications' attributes and has NvSciSync reconcile them. Successful reconciliation creates a new attribute list satisfying all applications' constraints. The signaler then allocates a sync object using resources described by those attributes. This sync object, together with the reconciled attribute list, is then shared with all waiters that need access to it.
NvSciSync Attributes List
Attributes coming from a single source are kept in an attribute list structure.
NvSciSyncAttrList
NvSciSyncAttrList signalerAttrList = NULL;
NvSciError err;
/* create a new, empty attribute list */
err = NvSciSyncAttrListCreate(module, &signalerAttrList);
if (err != NvSciError_Success) {
goto fail;
}
/*
* fill the list - this example corresponds to a CPU signaler
* that only needs to signal but not wait
*/
NvSciSyncAttrKeyValuePair keyValue[2] = {0};
bool cpuSignaler = true;
keyValue[0].attrKey = NvSciSyncAttrKey_NeedCpuAccess;
keyValue[0].value = (void*) &cpuSignaler;
keyValue[0].len = sizeof(cpuSignaler);
NvSciSyncAccessPerm cpuPerm = NvSciSyncAccessPerm_SignalOnly;
keyValue[1].attrKey = NvSciSyncAttrKey_RequiredPerm;
keyValue[1].value = (void*) &cpuPerm;
keyValue[1].len = sizeof(cpuPerm);
err = NvSciSyncAttrListSetAttrs(signalerAttrList, keyValue, 2);
if (err != NvSciError_Success) {
goto fail;
}
/* ... */
NvSciSyncAttrListFree(signalerAttrList);
Reconciliation
After gathering all attribute lists, the signaler application must reconcile them. Successful reconciliation results in a new, reconciled attribute list that satisfies all applications' requirements. If NvSciSync cannot create such a list because attributes contradict, it instead creates an attribute list that describes the conflicts in more detail. In the example below, assume that waiterAttrList1 and waiterAttrList2 were created in the same process, so the variables are visible.
NvSciSyncAttrList unreconciledList[3] = {NULL};
NvSciSyncAttrList reconciledList = NULL;
NvSciSyncAttrList newConflictList = NULL;
NvSciError err;
unreconciledList[0] = signalerAttrList;
unreconciledList[1] = waiterAttrList1;
unreconciledList[2] = waiterAttrList2;
err = NvSciSyncAttrListReconcile(
unreconciledList, /* array of unreconciled lists */
3, /* size of this array */
&reconciledList, /* output reconciled list */
&newConflictList); /* conflict description filled in case of reconciliation failure */
if (err != NvSciError_Success) {
goto fail;
}
/* ... */
NvSciSyncAttrListFree(reconciledList);
NvSciSyncAttrListFree(newConflictList);
NvSciSync recognizes which attribute lists are reconciled and which are not. Some NvSciSync APIs that take NvSciSyncAttrList expect the list to be reconciled.
Inter-Application
The above example assumes that waiterAttrList1 and waiterAttrList2 were received from the waiter applications; hence, no cross-process semantics were required. If a waiter lives in another process, the attribute list must first be exported to a descriptor, communicated via NvSciIpc, and then imported on the receiver's end. In some cases, multiple lists travel through multiple NvSciIpc channels.
Export/Import NvSciSyncattrList
/* waiter */
NvSciSyncAttrList waiterAttrList = NULL;
void* waiterListDesc = NULL;
size_t waiterListDescSize = 0U;
NvSciError err;
/* creation of the attribute list, receiving other lists from other listeners */
err = NvSciSyncAttrListIpcExportUnreconciled(
&waiterAttrList, /* array of unreconciled lists to be exported */
1, /* size of the array */
ipcEndpoint, /* valid and opened NvSciIpcEndpoint intended to send the descriptor through */
&waiterListDesc, /* The descriptor buffer to be allocated and filled in */
&waiterListDescSize ); /* size of the newly created buffer */
if (err != NvSciError_Success) {
goto fail;
}
/* send the descriptor to the signaler */
NvSciSyncAttrListFreeDesc(waiterListDesc);
/* signaler */
void* waiterListDesc = NULL;
size_t waiterListDescSize = 0U;
NvSciSyncAttrList unreconciledList[2] = {NULL};
NvSciSyncAttrList reconciledList = NULL;
NvSciSyncAttrList newConflictList = NULL;
NvSciSyncAttrList signalerAttrList = NULL;
NvSciSyncAttrList importedUnreconciledAttrList = NULL;
/* create the local signalerAttrList */
/* receive the descriptor from the waiter */
err = NvSciSyncAttrListIpcImportUnreconciled(module, ipcEndpoint,
waiterListDesc, waiterListDescSize,
&importedUnreconciledAttrList);
if (err != NvSciError_Success) {
goto fail;
}
/* gather all the lists into an array and reconcile */
unreconciledList[0] = signalerAttrList;
unreconciledList[1] = importedUnreconciledAttrList;
err = NvSciSyncAttrListReconcile(unreconciledList, 2, &reconciledList,
&newConflictList);
if (err != NvSciError_Success) {
goto fail;
}
/* ... */
NvSciSyncAttrListFree(importedUnreconciledAttrList);
NvSciSyncAttrListFree(reconciledList);
Sync Management
The following sections describe sync management.
NvSciSync Objects
The NvSciSyncObj structure represents a sync object. It can be allocated after successful reconciliation. The reconciled attribute list contains all information about resources needed for the allocation.
NvSciSyncObj
/* Create NvSciSync object and get the syncObj */
NvSciError err;
err = NvSciSyncObjAlloc(reconciledList, &syncObj);
if (err != NvSciError_Success) {
goto fail;
}
/* using the object, sharing it with others */
NvSciSyncAttrListFree(reconciledList);
NvSciSyncObjFree(syncObj);
Inter-Application
In the cross-process case, the reconciled list and object must be shared with all the waiters.
Export/Import NvSciSyncObj
/* signaler */
void* objAndList;
size_t objAndListSize;
NvSciError err;
err = NvSciSyncIpcExportAttrListAndObj(
syncObj, /* syncObj to be exported (the reconciled list is inside it) */
NvSciSyncAccessPerm_WaitOnly, /* permissions we want the receiver to have. Setting this to NvSciSyncAccessPerm_Auto allows NvSciSync to automatically determine necessary permissions */
ipcEndpoint, /* IpcEndpoint via which the object is to be exported */
&objAndList, /* descriptor of the object and list to be communicated */
&objAndListSize); /* size of the descriptor */
/* send via Ipc */
NvSciSyncAttrListAndObjFreeDesc(objAndList);
/* waiter */
void* objAndList;
size_t objAndListSize;
err = NvSciSyncIpcImportAttrListAndObj(
module, /* NvSciSyncModule used to create the original unreconciled lists in the waiter */
ipcEndpoint, /* ipcEndpoint from which the descriptor was received */
objAndList, /* the descriptor of the sync obj and associated reconciled attribute list received from the signaler */
objAndListSize, /* size of the descriptor */
&waiterAttrList, /* the array of original unreconciled lists prepared in the waiter */
1, /* size of the array */
NvSciSyncAccessPerm_WaitOnly, /* permissions expected by the waiter. Setting this to NvSciSyncAccessPerm_Auto allows NvSciSync to automatically determine necessary permissions */
10000U, /* timeout in microseconds. Some primitives might require time to transport all needed resources */
&syncObj); /* sync object generated from the descriptor on the waiter's side */
/* use the sync object, perhaps export it to more peers... */
NvSciSyncObjFree(syncObj);
CPU Wait Contexts
NvSciSync can be used to wait on a fence from the CPU, but this might require additional resources. Allocating those resources is controlled by the application and encapsulated in the NvSciSyncCpuWaitContext structure. In the initialization phase, a waiter allocates this structure. It can then be used to wait on any number of sync fences, but it cannot be used from multiple threads at the same time:
NvSciSyncCpuWaitContext
/* waiter */
NvSciSyncCpuWaitContext waitContext = NULL;
NvSciError err;
/* initialize module */
err = NvSciSyncCpuWaitContextAlloc(module, &waitContext);
if (err != NvSciError_Success) {
goto fail;
}
/* more initialization, using the context for fence waiting */
NvSciSyncCpuWaitContextFree(waitContext);
NvSciSyncFence Operations
After successfully allocating an object and exporting it to all the waiters, the application can proceed to the runtime phase. Typically, this is a loop where the signaler prepares its job, associates its completion with a sync fence, and shares the fence with a waiter. The waiter constructs its job in such a way that it only starts after the fence expires. Both the waiter and the signaler then enqueue their jobs, and the fences establish ordering: the waiter's job starts only after the signaler's job completes.
NvSciSyncFence Cpu operations
/* signaler*/
NvSciSyncFence sharedFence = NvSciSyncFenceInitializer;
NvSciSyncFence localFence = NvSciSyncFenceInitializer; /* always initialize with the Initializer */
NvSciError err;
err = NvSciSyncObjGenerateFence(syncObj, &localFence);
if (err != NvSciError_Success) {
goto fail;
}
/* duplicate fence before sharing */
err = NvSciSyncFenceDup(&localFence, &sharedFence);
if (err != NvSciError_Success) {
goto fail;
}
/* create more duplicates if necessary */
/* communicate the fence to the waiter. */
/* local copy no longer necessary, so dispose of it */
NvSciSyncFenceClear(&localFence); /* this call cleans references to sync object and is needed for proper freeing */
/* do something else, like some CPU job */
err = NvSciSyncObjSignal(syncObj);
if (err != NvSciError_Success) {
goto fail;
}
/* waiter */
/* receive the sharedFence from the signaler */
err = NvSciSyncFenceWait(&sharedFence,
waitContext, NV_WAIT_INFINITE);
if (err != NvSciError_Success) {
return err;
}
NvSciSyncFenceClear(&sharedFence);
Inter-Application
The above examples assume an inter-thread case. In an inter-process case, the fence must be exported and imported, similar to how the attribute lists were packaged.
Export/Import NvSciSyncFence
/* signaler*/
NvSciSyncFenceIpcExportDescriptor fenceDesc;
NvSciError err;
/* generate sharedFence */
err = NvSciSyncIpcExportFence(
&sharedFence, /* fence to be exported */
ipcEndpoint, /* should be the same ipcEndpoint used for communicating the attribute lists */
&fenceDesc); /* fence descriptor has a fixed size and is only filled in this call */
if (err != NvSciError_Success) {
return err;
}
NvSciSyncFenceClear(&sharedFence);
/* send the descriptor via Ipc */
/* waiter */
/* receive the descriptor fenceDesc */
err = NvSciSyncIpcImportFence(syncObj,
&fenceDesc,
&syncFence);
if (err != NvSciError_Success) {
return err;
}
Fences are designed to be small, fixed-size objects, and interactions with them do not involve any runtime allocation. All fence structures and fence descriptors are allocated once at initialization. During runtime, NvSciSync only updates the fence and related structures, as needed.
Timestamps
NvSciSync supports timestamps in fences. A timestamp records the exact time of a fence's expiration, which helps in profiling streaming timing and debugging performance issues. This feature can be enabled during the initialization of the sync object. The waiter can then call NvSciSyncFenceGetTimestamp to obtain the timestamp data in the streaming phase.
To enable this feature, the waiter should set NvSciSyncAttrKey_WaiterRequireTimestamps to true in its attribute list.
Waiter requires timestamp
bool requireTimestamps = true;
NvSciSyncAttrKeyValuePair keyValue[] = {
... // other attributes
{ .attrKey = NvSciSyncAttrKey_WaiterRequireTimestamps,
.value = (void*) &requireTimestamps,
.len = sizeof(bool),
},
};
During reconciliation, if the waiter requires timestamps and the signaler supports them, the feature is enabled. If the waiter does not require timestamps, the feature is disabled. If the waiter requires timestamps but the signaler does not support them, reconciliation fails.
During the streaming phase, the waiter can obtain the timestamp value by calling NvSciSyncFenceGetTimestamp on an expired fence.
Waiter gets timestamp
uint64_t timestamp;
err = NvSciSyncFenceWait(&fence, waitContext, NV_WAIT_INFINITE);
if (err != NvSciError_Success) {
return err;
}
err = NvSciSyncFenceGetTimestamp(&fence, &timestamp);
if (err != NvSciError_Success) {
return err;
}
UMD Access
The following sections describe UMD access.
NvMedia
NvMedia provides a set of interfaces to submit tasks to each of the hardware engines it supports. For example, the NvMedia2D* set of APIs provides the functionality to submit tasks to the VIC engine. Similarly, the NvMediaISP* set of APIs provides the functionality to submit tasks to the ISP hardware engine. Engine-specific sets of APIs are extended to support NvSciSync. They provide the following functionality:
Take in an NvSciSyncFence as input when the engine acts as a waiter.
Give out an NvSciSyncFence as output when the engine acts as a signaler.
The following section uses NvMedia2D-NvSciSync APIs to demonstrate the usage of NvMedia-NvSciSync APIs. A similar set of APIs for other engines can be used in the same way.
Definitions
EOF Fence: End of frame fence. A fence whose expiry indicates that the output image has been written.
PREFence: Pre-fence. The start of the engine operation is blocked until this fence expires.
NvMedia2D-NvSciSync
1. Query NvSciSyncObj attributes (for waiting or signaling) from NvMedia2D.
Use the NvMedia2DFillNvSciSyncAttrList API to query the NvSciSync attributes from NvMedia2D. Only NvSciSync objects allocated with such NvSciSyncAttrLists are accepted by the NvMedia2D-NvSciSync APIs.
2. NvSciSync object registration/unregistration with NvMedia2D.
Use NvMedia2DRegisterNvSciSyncObj API to register the NvSciSync objects with NvMedia2D. Every NvSciSyncObj used by NvMedia2D must be registered upfront with NvMedia2D.
During tear down, use NvMedia2DUnRegisterNvSciSyncObj API to unregister the registered NvSciSyncObjs with NvMedia2D.
3. Set NvSciSyncObj for end of frame (EOF) event usage with NvMedia2D.
Use the NvMedia2DSetNvSciSyncObjforEOF API to tell NvMedia2D which NvSciSyncObj to use to signal the EOF event of the NvMedia2DBlitEx operation. An NvSciSyncObj must be set before calling the NvMedia2DBlitEx API.
4. Wait for an NvSciSyncFence.
Use the NvMedia2DInsertPreNvSciSyncFence API to tell NvMedia2D to wait on an NvSciSyncFence before actually starting the VIC engine to work on the task submitted by the NvMedia2DBlitEx API.
5. Get an NvSciSyncFence.
Use NvMedia2DGetEOFNvSciSyncFence API to get an NvSciSyncFence whose expiry indicates that the last submitted NvMedia2DBlitEx task has completed. NvMedia2DGetEOFNvSciSyncFence API can be called only after an NvMedia2DBlitEx API call.
NvMedia2D NvSciSync API Usage
/* ************ Init-time **********/
NvSciSyncModule nvscisyncModule;
NvSciError nverr;
NvSciSyncAttrList nvscisyncattr_w;
NvSciSyncAttrList nvscisyncattr_s;
NvSciSyncAttrList nvscisyncattr_unreconciled_h[2];
NvSciSyncAttrList nvscisyncattr_reconciled;
NvSciSyncAttrList ConflictAttrList;
NvSciSyncFence eofnvscisyncfence = NvSciSyncFenceInitializer;
NvSciSyncObj nvscisyncEOF, nvscisyncpre;
NvMediaDevice *nvmdevice;   /* assumed created earlier */
NvMedia2D *nvm2dhdl;
NvMediaStatus nvmstatus;
nvm2dhdl = NvMedia2DCreate(nvmdevice);
nverr = NvSciSyncModuleOpen(&nvscisyncModule);
/**********NvMedia 2D as signaler ************/
nverr = NvSciSyncAttrListCreate(nvscisyncModule, &nvscisyncattr_s);
nvmstatus = NvMedia2DFillNvSciSyncAttrList(nvscisyncattr_s, NVMEDIA_SIGNALER);
nvscisyncattr_unreconciled_h[0] = nvscisyncattr_s;
nvscisyncattr_unreconciled_h[1] = waiterattrlist; /* attribute list obtained from the appropriate waiter */
nverr = NvSciSyncAttrListReconcile(nvscisyncattr_unreconciled_h,
2, &nvscisyncattr_reconciled, &ConflictAttrList);
nverr = NvSciSyncObjAlloc(nvscisyncattr_reconciled, &nvscisyncEOF);
/**********NvMedia 2D as waiter ************/
nverr = NvSciSyncAttrListCreate(nvscisyncModule, &nvscisyncattr_w);
nvmstatus = NvMedia2DFillNvSciSyncAttrList(nvscisyncattr_w, NVMEDIA_WAITER);
/* If the signaler is also in the same process as the 2D waiter, then the
NvSciSyncAttrListReconcileAndObjAlloc API, or the NvSciSyncAttrListReconcile and
NvSciSyncObjAlloc API pair, has to be used to allocate the nvscisyncpre NvSciSyncObject.
If the signaler is in a different process/VM than the 2D waiter, then the
NvSciSyncAttrList export/import APIs and the NvSciSyncObjIpc export/import APIs
have to be used to allocate an NvSciSyncObject on the signaler and waiter sides.
nvscisyncpre is the imported NvSciSyncObject on the waiter side */
/* All the NvSciSyncObjects (those associated with pre-fences and the EOF
fence) that will be used by NvMedia2D must be registered upfront. */
/* **********Start of Registration of NvSciSync Objects ************/
nvmstatus = NvMedia2DRegisterNvSciSyncObj(nvm2dhdl, NVMEDIA_EOFSYNCOBJ, nvscisyncEOF);
/* Register all the NvSciSync objects which will be used to generate prefences for
NvMedia2DBlit operation. nvscisyncpre is one such Pre NvSciSync object */
nvmstatus = NvMedia2DRegisterNvSciSyncObj(nvm2dhdl, NVMEDIA_PRESYNCOBJ, nvscisyncpre);
/* **********End of Registration of NvSciSync Objects ************/
/* Allocate an NvMediaImage for input, say inputimg */
/* Allocate an NvMediaImage for output, say outputimg */
/* ******End of Init-time and Start of Run-time ********* */
nvmstatus = NvMedia2DSetNvSciSyncObjforEOF(nvm2dhdl, nvscisyncEOF);
/* Get an NvSciSyncFence from somewhere (perhaps the eofnvscisyncfence of
some other engine operation) that needs to be inserted as a pre-fence
for the 2D blit operation. prenvscisyncfence is one such NvSciSyncFence. */
nvmstatus = NvMedia2DInsertPreNvSciSyncFence(nvm2dhdl, prenvscisyncfence);
nvmstatus = NvMedia2DBlitEx(nvm2dhdl, outputimg, NULL, inputimg, NULL,
blitparams, paramsout);
nvmstatus = NvMedia2DGetEOFNvSciSyncFence(nvm2dhdl, nvscisyncEOF, &eofnvscisyncfence);
/* eofnvscisyncfence may be used as a pre-fence for some other engine operation,
or the application can wait on the CPU for its expiry using the NvSciSyncFenceWait API. */
/* ************* End of Run time **************** */
/* Unregister all of the registered NvSciSync objects */
nvmstatus = NvMedia2DUnRegisterNvSciSyncObj(nvm2dhdl, nvscisyncEOF);
nvmstatus = NvMedia2DUnRegisterNvSciSyncObj(nvm2dhdl, nvscisyncpre);
NvSciSyncAttrListFree(nvscisyncattr_w);
NvSciSyncAttrListFree(nvscisyncattr_s);
NvSciSyncAttrListFree(nvscisyncattr_reconciled);
NvSciSyncObjFree(nvscisyncEOF);
NvSciSyncObjFree(nvscisyncpre);
NvSciSyncModuleClose(nvscisyncModule);
CUDA
CUDA supports NvSciSync by enabling applications to signal and wait on NvSciSync objects on a CUDA stream. (Signaling an NvSciSync object is similar to cudaEventRecord, and waiting on one is similar to issuing cudaStreamWaitEvent.) CUDA treats NvSciSync as an external semaphore object of type cudaExternalSemaphoreHandleType, which can be imported into the CUDA address space. The application can use the existing cudaExternalSemaphore APIs to build dependencies between an NvSciSync object and CUDA streams, and vice versa. Because cudaExternalSemaphore APIs are treated as regular stream operations, CUDA-NvSciSync interop follows regular stream semantics.
CUDA APIs
Query NvSciSyncObj attributes (for waiting or signaling) from CUDA
Use cudaDeviceGetNvSciSyncAttributes API to query the NvSciSync attributes from CUDA for a given CUDA device. NvSciSyncAttrLists passed to this API must be allocated and managed by the application.  
NvSciSync object registration/unregistration with CUDA
Use the cudaImportExternalSemaphore API to register/import an NvSciSync object into the CUDA address space. This API accepts a valid NvSciSyncObj as a parameter in semHandleDesc. On completion, the returned extSem_out internally holds a reference to the NvSciSyncObj passed in, and must be passed to the APIs listed below.
Use cudaDestroyExternalSemaphore (for runtime) to unregister/destroy an already registered/imported NvSciSync Object from the CUDA address space.
Wait for an NvSciSyncFence.
Use cudaWaitExternalSemaphoresAsync to make all operations enqueued on the CUDA stream (passed as a parameter to this API) wait until the NvSciSyncFence (passed as a parameter via paramsArray) is signaled by the relevant signaler. The wait happens asynchronously on the GPU (i.e., the calling thread returns immediately). Applications can also optionally set the flag CUDA_EXTERNAL_SEMAPHORE_WAIT_SKIP_NVSCIBUF_MEMSYNC to indicate that memory synchronization operations are disabled over all CUDA-NvSciBufs imported into CUDA (in that process); these operations are normally performed by default to ensure data coherency with other importers of the same NvSciBuf memory objects. Use this flag when CUDA-NvSciSync is used to build only control dependencies (i.e., no data sharing between the signaler and waiter).
Get an NvSciSyncFence.
cudaSignalExternalSemaphoresAsync takes a valid NvSciSyncFence as input. Upon return, the fence tracks the completion of all work submitted to the same CUDA stream on which the API was invoked. Waiting on this fence is equivalent to waiting for the completion of all that work. This API ensures that when the dependent work in the stream completes, the NvSciSyncFence is signaled and any waiters blocked on it are unblocked. The signal happens asynchronously on the GPU (i.e., the calling thread returns immediately). Applications can also optionally set the flag CUDA_EXTERNAL_SEMAPHORE_SIGNAL_SKIP_NVSCIBUF_MEMSYNC to indicate that memory synchronization operations are disabled over all CUDA-NvSciBufs imported into CUDA (in that process); these operations are normally performed by default to ensure data coherency with other importers of the same NvSciBuf memory objects. Use this flag when CUDA-NvSciSync is used to build only control dependencies (i.e., no data sharing between the signaler and waiter).
Note:
The cudaWaitExternalSemaphoresAsync and cudaSignalExternalSemaphoresAsync APIs take an array of cudaExternalSemaphore_t and an array of the corresponding wait/signal params structures. This allows the application to enqueue one or more external semaphore objects, each being one of the cudaExternalSemaphoreHandleType types. This is an efficient way to describe a dependency between a CUDA stream and more than one NvSciSyncFence as a single operation.
cudaSignalExternalSemaphoresAsync overwrites the previous contents of NvSciSyncFence passed to it.
CUDA-NvSciSync API Usage
NvSciSyncFence *signalerFence = NULL;
NvSciSyncFence *waiterFence = NULL;
NvSciIpcEndpoint signalerIpcEndpoint = 0;
NvSciIpcEndpoint waiterIpcEndpoint = 0;
NvSciSyncAttrList unreconciledList[2] = {NULL};
NvSciSyncAttrList reconciledList = NULL;
NvSciSyncAttrList newConflictList = NULL;
NvSciSyncAttrList signalerAttrList = NULL;
NvSciSyncAttrList waiterAttrList = NULL;
NvSciSyncAttrList importedWaiterAttrList = NULL;
NvSciSyncObjIpcExportDescriptor objDesc;
NvSciSyncFenceIpcExportDescriptor fenceDesc;
NvSciSyncObj signalObj;
NvSciSyncObj waitObj;
NvSciSyncModule module = NULL;
void* objAndList;
size_t objAndListSize = 0;
void* waiterListDesc;
size_t waiterAttrListSize = 0;
CUcontext signalerCtx = 0;
CUcontext waiterCtx = 0;
int iGPU = 0;
int dGPU = 1;
cudaStream_t signalerCudaStream;
cudaStream_t waiterCudaStream;
cudaExternalSemaphore_t signalerSema, waiterSema;
cudaExternalSemaphoreHandleDesc semaDesc;
cudaExternalSemaphoreSignalParams sigParams;
cudaExternalSemaphoreWaitParams waitParams;
 
 
/*****************INIT PHASE**************************/
err = NvSciSyncModuleOpen(&module);
err = NvSciIpcInit();
err = NvSciIpcOpenEndpoint("ipc_test", &signalerIpcEndpoint);
err = NvSciIpcOpenEndpoint("ipc_test", &waiterIpcEndpoint);
err = NvSciSyncAttrListCreate(module, &signalerAttrList);
err = NvSciSyncAttrListCreate(module, &waiterAttrList);
signalerFence = (NvSciSyncFence *)calloc(1, sizeof(*signalerFence));
waiterFence = (NvSciSyncFence *)calloc(1, sizeof(*waiterFence));
cudaFree(0);
cudaSetDevice(iGPU); // Signaler will be on the iGPU
cuCtxCreate(&signalerCtx, CU_CTX_MAP_HOST, iGPU);
cudaSetDevice(dGPU); // Waiter will be on the dGPU
cuCtxCreate(&waiterCtx, CU_CTX_MAP_HOST, dGPU);
cuCtxPushCurrent(signalerCtx);
cudaStreamCreate(&signalerCudaStream);
cuCtxPopCurrent(&signalerCtx);
cuCtxPushCurrent(waiterCtx);
cudaStreamCreate(&waiterCudaStream);
cuCtxPopCurrent(&waiterCtx);
cuCtxPushCurrent(waiterCtx);
cudaDeviceGetNvSciSyncAttributes(waiterAttrList, dGPU, cudaNvSciSyncAttrWait);
err = NvSciSyncAttrListIpcExportUnreconciled(&waiterAttrList, 1, waiterIpcEndpoint, &waiterListDesc, &waiterAttrListSize);
// Allocate CUDA memory for the waiter, if needed
cuCtxPopCurrent(&waiterCtx);
 
cuCtxPushCurrent(signalerCtx);
cudaDeviceGetNvSciSyncAttributes(signalerAttrList, iGPU, cudaNvSciSyncAttrSignal);
// Allocate CUDA memory for the signaler, if needed
err = NvSciSyncAttrListIpcImportUnreconciled(module, signalerIpcEndpoint, waiterListDesc, waiterAttrListSize, &importedWaiterAttrList);
cuCtxPopCurrent(&signalerCtx);
 
unreconciledList[0] = signalerAttrList;
unreconciledList[1] = importedWaiterAttrList;
err = NvSciSyncAttrListReconcile(unreconciledList, 2, &reconciledList, &newConflictList);
err = NvSciSyncObjAlloc(reconciledList, &signalObj);
// Export Created NvSciSyncObj and attribute list to waiter
err = NvSciSyncIpcExportAttrListAndObj(signalObj, NvSciSyncAccessPerm_WaitOnly, signalerIpcEndpoint, &objAndList, &objAndListSize);
// Import already created NvSciSyncObj into a new NvSciSyncObj
err = NvSciSyncIpcImportAttrListAndObj(module, waiterIpcEndpoint, objAndList, objAndListSize, &waiterAttrList, 1, NvSciSyncAccessPerm_WaitOnly, 1000000, &waitObj);
 
cuCtxPushCurrent(signalerCtx);
semaDesc.type = cudaExternalSemaphoreHandleTypeNvSciSync;
semaDesc.handle.nvSciSyncObj = (void*)signalObj;
cudaImportExternalSemaphore(&signalerSema, &semaDesc);
cuCtxPopCurrent(&signalerCtx);
 
cuCtxPushCurrent(waiterCtx);
semaDesc.type = cudaExternalSemaphoreHandleTypeNvSciSync;
semaDesc.handle.nvSciSyncObj = (void*)waitObj;
cudaImportExternalSemaphore(&waiterSema, &semaDesc);
cuCtxPopCurrent(&waiterCtx);
 
/**********************************************************/
 
/*****************STREAMING PHASE**************************/
cuCtxPushCurrent(signalerCtx);
sigParams.params.nvSciSync.fence = (void*)signalerFence;
sigParams.flags = 0; //Set flags = cudaExternalSemaphoreSignalSkipNvSciBufMemSync if needed
// LAUNCH CUDA WORK ON signalerCudaStream
cudaSignalExternalSemaphoresAsync(&signalerSema, &sigParams, 1, signalerCudaStream);
err = NvSciSyncIpcExportFence(signalerFence, signalerIpcEndpoint, &fenceDesc);
NvSciSyncFenceClear(signalerFence);
cuCtxPopCurrent(&signalerCtx);
 
cuCtxPushCurrent(waiterCtx);
err = NvSciSyncIpcImportFence(waitObj, &fenceDesc, waiterFence);
waitParams.params.nvSciSync.fence = (void*)waiterFence;
waitParams.flags = 0; //Set flags = cudaExternalSemaphoreWaitSkipNvSciBufMemSync if needed
cudaWaitExternalSemaphoresAsync(&waiterSema, &waitParams, 1, waiterCudaStream);
// LAUNCH CUDA WORK ON waiterCudaStream
cudaStreamSynchronize(waiterCudaStream);
cuCtxPopCurrent(&waiterCtx);
 
/**********************************************************/
 
 
/*****************TEAR-DOWN PHASE**************************/
// Drain all outstanding work and release CUDA resources before
// freeing the NvSciSync objects they were imported from
cudaStreamSynchronize(signalerCudaStream);
cudaStreamSynchronize(waiterCudaStream);
cudaDestroyExternalSemaphore(signalerSema);
cudaDestroyExternalSemaphore(waiterSema);
cudaStreamDestroy(signalerCudaStream);
cudaStreamDestroy(waiterCudaStream);
NvSciSyncObjFree(signalObj);
NvSciSyncObjFree(waitObj);
NvSciSyncAttrListFree(reconciledList);
NvSciSyncAttrListFree(newConflictList);
NvSciSyncAttrListFree(signalerAttrList);
NvSciSyncAttrListFree(waiterAttrList);
NvSciSyncAttrListFree(importedWaiterAttrList);
NvSciSyncModuleClose(module);
NvSciIpcCloseEndpoint(signalerIpcEndpoint);
NvSciIpcCloseEndpoint(waiterIpcEndpoint);
cuCtxDestroy(signalerCtx);
cuCtxDestroy(waiterCtx);
free(signalerFence);
free(waiterFence);
/**********************************************************/
Sample Application
CPU Signaler Usage
NvSciError err;
NvSciSyncAttrList unreconciledList[2] = {NULL};
NvSciSyncAttrList reconciledList = NULL;
NvSciSyncAttrList newConflictList = NULL;
NvSciSyncAttrList signalerAttrList = NULL;
NvSciSyncModule module = NULL;
NvSciSyncObj syncObj = NULL;
NvSciSyncAttrList importedUnreconciledAttrList = NULL;
NvSciSyncFence syncFence = NvSciSyncFenceInitializer;
NvSciIpcEndpoint ipcEndpoint = 0;
NvSciSyncFenceIpcExportDescriptor fenceDesc;
void* waiterAttrListDesc = NULL;
size_t waiterAttrListSize = 0U;
void* objAndListDesc = NULL;
size_t objAndListSize = 0U;
NvSciSyncAttrKeyValuePair keyValue[2] = {0};
bool cpuSignaler = true;
NvSciSyncAccessPerm cpuPerm;

/* Initialize NvSciIpc */
err = NvSciIpcInit();
if (err != NvSciError_Success) {
    goto fail;
}
err = NvSciIpcOpenEndpoint("example", &ipcEndpoint);
if (err != NvSciError_Success) {
    goto fail;
}
/* Signaler Setup/Init phase */
/* Initialize the NvSciSync module */
err = NvSciSyncModuleOpen(&module);
if (err != NvSciError_Success) {
    goto fail;
}
/* Create the local attribute list and fill it with CPU signaler attributes */
err = NvSciSyncAttrListCreate(module, &signalerAttrList);
if (err != NvSciError_Success) {
    goto fail;
}
cpuSignaler = true;
keyValue[0].attrKey = NvSciSyncAttrKey_NeedCpuAccess;
keyValue[0].value = (void*) &cpuSignaler;
keyValue[0].len = sizeof(cpuSignaler);
cpuPerm = NvSciSyncAccessPerm_SignalOnly;
keyValue[1].attrKey = NvSciSyncAttrKey_RequiredPerm;
keyValue[1].value = (void*) &cpuPerm;
keyValue[1].len = sizeof(cpuPerm);
err = NvSciSyncAttrListSetAttrs(signalerAttrList, keyValue, 2);
if (err != NvSciError_Success) {
    goto fail;
}
/* receive waiterAttrListSize */
/* receive waiterAttrListDesc */
err = NvSciSyncAttrListIpcImportUnreconciled(
    module, ipcEndpoint,
    waiterAttrListDesc, waiterAttrListSize,
    &importedUnreconciledAttrList);
if (err != NvSciError_Success) {
    goto fail;
}
unreconciledList[0] = signalerAttrList;
unreconciledList[1] = importedUnreconciledAttrList;
/* Reconcile the signaler and waiter NvSciSyncAttrLists */
err = NvSciSyncAttrListReconcile(
    unreconciledList, 2, &reconciledList,
    &newConflictList);
if (err != NvSciError_Success) {
    goto fail;
}
/* Create the NvSciSync object and get the syncObj */
err = NvSciSyncObjAlloc(reconciledList, &syncObj);
if (err != NvSciError_Success) {
    goto fail;
}
/* Export the attribute list and object to the waiter */
err = NvSciSyncIpcExportAttrListAndObj(
    syncObj,
    NvSciSyncAccessPerm_WaitOnly, ipcEndpoint,
    &objAndListDesc, &objAndListSize);
if (err != NvSciError_Success) {
    goto fail;
}
/* send objAndListSize */
/* send objAndListDesc */
/* Signaler's streaming phase */
err = NvSciSyncObjGenerateFence(syncObj, &syncFence);
if (err != NvSciError_Success) {
    goto fail;
}
err = NvSciSyncIpcExportFence(&syncFence, ipcEndpoint, &fenceDesc);
if (err != NvSciError_Success) {
    goto fail;
}
/* send fenceDesc */
NvSciSyncFenceClear(&syncFence);
/* do the job that the waiter is supposed to wait on */
NvSciSyncObjSignal(syncObj);
/* cleanup */
fail:
/* Free descriptors */
free(objAndListDesc);
free(waiterAttrListDesc);
/* Free NvSciSyncObj */
NvSciSyncObjFree(syncObj);
/* Free attribute list objects */
NvSciSyncAttrListFree(reconciledList);
NvSciSyncAttrListFree(newConflictList);
NvSciSyncAttrListFree(signalerAttrList);
NvSciSyncAttrListFree(importedUnreconciledAttrList);
/* Deinitialize the NvSciSync module */
NvSciSyncModuleClose(module);
/* Deinitialize NvSciIpc */
NvSciIpcCloseEndpoint(ipcEndpoint);
NvSciIpcDeinit();
CPU Waiter
NvSciError err;
NvSciSyncModule module = NULL;
NvSciSyncAttrList waiterAttrList = NULL;
void* waiterAttrListDesc = NULL;
size_t waiterAttrListSize = 0U;
NvSciSyncObj syncObj = NULL;
void* objAndListDesc = NULL;
size_t objAndListSize = 0U;
NvSciSyncCpuWaitContext waitContext = NULL;
NvSciSyncFence syncFence = NvSciSyncFenceInitializer;
NvSciSyncFenceIpcExportDescriptor fenceDesc;
NvSciIpcEndpoint ipcEndpoint = 0;
bool cpuWaiter = true;
NvSciSyncAttrKeyValuePair keyValue[2] = {0};
NvSciSyncAccessPerm cpuPerm = NvSciSyncAccessPerm_WaitOnly;

err = NvSciIpcInit();
if (err != NvSciError_Success) {
    goto fail;
}
err = NvSciIpcOpenEndpoint("example", &ipcEndpoint);
if (err != NvSciError_Success) {
    goto fail;
}
/* Waiter Setup/Init phase */
/* Initialize the NvSciSync module */
err = NvSciSyncModuleOpen(&module);
if (err != NvSciError_Success) {
    goto fail;
}
/* Allocate a CPU wait context for NvSciSyncFenceWait() */
err = NvSciSyncCpuWaitContextAlloc(module, &waitContext);
if (err != NvSciError_Success) {
    goto fail;
}
/* Create the waiter's NvSciSyncAttrList and fill it with CPU waiter attributes */
err = NvSciSyncAttrListCreate(module, &waiterAttrList);
if (err != NvSciError_Success) {
    goto fail;
}
cpuWaiter = true;
keyValue[0].attrKey = NvSciSyncAttrKey_NeedCpuAccess;
keyValue[0].value = (void*) &cpuWaiter;
keyValue[0].len = sizeof(cpuWaiter);
cpuPerm = NvSciSyncAccessPerm_WaitOnly;
keyValue[1].attrKey = NvSciSyncAttrKey_RequiredPerm;
keyValue[1].value = (void*) &cpuPerm;
keyValue[1].len = sizeof(cpuPerm);
err = NvSciSyncAttrListSetAttrs(waiterAttrList, keyValue, 2);
if (err != NvSciError_Success) {
    goto fail;
}
/* Export the waiter's NvSciSyncAttrList */
err = NvSciSyncAttrListIpcExportUnreconciled(
    &waiterAttrList, 1,
    ipcEndpoint,
    &waiterAttrListDesc, &waiterAttrListSize);
if (err != NvSciError_Success) {
    goto fail;
}
/* send waiterAttrListSize */
/* send waiterAttrListDesc */
/* receive objAndListSize */
/* receive objAndListDesc */
err = NvSciSyncIpcImportAttrListAndObj(
    module, ipcEndpoint,
    objAndListDesc, objAndListSize,
    &waiterAttrList, 1,
    NvSciSyncAccessPerm_WaitOnly, 10000U, &syncObj);
if (err != NvSciError_Success) {
    goto fail;
}
/* Waiter streaming phase */
/* receive fenceDesc */
err = NvSciSyncIpcImportFence(
    syncObj,
    &fenceDesc,
    &syncFence);
if (err != NvSciError_Success) {
    goto fail;
}
err = NvSciSyncFenceWait(
    &syncFence,
    waitContext, 30000U);
if (err != NvSciError_Success) {
    goto fail;
}
NvSciSyncFenceClear(&syncFence);
/* cleanup */
fail:
free(waiterAttrListDesc);
free(objAndListDesc);
NvSciSyncAttrListFree(waiterAttrList);
NvSciSyncObjFree(syncObj);
NvSciSyncCpuWaitContextFree(waitContext);
/* Deinitialize the NvSciSync module */
NvSciSyncModuleClose(module);
/* Deinitialize NvSciIpc */
NvSciIpcCloseEndpoint(ipcEndpoint);
NvSciIpcDeinit();