4.13. Stream Management

This section describes the stream management functions of the low-level CUDA driver application programming interface.

Functions

CUresult cuStreamAddCallback ( CUstream hStream, CUstreamCallback callback, void* userData, unsigned int  flags )
Add a callback to a compute stream.
CUresult cuStreamAttachMemAsync ( CUstream hStream, CUdeviceptr dptr, size_t length, unsigned int  flags )
Attach memory to a stream asynchronously.
CUresult cuStreamCreate ( CUstream* phStream, unsigned int  Flags )
Create a stream.
CUresult cuStreamCreateWithPriority ( CUstream* phStream, unsigned int  flags, int  priority )
Create a stream with the given priority.
CUresult cuStreamDestroy ( CUstream hStream )
Destroys a stream.
CUresult cuStreamGetFlags ( CUstream hStream, unsigned int* flags )
Query the flags of a given stream.
CUresult cuStreamGetPriority ( CUstream hStream, int* priority )
Query the priority of a given stream.
CUresult cuStreamQuery ( CUstream hStream )
Determine status of a compute stream.
CUresult cuStreamSynchronize ( CUstream hStream )
Wait until a stream's tasks are completed.
CUresult cuStreamWaitEvent ( CUstream hStream, CUevent hEvent, unsigned int  Flags )
Make a compute stream wait on an event.

Functions

CUresult cuStreamAddCallback ( CUstream hStream, CUstreamCallback callback, void* userData, unsigned int  flags )
Add a callback to a compute stream.
Parameters
hStream
- Stream to add callback to
callback
- The function to call once preceding stream operations are complete
userData
- User specified data to be passed to the callback function
flags
- Reserved for future use, must be 0
Description

Adds a callback to be called on the host after all currently enqueued items in the stream have completed. For each cuStreamAddCallback call, the callback will be executed exactly once. The callback will block later work in the stream until it is finished.

The callback may be passed CUDA_SUCCESS or an error code. In the event of a device error, all subsequently executed callbacks will receive an appropriate CUresult.

Callbacks must not make any CUDA API calls. Attempting to use a CUDA API will result in CUDA_ERROR_NOT_PERMITTED. Callbacks must not perform any synchronization that may depend on outstanding device work or other callbacks that are not mandated to run earlier. Callbacks without a mandated order (in independent streams) execute in undefined order and may be serialized.

For the purposes of Unified Memory, callback execution makes a number of guarantees:

  • The callback stream is considered idle for the duration of the callback. Thus, for example, a callback may always use memory attached to the callback stream.

  • The start of execution of a callback has the same effect as synchronizing an event recorded in the same stream immediately prior to the callback. It thus synchronizes streams which have been "joined" prior to the callback.

  • Adding device work to any stream does not have the effect of making the stream active until all preceding callbacks have executed. Thus, for example, a callback might use global attached memory even if work has been added to another stream, if it has been properly ordered with an event.

  • Completion of a callback does not cause a stream to become active except as described above. The callback stream will remain idle if no device work follows the callback, and will remain idle across consecutive callbacks without device work in between. Thus, for example, stream synchronization can be done by signaling from a callback at the end of the stream.

Note:
  • This function uses standard default stream semantics.

  • Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamCreate, cuStreamQuery, cuStreamSynchronize, cuStreamWaitEvent, cuStreamDestroy, cuMemAllocManaged, cuStreamAttachMemAsync, cudaStreamAddCallback

CUresult cuStreamAttachMemAsync ( CUstream hStream, CUdeviceptr dptr, size_t length, unsigned int  flags )
Attach memory to a stream asynchronously.
Parameters
hStream
- Stream in which to enqueue the attach operation
dptr
- Pointer to memory (must be a pointer to managed memory)
length
- Length of memory (must be zero)
flags
- Must be one of CUmemAttach_flags
Description

Enqueues an operation in hStream to specify stream association of length bytes of memory starting from dptr. This function is a stream-ordered operation, meaning that it is dependent on, and will only take effect when, previous work in stream has completed. Any previous association is automatically replaced.

dptr must point to an address within managed memory space declared using the __managed__ keyword or allocated with cuMemAllocManaged.

length must be zero, to indicate that the entire allocation's stream association is being changed. Currently, it's not possible to change stream association for a portion of an allocation.

The stream association is specified using flags which must be one of CUmemAttach_flags. If the CU_MEM_ATTACH_GLOBAL flag is specified, the memory can be accessed by any stream on any device. If the CU_MEM_ATTACH_HOST flag is specified, the program makes a guarantee that it won't access the memory on the device from any stream on a device that has a zero value for the device attribute CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS. If the CU_MEM_ATTACH_SINGLE flag is specified and hStream is associated with a device that has a zero value for the device attribute CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, the program makes a guarantee that it will only access the memory on the device from hStream. It is illegal to attach singly to the NULL stream, because the NULL stream is a virtual global stream and not a specific stream. An error will be returned in this case.

When memory is associated with a single stream, the Unified Memory system will allow CPU access to this memory region so long as all operations in hStream have completed, regardless of whether other streams are active. In effect, this constrains exclusive ownership of the managed memory region by an active GPU to per-stream activity instead of whole-GPU activity.

Accessing memory on the device from streams that are not associated with it will produce undefined results. No error checking is performed by the Unified Memory system to ensure that kernels launched into other streams do not access this region.

It is a program's responsibility to order calls to cuStreamAttachMemAsync via events, synchronization or other means to ensure legal access to memory at all times. Data visibility and coherency will be changed appropriately for all kernels which follow a stream-association change.

If hStream is destroyed while data is associated with it, the association is removed and the association reverts to the default visibility of the allocation as specified at cuMemAllocManaged. For __managed__ variables, the default association is always CU_MEM_ATTACH_GLOBAL. Note that destroying a stream is an asynchronous operation, and as a result, the change to default association won't happen until all work in the stream has completed.

Note:
  • This function uses standard default stream semantics.

  • Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamCreate, cuStreamQuery, cuStreamSynchronize, cuStreamWaitEvent, cuStreamDestroy, cuMemAllocManaged, cudaStreamAttachMemAsync

CUresult cuStreamCreate ( CUstream* phStream, unsigned int  Flags )
Create a stream.
Parameters
phStream
- Returned newly created stream
Flags
- Parameters for stream creation
Description

Creates a stream and returns a handle in phStream. The Flags argument determines behaviors of the stream. Valid values for Flags are:

  • CU_STREAM_DEFAULT: Default stream creation flag.

  • CU_STREAM_NON_BLOCKING: Specifies that work running in the created stream may run concurrently with work in stream 0 (the NULL stream), and that the created stream should perform no implicit synchronization with stream 0.

Note:

Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamDestroy, cuStreamCreateWithPriority, cuStreamGetPriority, cuStreamGetFlags, cuStreamWaitEvent, cuStreamQuery, cuStreamSynchronize, cuStreamAddCallback, cudaStreamCreate, cudaStreamCreateWithFlags

CUresult cuStreamCreateWithPriority ( CUstream* phStream, unsigned int  flags, int  priority )
Create a stream with the given priority.
Parameters
phStream
- Returned newly created stream
flags
- Flags for stream creation. See cuStreamCreate for a list of valid flags
priority
- Stream priority. Lower numbers represent higher priorities. See cuCtxGetStreamPriorityRange for more information about meaningful stream priorities that can be passed.
Description

Creates a stream with the specified priority and returns a handle in phStream. This API alters the scheduler priority of work in the stream. Work in a higher priority stream may preempt work already executing in a low priority stream.

priority follows a convention where lower numbers represent higher priorities. '0' represents default priority. The range of meaningful numerical priorities can be queried using cuCtxGetStreamPriorityRange. If the specified priority is outside the numerical range returned by cuCtxGetStreamPriorityRange, it will automatically be clamped to the lowest or the highest number in the range.

Note:
  • Note that this function may also return error codes from previous, asynchronous launches.

  • Stream priorities are supported only on GPUs with compute capability 3.5 or higher.

  • In the current implementation, only compute kernels launched in priority streams are affected by the stream's priority. Stream priorities have no effect on host-to-device and device-to-host memory operations.

See also:

cuStreamDestroy, cuStreamCreate, cuStreamGetPriority, cuCtxGetStreamPriorityRange, cuStreamGetFlags, cuStreamWaitEvent, cuStreamQuery, cuStreamSynchronize, cuStreamAddCallback, cudaStreamCreateWithPriority

CUresult cuStreamDestroy ( CUstream hStream )
Destroys a stream.
Parameters
hStream
- Stream to destroy
Description

Destroys the stream specified by hStream.

In case the device is still doing work in the stream hStream when cuStreamDestroy() is called, the function will return immediately and the resources associated with hStream will be released automatically once the device has completed all work in hStream.

Note:

Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamCreate, cuStreamWaitEvent, cuStreamQuery, cuStreamSynchronize, cuStreamAddCallback, cudaStreamDestroy

CUresult cuStreamGetFlags ( CUstream hStream, unsigned int* flags )
Query the flags of a given stream.
Parameters
hStream
- Handle to the stream to be queried
flags
- Pointer to an unsigned integer in which the stream's flags are returned The value returned in flags is a logical 'OR' of all flags that were used while creating this stream. See cuStreamCreate for the list of valid flags
Description

Query the flags of a stream created using cuStreamCreate or cuStreamCreateWithPriority and return the flags in flags.

Note:

Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamDestroy, cuStreamCreate, cuStreamGetPriority, cudaStreamGetFlags

CUresult cuStreamGetPriority ( CUstream hStream, int* priority )
Query the priority of a given stream.
Parameters
hStream
- Handle to the stream to be queried
priority
- Pointer to a signed integer in which the stream's priority is returned
Description

Query the priority of a stream created using cuStreamCreate or cuStreamCreateWithPriority and return the priority in priority. Note that if the stream was created with a priority outside the numerical range returned by cuCtxGetStreamPriorityRange, this function returns the clamped priority. See cuStreamCreateWithPriority for details about priority clamping.

Note:

Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamDestroy, cuStreamCreate, cuStreamCreateWithPriority, cuCtxGetStreamPriorityRange, cuStreamGetFlags, cudaStreamGetPriority

CUresult cuStreamQuery ( CUstream hStream )
Determine status of a compute stream.
Parameters
hStream
- Stream to query status of
Description

Returns CUDA_SUCCESS if all operations in the stream specified by hStream have completed, or CUDA_ERROR_NOT_READY if not.

For the purposes of Unified Memory, a return value of CUDA_SUCCESS is equivalent to having called cuStreamSynchronize().

Note:
  • This function uses standard default stream semantics.

  • Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamCreate, cuStreamWaitEvent, cuStreamDestroy, cuStreamSynchronize, cuStreamAddCallback, cudaStreamQuery

CUresult cuStreamSynchronize ( CUstream hStream )
Wait until a stream's tasks are completed.
Parameters
hStream
- Stream to wait for
Description

Waits until the device has completed all operations in the stream specified by hStream. If the context was created with the CU_CTX_SCHED_BLOCKING_SYNC flag, the CPU thread will block until the stream is finished with all of its tasks.

Note:
  • This function uses standard default stream semantics.

  • Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamCreate, cuStreamDestroy, cuStreamWaitEvent, cuStreamQuery, cuStreamAddCallback, cudaStreamSynchronize

CUresult cuStreamWaitEvent ( CUstream hStream, CUevent hEvent, unsigned int  Flags )
Make a compute stream wait on an event.
Parameters
hStream
- Stream to wait
hEvent
- Event to wait on (may not be NULL)
Flags
- Parameters for the operation (must be 0)
Description

Makes all future work submitted to hStream wait until hEvent reports completion before beginning execution. This synchronization will be performed efficiently on the device. The event hEvent may be from a different context than hStream, in which case this function will perform cross-device synchronization.

The stream hStream will wait only for the completion of the most recent host call to cuEventRecord() on hEvent. Once this call has returned, any functions (including cuEventRecord() and cuEventDestroy()) may be called on hEvent again, and subsequent calls will not have any effect on hStream.

If cuEventRecord() has not been called on hEvent, this call acts as if the record has already completed, and so is a functional no-op.

Note:
  • This function uses standard default stream semantics.

  • Note that this function may also return error codes from previous, asynchronous launches.

See also:

cuStreamCreate, cuEventRecord, cuStreamQuery, cuStreamSynchronize, cuStreamAddCallback, cuStreamDestroy, cudaStreamWaitEvent