2. API synchronization behavior
The API provides memcpy/memset functions in both synchronous and asynchronous forms, the latter having an "Async" suffix. The suffix is something of a misnomer, as each function may exhibit synchronous or asynchronous behavior depending on the arguments passed to it.
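To make the naming point concrete, the sketch below (assuming a CUDA-capable device; buffer names are illustrative) contrasts the two forms. Note that the "Async" call can still block when handed a pageable host buffer:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 20;
    // Pageable host allocation (plain malloc, not page-locked).
    float *h_pageable = (float *)std::malloc(bytes);
    float *d_buf;
    cudaMalloc(&d_buf, bytes);

    // Synchronous form: for pageable host-to-device, returns once the
    // buffer has been staged for DMA to the device.
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    // "Async" form: despite the suffix, this call may still synchronize,
    // because h_pageable is pageable and must first be staged through
    // pinned memory by the driver.
    cudaMemcpyAsync(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice, 0);

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    std::free(h_pageable);
    return 0;
}
```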
Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources. Such behavior is subject to change and undocumented behavior should not be relied upon.
Memcpy
In the reference documentation, each memcpy function is categorized as synchronous or asynchronous, corresponding to the definitions below.
Synchronous:
- For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to the final destination may not have completed.
- For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.
- For transfers from device memory to either pageable or pinned host memory, the function returns only once the copy has completed.
- For transfers from device memory to device memory, no host-side synchronization is performed.
- For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
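The distinctions above can be illustrated with a short sketch (assuming a CUDA-capable device; the buffer names are illustrative, and the comments restate the guarantees from the list, not additional ones):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 20;
    float *h_pageable = (float *)std::malloc(bytes);  // pageable host memory
    float *h_pinned, *d_buf;
    cudaMallocHost(&h_pinned, bytes);                 // pinned host memory
    cudaMalloc(&d_buf, bytes);

    // Pageable host -> device: returns once the buffer has been staged
    // for DMA; the DMA to d_buf may still be in flight, so d_buf is not
    // guaranteed populated until the stream has been synchronized.
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned host -> device: synchronous with respect to the host.
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    // Device -> host (pageable or pinned destination): returns only once
    // the copy has completed, so h_pageable holds the data on return.
    cudaMemcpy(h_pageable, d_buf, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    std::free(h_pageable);
    return 0;
}
```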
Asynchronous:
- For transfers between device memory and pageable host memory, the function might be synchronous with respect to the host.
- For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
- If pageable memory must first be staged to pinned memory, the driver may synchronize with the stream and stage the copy into pinned memory.
- For all other transfers, the function should be fully asynchronous.
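The "all other transfers" case is the one most codes rely on for overlap: a pinned host buffer copied in a non-default stream. A minimal sketch, assuming a CUDA-capable device (buffer names are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    float *h_pinned, *d_buf;
    cudaMallocHost(&h_pinned, bytes);  // pinned (page-locked) host memory
    cudaMalloc(&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned host -> device: no pageable staging is needed, so this call
    // falls under "all other transfers" and should be fully asynchronous.
    // It returns immediately while the DMA proceeds in the background.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

    // The host is free to do unrelated work here while the copy is in
    // flight; h_pinned must not be modified until the stream syncs.

    cudaStreamSynchronize(stream);  // wait for the transfer to finish
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```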