CUDA NVSHMEM Interoperability

This section describes key CUDA and NVSHMEM API interoperability considerations when developing applications with the NVSHMEM runtime.

Using CUDA Streams APIs

As recommended by the CUDA Toolkit documentation, users are encouraged to use the asynchronous APIs cudaMemcpyAsync and cudaMemsetAsync instead of the synchronous cudaMemcpy and cudaMemset, because applications can be affected by the subtle synchronization behavior of the synchronous versions. For NVSHMEM applications, the following usage can lead to incorrect behavior:

cudaMemcpy(...);          // target is device memory
app_kernel<<<...>>>();    // kernel uses NVSHMEM APIs on the target of the previous cudaMemcpy() and can access stale data

Here is the correct usage:

cudaMemcpyAsync(..., stream);
cudaStreamSynchronize(stream);   // ensures the copy has completed before the kernel reads the data
app_kernel<<<...>>>();
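
The following is a minimal end-to-end sketch of this pattern. The buffer name dest, the sizes, and the kernel body are illustrative assumptions rather than part of the NVSHMEM API; the relevant point is that cudaStreamSynchronize() completes the asynchronous copy before the kernel operates on the symmetric-heap buffer.

#include <cuda_runtime.h>
#include <nvshmem.h>

__global__ void app_kernel(int *dest, int npes) {
    // Illustrative kernel body: read the freshly copied value and
    // put it to the next PE's symmetric buffer.
    int mype = nvshmem_my_pe();
    int peer = (mype + 1) % npes;
    nvshmem_int_p(dest, dest[0] + mype, peer);
}

int main(void) {
    nvshmem_init();
    int npes = nvshmem_n_pes();

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Symmetric destination buffer and a host source value (illustrative).
    int *dest = (int *) nvshmem_malloc(sizeof(int));
    int host_val = 42;

    // Asynchronous copy into the symmetric heap, then synchronize the
    // stream before any kernel uses NVSHMEM APIs on the copied data.
    cudaMemcpyAsync(dest, &host_val, sizeof(int), cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    app_kernel<<<1, 1>>>(dest, npes);
    cudaDeviceSynchronize();

    // Ensure all PEs have completed their puts before freeing the buffer.
    nvshmem_barrier_all();

    nvshmem_free(dest);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}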

NVSHMEM sets the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS attribute on the symmetric heap, which automatically synchronizes synchronous CUDA memory operations that target it. As a result, the application does not need to call cudaDeviceSynchronize(). Starting with CUDA 11.3, NVSHMEM uses the CUDA VMM API for the symmetric heap. Support for synchronous memory operations on a symmetric heap created with the CUDA VMM API was added in CUDA 12.1 and NVSHMEM 2.10.1.

When NVSHMEM uses CUDA VMM and the CUDA version is earlier than 12.1, the application must call cudaDeviceSynchronize() explicitly to achieve behavioral parity with later CUDA versions, which synchronize these memory operations automatically. Additionally, users must be careful when combining these asynchronous operations with GPUDirect Async data transfers: without a corresponding device synchronization or barrier operation, race conditions can occur.
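
As a rough illustration of that workaround, the sketch below guards the explicit synchronization with a compile-time CUDART_VERSION check; this is only an approximation, since whether NVSHMEM actually uses the VMM-backed symmetric heap depends on the NVSHMEM build and runtime configuration. The names dest, host_val, app_kernel, and npes are the illustrative ones from the earlier sketch.

// Sketch only: dest is a symmetric-heap buffer populated by a
// synchronous cudaMemcpy(). With a VMM-backed heap and CUDA earlier
// than 12.1, the copy is not automatically synchronized, so an
// explicit device synchronization is needed before an NVSHMEM kernel
// reads dest.
cudaMemcpy(dest, &host_val, sizeof(int), cudaMemcpyHostToDevice);
#if CUDART_VERSION < 12010
cudaDeviceSynchronize();   // restores the ordering that CUDA 12.1+ provides automatically
#endif
app_kernel<<<1, 1>>>(dest, npes);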