Device APIs

This section describes some key device API considerations when developing applications with the NVSHMEM runtime.

Device APIs on Peer-to-Peer Transport

Some applications can improve performance for device API calls by adding transport awareness. For peer-to-peer (P2P) transfers, the device APIs use the GPU's Streaming Multiprocessors (SMs) to move data. The best performance is achieved by balancing the work per GPU thread across a pool of thread blocks. Using multiple thread blocks improves the efficiency of data movement compared to a single thread block, which is not sufficient to saturate the peer-to-peer communication bandwidth over NVIDIA® NVLink®. The standard NVSHMEM APIs use one GPU thread to perform the data movement. When moving large amounts of data, one thread might not be efficient for peer-to-peer communication, so applications should use the warp or block variants of the API, such as nvshmemx_TYPENAME_put_block or nvshmemx_TYPENAME_put_warp, instead of nvshmem_TYPENAME_put. For example, the following table describes the behavior difference between these APIs when moving a 128-element chunk from GPU A to GPU B over the P2P transport:

Comparison between put/p/put_block APIs (P2P transport)

Internal Synchronization
- nvshmem_float_put_block on 1 thread block (128 threads): The prolog/epilog of this API contains __syncthreads(), which adds synchronization latency. For large enough data, this overhead is amortized.
- nvshmem_float_put/p on 128 threads: None.
- Recommendation: Prefer put/p over put_block when latency is the primary concern.

Number of Memory Store Operations (ops)
- nvshmem_float_put_block on 1 thread block (128 threads): 32 x 16B stores, because internally 16B-aligned stores are preferred (>> 8B > 4B > 2B > 1B) to move data as efficiently as possible.
- nvshmem_float_put/p on 128 threads: 128 x 4B stores.
- Recommendation: Prefer put_block over put/p to minimize the number of store operations.

Automatic Transport Coalescing
- nvshmem_float_put_block on 1 thread block (128 threads): Yes.
- nvshmem_float_put/p on 128 threads: Yes.
- Recommendation: Use either; contiguous elements in the chunk are automatically coalesced by the transport.
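
As a minimal sketch of this trade-off (not taken from the NVSHMEM distribution; kernel and buffer names are illustrative), the two kernels below move the same 128-element chunk to a peer PE, once with the block-scoped nvshmemx_float_put_block and once with the standard nvshmem_float_put issued by a single thread. The destination is assumed to be a symmetric buffer allocated with nvshmem_malloc on the host:

    #include <nvshmem.h>
    #include <nvshmemx.h>

    // Block-scoped variant: all 128 threads of the block cooperate on one
    // transfer, so the P2P path can issue wide (16B) stores from the SM.
    __global__ void put_block_example(float *dest, const float *src, int peer) {
        nvshmemx_float_put_block(dest, src, 128, peer);
    }

    // Standard variant: a single thread performs the entire 128-element copy,
    // which typically cannot saturate NVLink bandwidth for larger transfers.
    __global__ void put_single_thread_example(float *dest, const float *src, int peer) {
        if (threadIdx.x == 0) {
            nvshmem_float_put(dest, src, 128, peer);
        }
    }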

Applications using device APIs over this transport can characterize performance with the shmem_put_bw benchmark, which is provided as part of the installation, to determine the number of GPU warps and thread blocks required to maximize NVLink utilization and bandwidth.

The application should also exploit the spatial locality and data coalescing capabilities of the P2P transport by allocating or selecting contiguous destination addresses for the target data before dispatching the transfers with the device APIs. For example, nvshmem_p can be efficient for small message transfers if the target addresses are contiguous in memory, because the P2P transport automatically coalesces multiple small messages into one larger transfer.
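
As a hedged illustration of this coalescing behavior (the kernel name is an assumption, not an NVSHMEM sample), the kernel below has 128 threads each issue one nvshmem_float_p to consecutive destination addresses; because the destinations are contiguous, the P2P transport can combine the small writes into larger transfers:

    #include <nvshmem.h>

    // Each of the 128 threads writes a single element with nvshmem_float_p.
    // Contiguous destination addresses allow the P2P transport to coalesce
    // these 4B writes into larger transfers.
    __global__ void p_contiguous_example(float *dest, const float *src, int peer) {
        int i = threadIdx.x;            // launched with 128 threads
        nvshmem_float_p(dest + i, src[i], peer);
    }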

The behavior of the blocking and non-blocking (nvshmemx_TYPENAME_<ops>_nbi) versions of these APIs is identical for this transport.

Device APIs on Proxy-Based Transport

Some applications can improve performance for device API calls by adding transport awareness. For proxy-based transfers, such as IBRC, UCX, and libfabric, the data transfer is offloaded to the NVSHMEM runtime's internal, single CPU proxy thread. This single proxy thread might not be sufficient to drive small message transfers over the remote network, so the application must pack as much data as possible into large messages for proxy-based transports. For example, nvshmem_p might not be efficient for small message transfers because a proxy-based transport will not automatically coalesce data into a large message transfer. In this scenario, the application should pack as much data as possible and use nvshmem_put instead. For example, the following table describes the behavior difference between these APIs when moving a 128-element chunk from GPU A to GPU B over a proxy-based transport:

Comparison between put/p/put_block APIs (proxy-based transport)

Ownership of GPU Threads
- nvshmem_float_put on 1 thread: NVSHMEM exclusively uses 1 thread, while the rest of the threads are available to the user for other operations in the kernel.
- nvshmem_float_put_block on 1 thread block (128 threads): NVSHMEM internally uses 1 thread to issue the send operation, while the remaining threads in the block are idle in the kernel.
- nvshmem_float_put/p on 128 threads: NVSHMEM uses 128 threads, with no threads available to the user for other operations in the kernel.
- Recommendation: Prefer the 1-thread put over the other two APIs when maximizing the number of GPU threads available for non-communication work in the kernel.

Number of Send Operations (ops)
- nvshmem_float_put on 1 thread: 1 x 512B send.
- nvshmem_float_put_block on 1 thread block (128 threads): 1 x 512B send.
- nvshmem_float_put/p on 128 threads: 128 x 4B sends with serial submission of commands to the CPU proxy thread.
- Recommendation: Prefer the 1-thread or 1-thread-block put over the multi-threaded put/p to minimize the number of send operations.
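
The following sketch (illustrative only; the staging buffer and kernel name are assumptions, not part of NVSHMEM) shows the packing pattern recommended above for proxy-based transports: the block gathers its 128 elements into a contiguous buffer, and then a single thread issues one nvshmem_float_put, producing one large send for the proxy thread instead of 128 separate 4B sends:

    #include <nvshmem.h>

    __global__ void packed_put_example(float *dest, float *staging, const float *src, int peer) {
        int i = threadIdx.x;            // launched with 128 threads
        staging[i] = src[i];            // pack (or compute) into a contiguous buffer
        __syncthreads();                // make the packed data visible to thread 0
        if (threadIdx.x == 0) {
            // One 512B transfer is handed to the CPU proxy thread
            // instead of 128 x 4B sends.
            nvshmem_float_put(dest, staging, 128, peer);
        }
    }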

Applications using device APIs over this transport can characterize peak bandwidth and latency using the ib_write_bw and ib_send_lat performance benchmarks, available in the linux-rdma/perftest repository.

The behavior of the blocking and non-blocking (nvshmemx_TYPENAME_<ops>_nbi) versions of these APIs differs. For the non-blocking version, the communication operation (put/get/p/g, and so on) is submitted to the runtime and control returns immediately. If the application needs to ensure that the operation has completed, it must explicitly call nvshmem_quiet to guarantee completion at the initiator side. For the blocking version, control returns to the application only after the operation has completed at the initiator side.
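
A minimal sketch of this difference (the kernel name is illustrative): the non-blocking put returns as soon as the command has been submitted, so independent work can overlap the transfer, and nvshmem_quiet is needed before completion at the initiator can be assumed:

    #include <nvshmem.h>

    __global__ void put_nbi_example(float *dest, const float *src, int peer) {
        if (threadIdx.x == 0) {
            nvshmem_float_put_nbi(dest, src, 128, peer);  // submit and return immediately
            // ... independent computation can overlap with the transfer here ...
            nvshmem_quiet();  // blocks until the put has completed at the initiator side
        }
    }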

Device APIs on IBGDA Transport

Some applications can improve performance for device API calls by adding transport awareness. For IBGDA transfers, the data transfer is offloaded to GPU threads on the device. By using multiple Queue Pairs (QPs) associated with multiple GPU threads (typically a 1:1 ratio), multiple message transfers can be submitted in parallel, which accelerates the message rate beyond what proxy-based transports can achieve. However, the InfiniBand bandwidth achieved at a given message size can still become a limiting factor. To reach peak IB bandwidth and ensure that the IBGDA transport is efficient for remote communication, the application must pack as much data as possible into large messages. For example, nvshmem_p might be inefficient for small message transfers because the IBGDA transport will not automatically coalesce data into a large message transfer. In this scenario, the application should pack as much data as possible and use nvshmem_put instead. For example, the following table describes the behavior difference between these APIs when moving a 128-element chunk from GPU A to GPU B over the IBGDA transport:

Comparison between put/p/put_block APIs (IBGDA transport)

Ownership of GPU Threads
- nvshmem_float_put on 1 thread: NVSHMEM exclusively uses 1 thread, while the rest of the threads are available to the user for other operations in the kernel.
- nvshmem_float_put_block on 1 thread block (128 threads): NVSHMEM uses 128 threads, with no threads available to the user for other operations in the kernel.
- nvshmem_float_put/p on 128 threads: NVSHMEM uses 128 threads, with no threads available to the user for other operations in the kernel.
- Recommendation: Prefer the 1-thread put over the other two APIs when maximizing the number of GPU threads available for non-communication work in the kernel.

Number of Send Operations (ops)
- nvshmem_float_put on 1 thread: 1 x 512B send.
- nvshmem_float_put_block on 1 thread block (128 threads): 1 x 512B send.
- nvshmem_float_put/p on 128 threads: 128 x 4B sends with parallel submission of commands to one or more QPs (as defined by <CONN>_MAP_BY, CONN=DCI, RC).
- Recommendation: Prefer the 1-thread or 1-thread-block put over the multi-threaded put/p to minimize the number of send operations.

Applications can also modify the mapping of QPs to GPU threads and tune for better performance by using the NVSHMEM_IBGDA_DCI_MAP_BY or NVSHMEM_IBGDA_RC_MAP_BY environment variables at runtime.
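
As a rough sketch of how this mapping can be exploited (assumptions: 4 warps per block, a chunk that divides evenly across warps, and a MAP_BY setting such as NVSHMEM_IBGDA_DCI_MAP_BY=warp or NVSHMEM_IBGDA_RC_MAP_BY=warp so that different warps can land on different QPs), each warp below issues its own warp-scoped put, allowing several QPs to be driven in parallel:

    #include <nvshmem.h>
    #include <nvshmemx.h>

    __global__ void ibgda_multi_warp_put(float *dest, const float *src, int peer) {
        const int warp_id  = threadIdx.x / warpSize;  // e.g., 4 warps of 32 threads
        const int per_warp = 32;                      // elements handled per warp
        const int offset   = warp_id * per_warp;
        // Warp-scoped put: the 32 threads of a warp cooperate on one transfer,
        // and different warps may map to different QPs depending on *_MAP_BY.
        nvshmemx_float_put_warp(dest + offset, src + offset, per_warp, peer);
    }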

The behavior of the blocking and non-blocking (nvshmemx_TYPENAME_<ops>_nbi) versions of these APIs differs. For the non-blocking version, after the API is invoked, the communication operation (put/get/p/g, and so on) is submitted to the runtime and control returns immediately. If the application needs to ensure that the operation has completed, it must explicitly call nvshmem_quiet to guarantee completion at the initiator side. For the blocking version, the communication command is submitted and the API internally waits for it to complete; control returns to the application only after the operation has completed at the initiator side.