Device APIs#

This section describes some key device API considerations when developing applications using the NVSHMEM runtime.

Device APIs on Peer-to-Peer Transport#

Some applications can improve the performance of device API calls by adding transport awareness. For peer-to-peer (P2P) transfers, the device APIs use the GPU's Streaming Multiprocessors (SMs) to move data. The best performance is achieved by balancing the work per GPU thread across a pool of thread blocks. Using multiple thread blocks improves the efficiency of data movement compared to a single thread block, which is not sufficient to saturate the peer-to-peer communication bandwidth over NVIDIA® NVLink®. The standard NVSHMEM APIs use one GPU thread to perform the data movement. When moving large amounts of data, one thread might not be efficient for peer-to-peer communication, so applications should use the warp or block variants of the API, such as nvshmemx_TYPENAME_put_block or nvshmemx_TYPENAME_put_warp, instead of nvshmem_TYPENAME_put. For example, the following table describes the behavior difference between these APIs when moving a 128-element chunk from GPU A to GPU B over the P2P transport:

Comparison between put/p/put_block APIs#

| Comparison Parameter | nvshmem_float_put_block on 1 thread block (128 threads) | nvshmem_float_put/p on 128 threads | Recommendation |
| --- | --- | --- | --- |
| Internal synchronization | The prolog/epilog of this API contains __syncthreads(), which adds synchronization latency. For large enough data, this overhead is amortized. | None | Prefer put/p over put_block when latency is the primary concern |
| Number of memory store operations | 32 x 16B stores, because 16B-aligned stores are preferred internally (over 8B > 4B > 2B > 1B) to move data as efficiently as possible | 128 x 4B stores | Prefer put_block over put/p to minimize the number of store operations |
| Automatic transport coalescing | Yes | Yes | Use either; contiguous elements in the chunk are coalesced automatically |
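To make the block-scoped recommendation concrete, the following sketch contrasts the standard single-thread put with the block-scoped variant. The kernel names are illustrative; dst is assumed to be a symmetric buffer allocated with nvshmem_malloc, and src is a local device buffer.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Single-thread variant: one GPU thread drives the whole copy. Simple,
// but one thread cannot saturate NVLink for large chunks.
__global__ void put_single_thread(float *dst, const float *src,
                                  size_t nelems, int peer) {
    if (threadIdx.x == 0 && blockIdx.x == 0)
        nvshmem_float_put(dst, src, nelems, peer);
}

// Block-scoped variant: every thread of the calling block participates in
// the same transfer, letting NVSHMEM issue wide, parallel stores over P2P.
__global__ void put_one_block(float *dst, const float *src,
                              size_t nelems, int peer) {
    nvshmemx_float_put_block(dst, src, nelems, peer);
}
```

For very large transfers, the chunk can also be split across several thread blocks, each calling nvshmemx_float_put_block on its own sub-range, which is one way to realize the pool-of-thread-blocks balancing described above.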

Applications using device APIs over this transport can characterize performance with the shmem_put_bw benchmark, provided as part of the installation, to determine the number of GPU warps and thread blocks needed to maximize NVLink utilization/bandwidth.

The application should also exploit the spatial locality and data-coalescing capabilities of the P2P transport by allocating or selecting contiguously addressed memory destinations for the target data before dispatching the transfers with the device APIs. For example, nvshmem_p can be efficient for small message transfers when the target addresses are contiguous in memory, because the P2P transport automatically coalesces multiple small messages into one larger transfer.
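As a sketch of this layout advice (the kernel and buffer names are illustrative, and dst is assumed to be on the symmetric heap), each thread writes its element to an adjacent destination address so the per-thread 4B stores can coalesce:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Each thread publishes one float to the peer with nvshmem_float_p.
// Because thread i targets dst + i, the destination addresses are
// contiguous and the 4B stores can be coalesced over NVLink; a strided
// destination (for example dst + i * 32) would defeat that coalescing.
__global__ void publish_contiguous(float *dst, const float *local,
                                   size_t nelems, int peer) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < nelems)
        nvshmem_float_p(dst + i, local[i], peer);
}
```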

The behavior of the blocking and non-blocking (nvshmemx_TYPENAME_<ops>_nbi) versions of these APIs is identical for this transport.

Device APIs on Proxy-Based Transport#

Some applications can improve the performance of device API calls by adding transport awareness. For proxy-based transfers (IBRC, UCX, libfabric, and so on), the data transfer is offloaded to the NVSHMEM runtime's single internal CPU proxy thread. A single CPU proxy thread might not be sufficient to drive small message transfers over the remote network, so the application must pack as much data as possible into large messages for proxy-based transports. For example, nvshmem_p might not be efficient for small message transfers because the proxy-based transport does not automatically coalesce data into a large message transfer. In this scenario, the application should pack as much data as possible and use nvshmem_put instead. For example, the following table describes the behavior difference between these APIs when moving a 128-element chunk from GPU A to GPU B over a proxy-based transport:

Comparison between put/p/put_block APIs#

| Comparison Parameter | nvshmem_float_put on 1 thread | nvshmem_float_put_block on 1 thread block (128 threads) | nvshmem_float_put/p on 128 threads | Recommendation |
| --- | --- | --- | --- | --- |
| Ownership of GPU threads | NVSHMEM uses exactly 1 thread; the remaining threads are available to the user for other operations in the kernel | NVSHMEM internally uses 1 thread to issue the send operation, while the remaining threads in the block are idle in the kernel | NVSHMEM uses all 128 threads; no threads are available to the user for other operations in the kernel | Prefer the 1-thread put over the other two APIs when maximizing GPU thread availability for non-communication work in the kernel |
| Number of send operations | 1 x 512B send (the whole 128-element chunk as one message) | 1 x 512B send (the whole 128-element chunk as one message) | 128 x 4B sends, with serial submission of commands to the CPU proxy thread | Prefer the 1-thread or 1-thread-block put over the multi-threaded put to minimize the number of send ops |
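As a sketch of the packing recommendation for proxy-based transports (kernel and buffer names are illustrative; remote is assumed to be a symmetric buffer, staged and local are local device buffers, and the kernel is assumed to be launched with a single 128-thread block):

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Launch with a single thread block of at least nelems threads.
__global__ void pack_and_put(float *remote, float *staged,
                             const float *local, size_t nelems, int peer) {
    size_t i = threadIdx.x;

    // All threads first pack their elements into one contiguous chunk.
    if (i < nelems)
        staged[i] = local[i];
    __syncthreads();

    // A single thread then hands one large message to the CPU proxy,
    // instead of issuing nelems separate 4B nvshmem_float_p transfers.
    if (threadIdx.x == 0)
        nvshmem_float_put(remote, staged, nelems, peer);
}
```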

Applications using device APIs over this transport can characterize peak bandwidth and latency using the ib_write_bw or ib_send_lat benchmarks, available in the linux-rdma/perftest repository.

Device APIs on IBGDA Transport#

Some applications can improve the performance of device API calls by adding transport awareness. For IBGDA transfers, the data transfer is offloaded to GPU threads on the device. By using multiple queue pairs (QPs) associated with multiple GPU threads (typically a 1:1 ratio), the application can submit multiple message transfers concurrently, which accelerates the message rate beyond what proxy-based transports can achieve. However, the InfiniBand bandwidth achieved at a given message size can still be a limiting factor. To reach peak InfiniBand bandwidth and ensure that the IBGDA transport is efficient for remote communication, the application must pack as much data as possible into large messages. For example, nvshmem_p might be inefficient for small message transfers because the IBGDA transport does not automatically coalesce data into a large message transfer. In this scenario, the application should pack as much data as possible and use nvshmem_put instead. For example, the following table describes the behavior difference between these APIs when moving a 128-element chunk from GPU A to GPU B over the IBGDA transport:

Comparison between put/p/put_block APIs#

| Comparison Parameter | nvshmem_float_put on 1 thread | nvshmem_float_put_block on 1 thread block (128 threads) | nvshmem_float_put/p on 128 threads | Recommendation |
| --- | --- | --- | --- | --- |
| Ownership of GPU threads | NVSHMEM uses exactly 1 thread; the remaining threads are available to the user for other operations in the kernel | NVSHMEM uses all 128 threads; no threads are available to the user for other operations in the kernel | NVSHMEM uses all 128 threads; no threads are available to the user for other operations in the kernel | Prefer the 1-thread put over the other two APIs when maximizing GPU thread availability for non-communication work in the kernel |
| Number of send operations | 1 x 512B send (the whole 128-element chunk as one message) | 1 x 512B send (the whole 128-element chunk as one message) | 128 x 4B sends, with parallel submission of commands to one or more QPs (as defined by <CONN>_MAP_BY, where CONN is DCI or RC) | Prefer the 1-thread or 1-thread-block put over the multi-threaded put to minimize the number of send ops |

Applications can also modify the mapping between QPs and GPU threads, and tune for better performance, using the NVSHMEM_IBGDA_DCI_MAP_BY or NVSHMEM_IBGDA_RC_MAP_BY environment variables at runtime.
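As an illustrative IBGDA sketch (the kernel name, buffer layout, and one-warp-per-peer partitioning are assumptions, not part of NVSHMEM), a few cooperative warp-scoped puts keep the number of sends small while still giving the transport independent work to spread across QPs; the QP-to-thread mapping itself is tuned only through the environment variables above, not in code:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// remote is assumed to be a symmetric buffer with one chunk-sized slot
// per PE; local holds one outgoing chunk per destination PE. Launch with
// at least npes warps.
__global__ void scatter_chunks(float *remote, const float *local,
                               size_t chunk, int npes, int my_pe) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    if (warp >= npes || warp == my_pe)
        return;                        // one warp per destination PE

    // One warp-scoped put per peer: a few large sends rather than many
    // 4B nvshmem_float_p sends. With IBGDA, these sends are issued from
    // the GPU onto the QPs selected by the runtime mapping.
    nvshmemx_float_put_warp(remote + my_pe * chunk,
                            local + warp * chunk, chunk, warp);
}
```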

On-stream APIs#

This section describes some key on-stream API considerations when developing applications using the NVSHMEM runtime.

On-stream APIs are provided as an extension that lets users invoke device-side operations from the host, and the following API features can be uniquely useful:

  • When available, NVSHMEM will use the NVIDIA Collective Communication Library (NCCL) for on-stream collectives. These collectives are highly optimized for host submission and are recommended for applications that compose existing NCCL-optimized workloads with point-to-point NVSHMEM operations.

  • Blocking and non-blocking variants of on-stream APIs are available. All on-stream APIs are asynchronous with respect to the host, but the non-blocking variants are enqueued on an internal stream and the execution order is controlled using cudaEvent. Stream order is guaranteed in both cases.

  • When issuing multiple operations in a kernel, avoid the device-side nvshmem_quiet API. Instead, call nvshmemx_quiet_on_stream after the kernel, as shown in the sketch below. This avoids an expensive grid synchronization inside the kernel.
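A minimal sketch of this pattern (the kernel and wrapper names are illustrative; remote is assumed to be a symmetric buffer): the kernel issues a non-blocking put and returns, and completion is enforced on the stream with nvshmemx_quiet_on_stream rather than with a device-side nvshmem_quiet.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Device side: issue the transfer without waiting for completion.
__global__ void push_nbi(float *remote, const float *local,
                         size_t nelems, int peer) {
    if (threadIdx.x == 0 && blockIdx.x == 0)
        nvshmem_float_put_nbi(remote, local, nelems, peer);
}

// Host side: order completion on the stream rather than inside the
// kernel, avoiding a grid-wide synchronization.
void push_and_complete(float *remote, const float *local, size_t nelems,
                       int peer, cudaStream_t stream) {
    push_nbi<<<1, 128, 0, stream>>>(remote, local, nelems, peer);
    nvshmemx_quiet_on_stream(stream);  // puts are complete, in stream order, after this point
}
```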

Host APIs#

This section describes some key host API considerations when developing applications using the NVSHMEM runtime.

Host APIs are provided for completeness. They offer the same memory abstractions as the device and on-stream APIs, but they do not provide any performance benefit over issuing cudaMemcpyAsync calls, or the corresponding network operations, directly from the application. It might seem that the host APIs would provide better submission performance on proxied transports, but NVSHMEM only supports the NVSHMEM_THREAD_SERIALIZED mode for host APIs, which means that an application can drive traffic from only one host thread at a time unless it synchronizes explicitly.
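As a sketch of that serialized-threading constraint (the lock and wrapper function are application-level constructs, not part of NVSHMEM; remote_sym is assumed to be a symmetric buffer and local a buffer the host API can read), a multi-threaded host application guards its NVSHMEM host calls so that only one thread is inside the library at a time:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>
#include <mutex>

// Application-level lock: NVSHMEM_THREAD_SERIALIZED allows only one host
// thread inside NVSHMEM at a time, so concurrent callers must serialize.
static std::mutex nvshmem_host_lock;

void host_put(float *remote_sym, const float *local, size_t nelems, int peer) {
    std::lock_guard<std::mutex> guard(nvshmem_host_lock);  // serialize callers
    nvshmem_float_put(remote_sym, local, nelems, peer);    // host-side put
    nvshmem_quiet();                                        // complete the put
}
```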