NVSHMEM Performance

This section describes some key performance considerations when developing applications using the NVSHMEM runtime.

Using 16-Byte Aligned Buffers in the Application

Applications should prefer allocating GPU buffers that are 16-byte aligned because, internally, the NVSHMEM runtime will try to complete the data transfer using 16-byte writes, which are more efficient than 8-byte or unaligned writes.

Using nvshmem_*_block for Heterogeneous Transport Configurations

If the application's control flow permits, it should use collective calls such as nvshmem_<OPS>_block. These calls provide better performance portability between remote transports and communication over NVIDIA (R) NVLink (R) because NVLink benefits from using multiple threads to drive higher transfer throughput.
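As an illustrative sketch (the buffer names and peer argument are hypothetical), a kernel using a block-scoped put might look like the following; note that the block-scoped variants live under the nvshmemx_ prefix:

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Sketch: every thread in the block calls the *_block variant, and the
// runtime can use all of them to drive the transfer. Over NVLink this
// yields higher throughput than a single-thread nvshmem_float_put.
// `dst` and `src` are assumed to be symmetric buffers obtained from
// nvshmem_malloc(); `peer` is the destination PE.
__global__ void put_block_example(float *dst, const float *src,
                                  size_t nelems, int peer) {
    nvshmemx_float_put_block(dst, src, nelems, peer);
}
```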

Tuning the Queue-Pair Type and Configuration for IBGDA

The NVSHMEM IBGDA transport is an implementation of the GPUDirect Async Kernel-initiated (GDA-KI) technology for NVIDIA Mellanox ConnectX devices.

GDA-KI offloads the entire process of sending memory over the remote network to the GPU. This transport differs from other transports that rely on passing messages to a CPU proxy thread to initiate the transfer. Submitting the transfer directly from the GPU allows for a highly parallel submission of requests to the network interface. The increased message rate for small messages allows users to saturate the network bandwidth using a much smaller data buffer.

The increased parallelism comes with unique performance characteristics and challenges:

  • A thread on the GPU cannot fill out a work queue entry (WQE) as quickly as a CPU thread. This results in a higher latency for individual messages.

  • Providing a unique queue-pair (QP) for each GPU thread is not always feasible from a resource allocation perspective, and it increases quiet and fence latency. The IBGDA transport provides several tuning parameters to balance the trade-offs between memory consumption and performance.
    The first set of options controls the type of QP used by the transport:
    • NVSHMEM_IBGDA_NUM_DCI: The number of Dynamic Connection Initiators (DCI) to create per PE.

      Dynamically connected QPs allow a DCI to connect to multiple remote Dynamic Connection Targets (DCT). In NVSHMEM, this means that each PE can have a DCI that connects to one or more DCTs on each remote PE. However, the flexibility and memory savings achieved by DC come at the cost of reduced performance relative to a dedicated QP (for example, Reliable Connected, or RC). Each time the DCI switches to a different DCT, the connection must be renegotiated with a half-handshake protocol, which adds latency to the transfer.

      The default value is equal to the number of SMs on the device.

    • NVSHMEM_IBGDA_NUM_DCT: The number of DCTs created by each PE.

      This value should be set to at least 2 to guarantee full use of the underlying NIC's bandwidth.

      The default value is 2.

    • NVSHMEM_IBGDA_NUM_RC_PER_PE: The number of Reliable Connected (RC) QPs on the local PE to create per remote PE.

      RC QPs provide the lowest latency for message transfers. However, the memory overhead of creating a dedicated QP for each remote PE can be prohibitive for large-scale applications. For medium-scale and small-scale jobs, where the total number of connections is not a concern, we recommend that you use RC QPs.

      The default value is 2.

    If users specify both DCI and RC, the transport will use only RC for connections. If neither is specified, the transport will default to using DCI. Both options have different default values based on the mapping type, but users can override them; this is advisable in certain cases, as outlined below.
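As a sketch of the selection rule above (assuming, per the text, that specifying an RC count is what opts the transport into RC connections), the environment might be configured as follows:

```shell
# Prefer dedicated RC QPs: lowest latency, but per-remote-PE memory cost.
export NVSHMEM_IBGDA_NUM_RC_PER_PE=2

# Alternatively, rely on dynamically connected QPs (the default when
# neither option is set) and size the DCI/DCT pools explicitly:
# export NVSHMEM_IBGDA_NUM_DCI=64
# export NVSHMEM_IBGDA_NUM_DCT=2

echo "NVSHMEM_IBGDA_NUM_RC_PER_PE=$NVSHMEM_IBGDA_NUM_RC_PER_PE"
```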

    The second set of options can be combined with the first set to provide better performance for certain workload types or to reduce the total number of QPs required by the host and target. These options come in two flavors that correspond to the two types of QPs.
      • NVSHMEM_IBGDA_DCI_MAP_BY includes the following options:
        • CTA: Select which DCI(s) to use based on the CUDA thread block, which is also known as the Cooperative Thread Array (CTA) that calls the NVSHMEM API.

          A CTA is guaranteed to be scheduled to a Streaming Multiprocessor (SM), but multiple CTAs might exist on the same SM. This option is useful for applications that schedule multiple CTAs per SM. If NVSHMEM_IBGDA_NUM_DCI is not set by users, NVSHMEM will by default create one DCI per SM. When the maximum number of CTAs launched simultaneously is smaller than the total number of SMs, users can instead set NVSHMEM_IBGDA_NUM_DCI to that maximum (Low Resource, Low Synchronization Overhead).

        • SM: This option is similar to the previous option but uses the underlying hardware SM to determine which DCI to use.

          This option is useful for users who schedule work in units that match the size of an SM. Like the option above, if NVSHMEM_IBGDA_NUM_DCI is not set by users, NVSHMEM will by default create one DCI per SM. This option has additional overhead relative to the CTA option: because multiple CTAs can map to the same SM, heavier synchronization is required when submitting operations to the QP. With the CTA mapping, block-scoped atomics can be used to avoid contention on the QP, but in the SM case a device-scoped atomic must be performed, which has greater overhead. For this reason, CTA is the preferred option (Low Resource, High Synchronization Overhead).

        • WARP: This option is useful for applications that drive IO to multiple peers from a smaller subset of CTAs.

          For example, in an all-to-all communication pattern with 32 warps per CTA and 32 peers, users can set the number of DCIs to 32 and use the WARP mapping to ensure that each warp has a dedicated DCI. By mapping this way, users avoid the penalty of serializing submissions to the QP and renegotiating the connection for each warp that submits operations. This option is more memory intensive than the previous options. If NVSHMEM_IBGDA_NUM_DCI is not set by users, NVSHMEM will by default create num_warps_per_sm * num_sms DCIs (High Resource, Low Synchronization Overhead).

        • DCT: If there are sufficient DCIs, this option locks each DCI to one specific DCT.

          This option forces a one-to-one mapping of DCIs to DCTs, which eliminates the need to reestablish a connection between iterations, even if the DCI is used by different warps, CTAs, or SMs over time. However, some of the memory saving benefits of choosing DC over RC are lost. Users can also specify fewer DCIs in NVSHMEM_IBGDA_NUM_DCI than DCTs; in this case, each DCI is locked to a small subset of DCTs, so this option is still more memory efficient than RC, which requires creating and connecting a QP on both ends. This option also requires a device-scoped atomic to guarantee correctness (High Resource, High Synchronization Overhead).

      The default value is CTA.
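The all-to-all example above can be expressed as an environment configuration. The lowercase spelling of the mapping value is an assumption here and should be checked against the runtime's environment-variable reference:

```shell
# All-to-all pattern: 32 warps per CTA, 32 peers. Give each warp its own
# DCI so QP submissions are not serialized across warps.
export NVSHMEM_IBGDA_NUM_DCI=32
export NVSHMEM_IBGDA_DCI_MAP_BY=warp

echo "$NVSHMEM_IBGDA_NUM_DCI DCIs, mapped by $NVSHMEM_IBGDA_DCI_MAP_BY"
```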

      • NVSHMEM_IBGDA_RC_MAP_BY includes the following options:

        For these options, there is no default number of QPs created, so users must specify the number of QPs to create based on the mapping type (RCs have lower latency than DCIs).

        • CTA: This is semantically the same as the corresponding DCI option.

        • SM: This is semantically the same as the corresponding DCI option.

        • WARP: This is semantically the same as the corresponding DCI option.

      The default value is CTA.
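As a sketch, the RC flavor of the same controls might be set as follows (value spellings assumed lowercase, as with the DCI options):

```shell
# RC mapping: size the RC QP count to the mapping in use, since RCs
# have no implicitly computed default count per the note above.
export NVSHMEM_IBGDA_NUM_RC_PER_PE=4
export NVSHMEM_IBGDA_RC_MAP_BY=cta

echo "$NVSHMEM_IBGDA_NUM_RC_PER_PE RC QPs per PE, mapped by $NVSHMEM_IBGDA_RC_MAP_BY"
```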