NVSHMEM Performance

This section describes key performance considerations for applications developed with the NVSHMEM runtime.

Using 16-Byte-Aligned Buffers in the Application

The application should prefer to allocate GPU buffers that are 16-byte aligned because, internally, the NVSHMEM runtime tries to complete data transfers using 16-byte writes, which are more efficient than 8-byte or unaligned writes.
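As an illustration of the alignment guidance above, the following is a minimal sketch of two helpers an application might use when carving sub-buffers out of a larger allocation. The names is_aligned16 and round_up16 are hypothetical and are not part of the NVSHMEM API. The sketch assumes that the base allocation is already 16-byte aligned, so only offsets into it need to be rounded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only; not part of NVSHMEM.
constexpr std::size_t kAlign = 16;  // alignment targeted by 16-byte writes

// Returns true if the address is a multiple of 16 bytes.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}

// Rounds a byte offset or size up to the next multiple of 16.
constexpr std::size_t round_up16(std::size_t n) {
    return (n + kAlign - 1) / kAlign * kAlign;
}

// Compile-time checks of the rounding arithmetic.
static_assert(round_up16(0) == 0, "zero stays zero");
static_assert(round_up16(16) == 16, "multiples of 16 are unchanged");
static_assert(round_up16(20) == 32, "20 rounds up to 32");
```

For example, if a 20-byte header precedes a payload in the same buffer, placing the payload at offset round_up16(20), that is, offset 32, keeps the payload 16-byte aligned, and is_aligned16 can be used in a debug assertion before the pointer is passed to a transfer call.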

Using nvshmem_*block for Heterogeneous Transport Configurations

If the application's control flow allows all threads of a thread block to make the call together, it should use thread-block collective calls such as nvshmem_<OPS>_block. These calls provide better performance portability between remote transports and communication over NVIDIA® NVLink®, because NVLink benefits from multiple threads driving the transfer to reach higher throughput.