NVSHMEM Performance
This section describes key performance considerations for applications developed with the NVSHMEM runtime.
Using 16-Byte-Aligned Buffers in the Application
Applications should prefer GPU buffers that are 16-byte aligned because, internally, the NVSHMEM runtime tries to complete data transfers using 16-byte writes, which are more efficient than 8-byte or unaligned writes.
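As a minimal sketch of how an application might keep its buffers on 16-byte boundaries, the host-side helpers below check a pointer's alignment and pad sizes and offsets to multiples of 16. The names `is_aligned_16`, `round_up_16`, and `slot_offset` are hypothetical and are not part of NVSHMEM. The sketch assumes the base pointer of the symmetric allocation is itself 16-byte aligned, which can be verified with `is_aligned_16` after allocation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; none of these are NVSHMEM APIs.
constexpr std::size_t kAlign = 16;

// Returns true when ptr sits on a 16-byte boundary.
inline bool is_aligned_16(const void *ptr) {
    return (reinterpret_cast<std::uintptr_t>(ptr) % kAlign) == 0;
}

// Rounds a byte count or byte offset up to the next multiple of 16.
constexpr std::size_t round_up_16(std::size_t n) {
    return (n + kAlign - 1) / kAlign * kAlign;
}

// Byte offset of slot i when one large buffer is carved into slots of
// slot_bytes payload each. Every slot is padded to a multiple of 16, so
// each slot starts 16-byte aligned as long as the base pointer is aligned.
constexpr std::size_t slot_offset(std::size_t i, std::size_t slot_bytes) {
    return i * round_up_16(slot_bytes);
}
```

Padding matters mostly when one large symmetric buffer is sub-allocated by the application: an odd-sized slot would otherwise push every later slot off the 16-byte boundary.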
Using nvshmem_*_block for Heterogeneous Transport Configurations
If the application's control flow allows it, prefer collective calls such as nvshmem_<OPS>_block. These calls provide better performance portability between remote transports and communication over NVIDIA® NVLink® because NVLink benefits from using multiple threads to drive higher transfer throughput.
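The following is a minimal sketch of a thread-block collective put, not a definitive implementation. It assumes the block-scoped variants are exposed under the `nvshmemx_` prefix (here `nvshmemx_float_put_block`), that each PE drives one GPU selected by its node-local rank, and that a ring-neighbor exchange is representative of the application. The kernel name `block_put`, the buffer size, and the launch configuration are illustrative choices; confirm the exact function names against the NVSHMEM API reference for the installed version.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// All threads of the block must call the collective put with the same
// arguments; the runtime can then spread the copy across the threads.
__global__ void block_put(float *dest, const float *src,
                          size_t nelems, int peer) {
    nvshmemx_float_put_block(dest, src, nelems, peer);
}

int main() {
    nvshmem_init();

    // Assumption: one GPU per PE, chosen by the PE's rank within its node.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Symmetric buffers: every PE allocates the same size.
    const size_t nelems = 1 << 20;  // illustrative size
    float *src  = (float *)nvshmem_malloc(nelems * sizeof(float));
    float *dest = (float *)nvshmem_malloc(nelems * sizeof(float));
    cudaMemsetAsync(src, 0, nelems * sizeof(float), stream);

    // Send to the next PE in a ring.
    int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();

    // One block of 256 threads cooperates on a single transfer.
    block_put<<<1, 256, 0, stream>>>(dest, src, nelems, peer);

    // Barrier on the stream so all PEs finish their puts before
    // anyone reads dest or frees the buffers.
    nvshmemx_barrier_all_on_stream(stream);
    cudaStreamSynchronize(stream);

    nvshmem_free(src);
    nvshmem_free(dest);
    cudaStreamDestroy(stream);
    nvshmem_finalize();
    return 0;
}
```

The same kernel can be used whether the peer is reached over NVLink or over a remote transport, which is the performance-portability point made above: the application does not need a separate code path per transport.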