PGAS Shared Memory Access Overview
The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.
The SHMEM parallel programming library is an easy-to-use programming model which uses highly efficient one-sided communication APIs to provide an intuitive global-view interface to shared or distributed memory systems. SHMEM's capabilities provide an excellent low-level interface for PGAS applications.
A SHMEM program follows the single program, multiple data (SPMD) style. All the SHMEM processes, referred to as processing elements (PEs), start simultaneously and run the same program. Commonly, the PEs perform computation on their own sub-domains of the larger problem and periodically communicate with other PEs to exchange information on which the next computation phase depends.
The SHMEM routines minimize the overhead associated with data transfer requests, maximize bandwidth, and minimize data latency (the period of time that starts when a PE initiates a transfer of data and ends when a PE can use the data).
SHMEM routines support remote data transfer through:
“put” operations - data transfer to a different PE
“get” operations - data transfer from a different PE
remote pointers, allowing direct references to data objects owned by another PE
Additional supported operations are collective broadcast and reduction, barrier synchronization, and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object.
SHMEM libraries implement one-sided messaging. The sending of data involves only one CPU: the source processor puts the data directly into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CPU. The remote processor is unaware that its memory has been read or written unless the programmer implements a mechanism to accomplish this.
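To make the put/get and atomic operations described above concrete, the following is a minimal, hedged sketch (not taken from the HPC-X distribution) of a SHMEM program that performs a one-sided put to a ring neighbor and an atomic fetch-and-increment on PE 0. It assumes the OpenSHMEM 1.4 API (shmem_init, shmem_long_put, shmem_long_atomic_fetch_inc); compile with oshcc and launch with oshrun.

```c
/* one_sided.c - sketch of SHMEM put and atomic operations
 * (assumes the OpenSHMEM 1.4 API; hypothetical example) */
#include <stdio.h>
#include <shmem.h>

/* global (static) variables are symmetric: every PE owns a copy */
static long dst = -1;
static long counter = 0;

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long src   = me;                  /* local source buffer */
    int  right = (me + 1) % npes;     /* ring neighbor */

    /* "put": write our value into the neighbor's copy of dst;
       the remote CPU is not involved in the transfer */
    shmem_long_put(&dst, &src, 1, right);

    /* atomic fetch-and-increment on PE 0's counter */
    long old = shmem_long_atomic_fetch_inc(&counter, 0);

    shmem_barrier_all();              /* complete all transfers */
    printf("PE %d: dst=%ld, fetched counter value %ld\n", me, dst, old);

    shmem_finalize();
    return 0;
}
```

After the barrier, each PE's dst holds the rank of its left neighbor, and the fetched counter values on the PEs form a permutation of 0..npes-1.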
HPC-X Open MPI/OpenSHMEM programming library is a one-sided communications library that supports a unique set of parallel programming features including point-to-point and collective routines, synchronizations, atomic operations, and a shared memory paradigm used between the processes of a parallel programming application.
HPC-X OpenSHMEM is based on the API defined by the OpenSHMEM.org consortium. The library works with the OpenFabrics RDMA for Linux stack (OFED), and also has the ability to utilize UCX (Unified Communication - X) and HCOLL, providing an unprecedented level of scalability for SHMEM programs running over InfiniBand.
Running HPC-X OpenSHMEM with UCX
Unified Communication - X Framework (UCX) is a new acceleration library, integrated into Open MPI (as a pml layer) and into OpenSHMEM (as an spml layer), and available as part of HPC-X. It is an open source communication library designed to achieve the highest performance for HPC applications. UCX has a broad range of optimizations for achieving low software overheads in the communication path, which allows near native-level performance.
UCX supports receive side tag matching, one-sided communication semantics, efficient memory registration and a variety of enhancements which increase the scalability and performance of HPC applications significantly.
UCX supports the following transports:
InfiniBand transports:
Unreliable Datagram (UD)
Reliable connected (RC)
Dynamically Connected (DC)
Note: DC is supported on Connect-IB®/ConnectX®-4 and above HCAs with MLNX_OFED v2.1-1.0.0 and higher.
Accelerated verbs
Shared Memory communication with support for KNEM, CMA and XPMEM
RoCE
TCP
For further information on UCX, please refer to: https://github.com/openucx/ucx and http://www.openucx.org/
Enabling UCX for HPC-X OpenSHMEM Jobs
UCX is the default spml starting from HPC-X v2.1. For older versions of HPC-X, add the following MCA parameter to the oshrun command line:
-mca spml ucx
All the UCX environment parameters can be used in the same way with oshrun, as well as with mpirun. For the complete list of the UCX environment parameters, please run:
$HPCX_UCX_DIR/bin/ucx_info -f
Developing Applications Using HPC-X OpenSHMEM Together with MPI
The SHMEM programming model can provide a means to improve the performance of latency-sensitive sections of an application. Commonly, this requires replacing MPI send/recv calls with shmem_put/shmem_get and shmem_barrier calls. The SHMEM programming model can deliver significantly lower latencies for short messages than traditional MPI calls. The MPI-2 MPI_Put/MPI_Get functions can also be considered as an alternative to shmem_put/shmem_get calls.
An example of MPI-SHMEM mixed code:
/* example.c */
#include <stdlib.h>
#include <stdio.h>
#include "shmem.h"
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    start_pes(0);
    {
        int version = 0;
        int subversion = 0;
        int num_proc = 0;
        int my_proc = 0;
        int comm_size = 0;
        int comm_rank = 0;

        MPI_Get_version(&version, &subversion);
        fprintf(stdout, "MPI version: %d.%d\n", version, subversion);

        num_proc = _num_pes();
        my_proc = _my_pe();
        fprintf(stdout, "PE#%d of %d\n", my_proc, num_proc);

        MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
        fprintf(stdout, "Comm rank#%d of %d\n", comm_rank, comm_size);
    }
    MPI_Finalize();
    return 0;
}
HPC-X® OpenSHMEM Tunable Parameters
HPC-X® OpenSHMEM uses Modular Component Architecture (MCA) parameters to provide a way to tune your runtime environment. Each parameter corresponds to a specific function. The following are the parameters whose values you can change to tune the application's behavior:
memheap - controls memory allocation policy and thresholds
scoll - controls HPC-X OpenSHMEM collective API threshold and algorithms
spml - controls HPC-X OpenSHMEM point-to-point transport logic and thresholds
atomic - controls HPC-X OpenSHMEM atomic operations logic and thresholds
shmem - controls general HPC-X OpenSHMEM API behavior
To display HPC-X OpenSHMEM parameters:
Print all available parameters. Run:
% oshmem_info -a
Print HPC-X OpenSHMEM specific parameters. Run:
% oshmem_info --param shmem all
% oshmem_info --param memheap all
% oshmem_info --param scoll all
% oshmem_info --param spml all
% oshmem_info --param atomic all
Note: It is required to run drop_caches on all test machines before running an OpenSHMEM application and/or benchmarks in order to free memory:
echo 3 > /proc/sys/vm/drop_caches
OpenSHMEM MCA Parameters for Symmetric Heap Allocation
SHMEM memheap size can be modified by adding the SHMEM_SYMMETRIC_HEAP_SIZE parameter to the oshrun command line. The default heap size is 256M.
To run SHMEM with a memheap size of 64M, run:
% oshrun -x SHMEM_SYMMETRIC_HEAP_SIZE=64M -np 512 -mca mpi_paffinity_alone 1 --map-by node -display-map -hostfile myhostfile example.exe
Memheap can be allocated with the following methods:
sysv - system V shared memory API. Allocation with hugepages is currently not supported
verbs - IB verbs allocator is used
mmap - mmap() is used to allocate memory
ucx - used to allocate and register memory via the UCX library
By default, HPC-X OpenSHMEM will try to find the best possible allocator. The priority is: verbs, sysv, mmap, and ucx. It is possible to choose a specific memheap allocation method by running -mca sshmem <name>
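To illustrate what the symmetric heap is used for, here is a hedged C sketch (standard OpenSHMEM 1.x API, not HPC-X specific): shmem_malloc allocates from the memheap whose total size SHMEM_SYMMETRIC_HEAP_SIZE controls, regardless of which sshmem allocator backs it.

```c
/* heap.c - sketch: allocating from the symmetric heap (OpenSHMEM 1.x API) */
#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    /* shmem_malloc is collective: every PE requests the same size, and the
       buffer lands at a symmetric address inside the memheap (whose total
       size is bounded by SHMEM_SYMMETRIC_HEAP_SIZE) */
    long *buf = shmem_malloc(1024 * sizeof(long));
    if (buf == NULL) {
        fprintf(stderr, "PE %d: symmetric heap exhausted\n", me);
        shmem_global_exit(1);
    }

    buf[0] = me;
    shmem_barrier_all();

    shmem_free(buf);     /* freeing is also collective */
    shmem_finalize();
    return 0;
}
```

If the allocation exceeds the configured heap size, shmem_malloc returns NULL, which is why increasing SHMEM_SYMMETRIC_HEAP_SIZE is the usual remedy for symmetric-heap exhaustion.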
Parameters Used to Force Connection Creation
Commonly, SHMEM creates connections between PEs lazily, that is, upon the first traffic between them.
To force connection creation during startup:
Set the following MCA parameter.
-mca shmem_preconnect_all 1
Memory registration information (e.g., InfiniBand rkeys) is exchanged between ranks during startup.
Set the following MCA parameter.
-mca shmalloc_use_modex 0
OpenSHMEM MCA Parameters for shmem_quiet, shmem_fence and shmem_barrier_all
The default synchronization algorithms of OSHMEM may be tuned using the spml_ucx_strong_sync parameter:
0 - don't do strong synchronization (default)
1 - use non-blocking get
2 - use blocking get
3 - use flush operation
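For context on what these synchronization routines guarantee, the following hedged sketch (standard OpenSHMEM semantics, independent of the spml_ucx_strong_sync setting) contrasts shmem_fence, which orders puts to the same PE, with shmem_quiet, which completes all outstanding puts:

```c
/* ordering.c - sketch of shmem_fence vs. shmem_quiet (OpenSHMEM 1.x API) */
#include <shmem.h>

static long data = 0;
static long flag = 0;

/* hypothetical helper: deliver a value to 'pe' and then raise its flag */
void signal_neighbor(int pe)
{
    long value = 42, one = 1;

    shmem_long_put(&data, &value, 1, pe);
    shmem_fence();                       /* puts to the SAME pe stay ordered */
    shmem_long_put(&flag, &one, 1, pe);  /* flag cannot overtake data */

    shmem_quiet();                       /* all outstanding puts, to EVERY PE,
                                            are complete and remotely visible
                                            once quiet returns */
}
```

shmem_barrier_all implies a quiet plus process synchronization, which is why its cost is the one most affected by the strong-sync algorithm choice above.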
The procedures described below apply only to users using MLNX_OFED 1.5.3-3.0.0.
When using MLNX_OFED 1.5.3-3.0.0, it is recommended to change the MTU to 4k. Whereas in MLNX_OFED 3.1-x.x.x and above, the MTU is already set by default to 4k.
To check the current MTU support of an InfiniBand port, use the smpquery tool:
# smpquery -D PortInfo 0 1 | grep -i mtu
If the MtuCap value is lower than 4K, set it to 4K.
Assuming the firmware is configured to support 4K MTU, the actual MTU capability is further limited by the mlx4 driver parameter.
To further tune it:
Set the set_4k_mtu mlx4 driver parameter to 1 on all the cluster machines. For instance:
# echo "options mlx4_core set_4k_mtu=1" >> /etc/modprobe.d/mofed.conf
Restart openibd:
# service openibd restart
To verify that the parameter was set:
# cat /sys/module/mlx4_core/parameters/set_4k_mtu
To check whether the port was brought up with 4K MTU this time, use the smpquery tool again.
HPC Applications on Intel Sandy Bridge Machines
Intel Sandy Bridge machines have a NUMA-related hardware limitation which affects the performance of HPC jobs that utilize all node sockets. When installing MLNX_OFED 3.1-x.x.x, an automatic workaround is activated upon Sandy Bridge machine detection, and the following message is printed to the job's standard output device: “mlx4: Sandy Bridge CPU was detected”
Set the SHELL environment variable before launching the HPC application. Run:
% export MLX4_STALL_CQ_POLL=0
% oshrun <...>
OR
% oshrun -x MLX4_STALL_CQ_POLL=0 <other params>