PGAS Shared Memory Access Overview

NVIDIA HPC-X Software Toolkit Rev 2.16.2

The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.

The SHMEM parallel programming library is an easy-to-use programming model which uses highly efficient one-sided communication APIs to provide an intuitive global-view interface to shared or distributed memory systems. SHMEM's capabilities provide an excellent low-level interface for PGAS applications.

A SHMEM program is of a single program, multiple data (SPMD) style. All the SHMEM processes, referred to as processing elements (PEs), start simultaneously and run the same program. Commonly, the PEs perform computation on their own sub-domains of the larger problem, and periodically communicate with other PEs to exchange information on which the next communication phase depends.

The SHMEM routines minimize the overhead associated with data transfer requests, maximize bandwidth, and minimize data latency (the period of time that starts when a PE initiates a transfer of data and ends when a PE can use the data).

SHMEM routines support remote data transfer through:

  • “put” operations - data transfer to a different PE

  • “get” operations - data transfer from a different PE, and remote pointers, allowing direct references to data objects owned by another PE

Additional supported operations are collective broadcast and reduction, barrier synchronization, and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object.

SHMEM libraries implement active messaging. The sending of data involves only one CPU where the source processor puts the data into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CPU. The remote processor is unaware that its memory has been read or written unless the programmer implements a mechanism to accomplish this.

HPC-X Open MPI/OpenSHMEM programming library is a one-side communications library that supports a unique set of parallel programming features including point-to-point and collective routines, synchronizations, atomic operations, and a shared memory paradigm used between the processes of a parallel programming application.

HPC-X OpenSHMEM is based on the API defined by the OpenSHMEM.org consortium. The library works with the OpenFabrics RDMA for Linux stack (OFED), and also has the ability to utilize UCX (Unified Communication - X) and HCOLL, providing an unprecedented level of scalability for SHMEM programs running over InfiniBand.

Running HPC-X OpenSHMEM with UCX

Unified Communication - X Framework (UCX) is a new acceleration library, integrated into the Open MPI (as a pml layer) and to OpenSHMEM (as an spml layer) and available as part of HPC-X. It is an open source communication library designed to achieve the highest performance for HPC applications. UCX has a broad range of optimizations for achieving low-software overheads in communication path which allow near native-level performance.

UCX supports receive side tag matching, one-sided communication semantics, efficient memory registration and a variety of enhancements which increase the scalability and performance of HPC applications significantly.

UCX supports the following transports:

  • InfiniBand transports:

    • Unreliable Datagram (UD)

    • Reliable connected (RC)

    • Dynamically Connected (DC)

      Warning

      DC is supported on Connect-IB®/ConnectX®-4 and above HCAs with MLNX_OFED v2.1-1.0.0 and higher.

    • Accelerated verbs

  • Shared Memory communication with support for KNEM, CMA and XPMEM

  • RoCE

  • TCP

For further information on UCX, please refer to: https://github.com/openucx/ucx and http://www.openucx.org/

Enabling UCX for HPC-X OpenSHMEM Jobs

UCX is the default spml starting from HPC-X v2.1. For older versions of HPC-X, add the following MCA parameter to the oshrun command line:

Copy
Copied!
            

-mca spml ucx

All the UCX environment parameters can be used in the same way with oshrun, as well as with mpirun. For the complete list of the UCX environment parameters, please run:

Copy
Copied!
            

$HPCX_UCX_DIR/bin/ucx_info -f

Developing Application using HPC-X OpenSHMEM together with MPI

The SHMEM programming model can provide a means to improve the performance of latency-sensitive sections of an application. Commonly, this requires replacing MPI send/recv calls with shmem_put/ shmem_get and shmem_barrier calls. The SHMEM programming model can deliver significantly lower latencies for short messages than traditional MPI calls. An alternative to shmem_get /shmem_put calls can also be considered the MPI-2 MPI_Put/ MPI_Get functions.

An example of MPI-SHMEM mixed code:

Copy
Copied!
            

/* example.c */   #include <stdlib.h> #include <stdio.h> #include "shmem.h" #include "mpi.h" int main(int argc, char *argv[]) { MPI_Init(&argc, &argv); start_pes(0);   { int version = 0; int subversion = 0; int num_proc = 0; int my_proc = 0; int comm_size = 0; int comm_rank = 0; MPI_Get_version(&version, &subversion); fprintf(stdout, "MPI version: %d.%d\n", version, subversion); num_proc = _num_pes(); my_proc = _my_pe(); fprintf(stdout, "PE#%d of %d\n", my_proc, num_proc); MPI_Comm_size(MPI_COMM_WORLD, &comm_size); MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank); fprintf(stdout, "Comm rank#%d of %d\n", comm_rank, comm_size); } return 0; }


HPC-X® OpenSHMEM Tunable Parameters

HPC-X® OpenSHMEM uses Modular Component Architecture (MCA) parameters to provide a way to tune your runtime environment. Each parameter corresponds to a specific function. The following are parameters that you can change their values to change the application’s function:

  • memheap - controls memory allocation policy and thresholds

  • scoll - controls HPC-X OpenSHMEM collective API threshold and algorithms

  • spml - controls HPC-X OpenSHMEM point-to-point transport logic and thresholds

  • atomic - controls HPC-X OpenSHMEM atomic operations logic and thresholds

  • shmem - controls general HPC-X OpenSHMEM API behavior

Procedure_Heading_Icon.PNG

To display HPC-X OpenSHMEM parameters:

  1. Print all available parameters. Run:

    Copy
    Copied!
                

    % oshmem_info -a

  2. Print HPC-X OpenSHMEM specific parameters. Run:

    Copy
    Copied!
                

    % oshmem_info --param shmem all % oshmem_info --param memheap all % oshmem_info --param scoll all % oshmem_info --param spml all % oshmem_info --param atomic all

    Warning

    It is required to drop_caches on all test machines before running OpenSHMEM application and/or benchmarks in order to free memory:

    echo 3 > /proc/sys/vm/drop_caches

OpenSHMEM MCA Parameters for Symmetric Heap Allocation

SHMEM memheap size can be modified by adding the SHMEM_SYMMETRIC_HEAP_SIZE parameter to the oshrun file. The default heap size is 256M.

Procedure_Heading_Icon.PNG

To run SHMEM with memheap size of 64M. Run:

Copy
Copied!
            

% oshrun -x SHMEM_SYMMETRIC_HEAP_SIZE=64M -np 512 -mca mpi_paffinity_alone 1 --map-by node -display-map -hostfile myhostfile example.exe

Memheap can be allocated with the following methods:

  • sysv - system V shared memory API. Allocation with hugepages is curently not supportedMemheap can be allocated with the following methods:

  • verbs - IB verbs allocator is used

  • mmap - mmap() is used to allocate memory

  • ucx - used to allocate and register memory via the UCX library

By default HPC-X OpenSHMEM will try a to find the best possible allocator. The priority is verbs, sysv, mmap and ucx. It is possible to choose a specific memheap allocation method by running -mca sshmem <name>

Parameters Used to Force Connection Creation

Commonly, SHMEM creates connection between PE lazily. That is at the sign of the first traffic.

To force connection creating during startup:

Procedure_Heading_Icon.PNG

  • Set the following MCA parameter.

    Copy
    Copied!
                

    mca shmem_preconnect_all 1

    Memory registration (ex: infiniband rkeys) information is exchanged between ranks during startup.

Procedure_Heading_Icon.PNG
To enable on-demand memory key exchange:

Set the following MCA parameter.

Copy
Copied!
            

mca shmalloc_use_modex 0


OpenSHMEM MCA Parameters for shmem_quiet, shmem_fence and shmem_barrier_all

Default synchronization algorithms of OSHMEM may be tuned by spml_ucx_strong_sync parameter:

0 - don't do strong synchronization (default)

1 - use non-blocking get

2 - use blocking get

3 - use flush operation

Warning

The procedures described below apply to user using MLNX_OFED 1.5.3.-3.0.0 only.

When using MLNX_OFED 1.5.3-3.0.0, it is recommended to change the MTU to 4k. Whereas in MLNX_OFED 3.1-x.x.x and above, the MTU is already set by default to 4k.

To check the current MTU support of an InfiniBand port, use

Procedure_Heading_Icon.PNG

the smpquery tool:

Copy
Copied!
            

# smpquery -D PortInfo 0 1 | grep -i mtu

If the MtuCap value is lower than 4K, enable it to 4K.

Assuming the firmware is configured to support 4K MTU, the actual MTU capability is further limited by the mlx4 driver parameter.

Procedure_Heading_Icon.PNG

To further tune it:

  1. Set the set_4k_mtu mlx4 driver parameter to 1 on all the cluster machines. For instance:

    Copy
    Copied!
                

    # echo "options mlx4_core set_4k_mtu=1" >> /etc/modprobe.d/mofed.conf

  2. Restart openibd.

    Copy
    Copied!
                

    # service openibd restart

Copy
Copied!
            

# cat /sys/module/mlx4_core/parameters/set_4k_mtu

To check whether the port was brought up with 4K MTU this time, use the smpquery tool again.

HPC Applications on Intel Sandy Bridge Machines

Intel Sandy Bridge machines have NUMA hardware related limitation which affects the performance of HPC jobs utilizing all node sockets. When installing MLNX_OFED 3.1-x.x.x, an automatic workaround is activated upon Sandy Bridge machine detection, and the following message is printed in the job`s standard output device: “mlx4: Sandy Bridge CPU was detected”

  • Set the SHELL environment variable before launching HPC application. Run:

    Copy
    Copied!
                

    % export MLX4_STALL_CQ_POLL=0 %oshrun <...>

    OR

    Copy
    Copied!
                

    oshrun -x MLX4_STALL_CQ_POLL=0 <other params>

© Copyright 2023, NVIDIA. Last updated on Oct 5, 2023.