ARM Memory Ordering

Memory ordering describes the order in which a CPU performs its memory accesses and the order in which other CPUs can observe those accesses. When the compiler or the CPU reorders memory accesses, it guarantees that the result of a single-threaded program does not change. Multithreaded applications, however, must explicitly use memory barriers to ensure correct operation.

Code for x86_64 must also keep memory ordering in mind. However, some reorderings that cannot happen on x86_64 can happen on ARM. This is why multithreaded code ported from x86_64 to another architecture can show race conditions and other multithreading bugs that did not manifest on x86_64.
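A typical case is message passing: one thread writes a payload and then sets a flag, while another thread waits for the flag and then reads the payload. The following is a minimal C++11 sketch rather than a definitive implementation. All names (payload, ready, producer, consumer, run_message_passing) are hypothetical and chosen for illustration. It uses release/acquire ordering so that the pattern is correct on both x86_64 and ARM:

```cpp
#include <atomic>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                 // plain data handed from producer to consumer
std::atomic<bool> ready{false};  // flag that publishes the payload

void producer() {
    payload = 42;
    // Release: the write to payload cannot be reordered after this store.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire: the read of payload cannot be reordered before this load.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer and one consumer and returns the value the consumer saw.
int run_message_passing() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    reader.join();
    writer.join();
    return seen;
}
```

If both accesses to the flag used std::memory_order_relaxed instead, the program would contain a data race. Such code often appears to work on x86_64, because that architecture does not reorder stores with other stores or loads with other loads. On ARM the consumer could observe the flag before the payload.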

Explicit memory barriers are provided by the compiler, for example the __sync_synchronize builtin in GCC.
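As a sketch of how such a barrier might be used, the example below places __sync_synchronize between a data write and a flag write, and again between the flag read and the data read. The names (data_word, flag_word, publish, wait_and_read, run_barrier_demo) are hypothetical. It assumes GCC or a compatible compiler such as Clang, since __sync_synchronize is a compiler builtin and not standard C++. The portable C++11 counterpart is std::atomic_thread_fence(std::memory_order_seq_cst).

```cpp
#include <atomic>
#include <thread>

// Hypothetical names, for illustration only.
int data_word = 0;              // plain data protected by the barriers below
std::atomic<int> flag_word{0};  // relaxed flag; ordering comes from the barriers

void publish(int value) {
    data_word = value;
    __sync_synchronize();  // full barrier: data write is ordered before flag write
    flag_word.store(1, std::memory_order_relaxed);
}

int wait_and_read() {
    while (flag_word.load(std::memory_order_relaxed) == 0) {
        std::this_thread::yield();
    }
    __sync_synchronize();  // full barrier: flag read is ordered before data read
    return data_word;
}

// Publishes a value from one thread and returns what the other thread read.
int run_barrier_demo(int value) {
    data_word = 0;
    flag_word.store(0);
    int seen = -1;
    std::thread reader([&seen] { seen = wait_and_read(); });
    std::thread writer([value] { publish(value); });
    reader.join();
    writer.join();
    return seen;
}
```

A full barrier is stronger, and usually more expensive, than the ordering this pattern strictly needs. Release and acquire operations on the flag are often sufficient.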

NVIDIA DGX Spark supports the Large System Extensions (LSE), which were first introduced in Armv8.1. LSE provides low-cost atomic operations that can improve system throughput for thread communication, locks, and mutexes. On recent Arm64 CPUs, using LSE atomics instead of load/store exclusives can improve performance by up to an order of magnitude.

When building an application from source, the compiler needs to generate LSE atomic instructions for code that uses atomic operations. For example, the code of databases such as PostgreSQL contains atomic constructs, and C++11 code that uses std::atomic translates into atomic operations. Since GCC 9.4, the -moutline-atomics or -march=armv8-a+lse flag enables LSE instructions; if the target architecture is -march=armv8.1-a or higher, LSE is included directly. To confirm that LSE instructions were generated, count them in the disassembly produced by the objdump command-line utility:
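To illustrate the kind of source code this affects, here is a small C++11 sketch of a shared counter incremented from several threads. The names (counter, count_in_parallel) are hypothetical. The expectation, stated as an assumption and not a guarantee for every compiler version, is that the fetch_add call becomes an LSE instruction from the ldadd family when LSE is enabled, and a load/store-exclusive retry loop otherwise.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical names, for illustration only.
std::atomic<long> counter{0};

// Increments the shared counter from num_threads threads and returns the total.
long count_in_parallel(int num_threads, long increments_per_thread) {
    counter.store(0);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([increments_per_thread] {
            for (long j = 0; j < increments_per_thread; ++j) {
                // Atomic read-modify-write; a candidate for an LSE instruction.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.load();
}
```

Building a program that contains this function once with -march=armv8-a and once with -march=armv8.1-a, then inspecting both binaries with objdump, is one way to see the difference in the generated instructions.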

$ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l

To check whether the application binary contains load and store exclusives, run the following command:

$ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l

See also: