Atomics Application Binary Interface

The mappings of programming languages’ atomic operations to the PTX ISA must be implemented consistently across all programming languages that may concurrently access shared memory. The mappings of C++11 atomics for the CUDA architecture are proven correct in "A Formal Analysis of the NVIDIA PTX Memory Consistency Model". The PTX ISA provides atomic memory operations and fences for acquire, release, acquire-release, and relaxed C++ memory ordering semantics. The PTX ABI for C++ sequentially consistent atomic operations is as follows:

C, C++, or CUDA C++ API	PTX ABI ISA mapping
`atomic_thread_fence(memory_order_seq_cst, thread_scope_<scope>)`	`fence.sc.<scope>;`
`atomic_load(memory_order_seq_cst, thread_scope_<scope>)`	`fence.sc.<scope>; ld.acquire.<scope>;`
`atomic_store(memory_order_seq_cst, thread_scope_<scope>)`	`fence.sc.<scope>; st.release.<scope>;`
`atomic_<rmw op>(memory_order_seq_cst, thread_scope_<scope>)`	`fence.sc.<scope>; atom.acq_rel.<scope>.<rmw op>;`
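To illustrate why both the load and the store rows in the table above carry a leading `fence.sc`, the following is a minimal sketch of the classic store-buffering litmus test written with standard C++ `std::atomic`. It is host-only code, so a host compiler emits host instructions rather than PTX; the comments only indicate which PTX sequence the table gives for the analogous scoped CUDA C++ operation. The names `SbResult`, `run_store_buffering`, and `sc_forbids_both_zero` are hypothetical helpers introduced for this example.

```cpp
#include <atomic>
#include <thread>

// Values observed by the two threads in one run of the litmus test.
struct SbResult {
    int r0;  // value thread 0 read from y
    int r1;  // value thread 1 read from x
};

// Runs the store-buffering test once using sequentially consistent operations.
// PTX in the comments is the table's mapping for the analogous scoped
// operation; it is shown for orientation only and is not generated here.
inline SbResult run_store_buffering() {
    std::atomic<int> x{0};
    std::atomic<int> y{0};
    SbResult result{-1, -1};

    std::thread t0([&] {
        // fence.sc.<scope>; st.release.<scope>;
        x.store(1, std::memory_order_seq_cst);
        // fence.sc.<scope>; ld.acquire.<scope>;
        result.r0 = y.load(std::memory_order_seq_cst);
    });
    std::thread t1([&] {
        // fence.sc.<scope>; st.release.<scope>;
        y.store(1, std::memory_order_seq_cst);
        // fence.sc.<scope>; ld.acquire.<scope>;
        result.r1 = x.load(std::memory_order_seq_cst);
    });

    // Joining makes each thread's write to its own result field visible here.
    t0.join();
    t1.join();
    return result;
}

// Returns true if no run observed r0 == 0 and r1 == 0 together, the one
// outcome that sequential consistency forbids for this test.
inline bool sc_forbids_both_zero(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        const SbResult r = run_store_buffering();
        if (r.r0 == 0 && r.r1 == 0) {
            return false;
        }
    }
    return true;
}
```

With release stores and acquire loads alone, both threads could read 0, because nothing orders a thread's store before its own later load of a different location. In the mapping above, the `fence.sc` that precedes each load falls between that thread's store and its load, which is what rules out the both-zero outcome. Calling `sc_forbids_both_zero(200)` is expected to return `true` on a conforming implementation.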