Atomics Application Binary Interface
The mappings of programming languages’ atomic operations to the PTX ISA need to
be implemented in a consistent manner across all programming languages that may
concurrently access shared memory.
The mapping from C++11 atomics to the PTX ISA for the CUDA architecture is proven correct in
"A Formal Analysis of the NVIDIA PTX Memory Consistency Model".
The PTX ISA provides atomic memory operations and fences for acquire, release,
acquire-release, and relaxed C++ memory ordering semantics.
Note
The memory order parameter is monotonic, so it is valid to strengthen any such parameter. For example, for sequentially
consistent loads it is valid to strengthen
fence.sc.<scope>; ld.relaxed.<scope>;
to
fence.sc.<scope>; ld.acquire.<scope>;
The same applies to all mappings below.
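To make "strengthening" concrete, here is a minimal sketch of the partial order on C++ memory orders. The helper names `has_acquire`, `has_release`, and `at_least_as_strong` are hypothetical and not part of any CUDA or PTX API. The sketch ignores `memory_order_consume` and does not check whether an order is legal for a given operation (for instance, a load cannot take a release order).

```cpp
#include <atomic>

// Hypothetical helpers for illustration only; not part of CUDA, PTX, or libcu++.

// True when the order includes acquire semantics.
constexpr bool has_acquire(std::memory_order mo) {
    return mo == std::memory_order_acquire ||
           mo == std::memory_order_acq_rel ||
           mo == std::memory_order_seq_cst;
}

// True when the order includes release semantics.
constexpr bool has_release(std::memory_order mo) {
    return mo == std::memory_order_release ||
           mo == std::memory_order_acq_rel ||
           mo == std::memory_order_seq_cst;
}

// True when `a` gives every guarantee that `b` gives, so `a` may replace `b`.
// Only seq_cst is as strong as seq_cst; acquire and release are incomparable.
constexpr bool at_least_as_strong(std::memory_order a, std::memory_order b) {
    return b == std::memory_order_seq_cst
               ? a == std::memory_order_seq_cst
               : (!has_acquire(b) || has_acquire(a)) &&
                 (!has_release(b) || has_release(a));
}

// The strengthening from the note: a relaxed load may become an acquire load.
static_assert(at_least_as_strong(std::memory_order_acquire,
                                 std::memory_order_relaxed),
              "acquire is a valid strengthening of relaxed");
```

Because acquire and release are incomparable, replacing one with the other is not a strengthening; only moving up to acq_rel or seq_cst is.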
Note
Where there is a choice of PTX ABI ISA mapping for a given C, C++, or CUDA C++ API, it is acceptable to pick either and to
mix mappings within the same binary.
The PTX ABI for C++ sequentially consistent atomic operations is the following:
| C or C++ or CUDA C++ API | PTX ABI ISA mapping |
|---|---|
| atomic_thread_fence(memory_order_seq_cst, thread_scope_<scope>) | fence.sc.<scope>; |
| atomic_load(memory_order_seq_cst, thread_scope_<scope>) | fence.sc.<scope>; ld.relaxed.<scope>; |
| atomic_store(memory_order_seq_cst, thread_scope_<scope>) | fence.sc.<scope>; st.relaxed.<scope>; |
| atomic_<rmw op>(memory_order_seq_cst, thread_scope_<scope>) | fence.sc.<scope>; atom.acquire.<scope>.<rmw op>; |
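As an illustration of where sequentially consistent operations matter, the following sketch runs the classic store-buffering litmus test on the host with `std::atomic`. Standard C++ has no `thread_scope_<scope>` parameter; the assumption is that equivalent CUDA C++ code would use scoped atomics such as `cuda::atomic` from libcu++. The PTX sequences in the comments are copied from the table and describe what such device code would map to, not what a host compiler emits. The function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering test once. Returns true if both loads observed 0,
// which sequential consistency forbids.
bool store_buffering_saw_both_zero() {
    std::atomic<int> x{0}, y{0};
    int r0 = -1, r1 = -1;

    std::thread t0([&] {
        // Table row for atomic_store: fence.sc.<scope>; st.relaxed.<scope>;
        x.store(1, std::memory_order_seq_cst);
        // Table row for atomic_load:  fence.sc.<scope>; ld.relaxed.<scope>;
        r0 = y.load(std::memory_order_seq_cst);
    });
    std::thread t1([&] {
        y.store(1, std::memory_order_seq_cst);
        r1 = x.load(std::memory_order_seq_cst);
    });
    t0.join();
    t1.join();
    return r0 == 0 && r1 == 0;
}

// Repeats the test and counts forbidden outcomes.
int count_forbidden_outcomes(int iterations) {
    assert(iterations >= 0);
    int forbidden = 0;
    for (int i = 0; i < iterations; ++i) {
        if (store_buffering_saw_both_zero()) {
            ++forbidden;
        }
    }
    return forbidden;
}
```

With any weaker memory order on these four operations, the outcome where both loads read 0 would be permitted; with `memory_order_seq_cst` the count stays at zero.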
The PTX ABI for C++ release atomic operations is the following:
| C or C++ or CUDA C++ API | PTX ABI ISA mapping |
|---|---|
| atomic_thread_fence(memory_order_release, thread_scope_<scope>) | fence.release.<scope>; |
| atomic_store(memory_order_release, thread_scope_<scope>) | fence.release.<scope>; st.relaxed.<scope>; |
| atomic_store(memory_order_release, thread_scope_<scope>) | st.release.<scope>; |
| atomic_<rmw op>(memory_order_release, thread_scope_<scope>) | fence.release.<scope>; atom.relaxed.<scope>.<rmw op>; |
| atomic_<rmw op>(memory_order_release, thread_scope_<scope>) | atom.release.<scope>.<rmw op>; |
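The two alternative mappings for a release store have direct source-level analogues. The sketch below, written with host-side `std::atomic` as a stand-in for scoped CUDA C++ atomics, publishes a payload either with a release fence followed by a relaxed store or with a single release store. `Mailbox` and the `publish_*` functions are hypothetical names, and the PTX comments restate the table rows rather than actual compiler output.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-use mailbox: plain payload guarded by an atomic flag.
struct Mailbox {
    int payload = 0;
    std::atomic<int> ready{0};
};

// Variant A: release fence, then relaxed store.
// Same shape as the row: fence.release.<scope>; st.relaxed.<scope>;
void publish_with_fence(Mailbox& m, int value) {
    assert(m.ready.load(std::memory_order_relaxed) == 0 && "already published");
    m.payload = value;
    std::atomic_thread_fence(std::memory_order_release);
    m.ready.store(1, std::memory_order_relaxed);
}

// Variant B: a single release store.
// Same shape as the row: st.release.<scope>;
void publish_with_store(Mailbox& m, int value) {
    assert(m.ready.load(std::memory_order_relaxed) == 0 && "already published");
    m.payload = value;
    m.ready.store(1, std::memory_order_release);
}
```

Both variants order the write to `payload` before the flag becomes visible to a reader that acquires the flag, which is why either row is an acceptable choice.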
The PTX ABI for C++ acquire atomic operations is the following:
| C or C++ or CUDA C++ API | PTX ABI ISA mapping |
|---|---|
| atomic_thread_fence(memory_order_acquire, thread_scope_<scope>) | fence.acquire.<scope>; |
| atomic_load(memory_order_acquire, thread_scope_<scope>) | ld.relaxed.<scope>; fence.acquire.<scope>; |
| atomic_load(memory_order_acquire, thread_scope_<scope>) | ld.acquire.<scope>; |
| atomic_<rmw op>(memory_order_acquire, thread_scope_<scope>) | atom.relaxed.<scope>.<rmw op>; fence.acquire.<scope>; |
| atomic_<rmw op>(memory_order_acquire, thread_scope_<scope>) | atom.acquire.<scope>.<rmw op>; |
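Acquire operations pair with release operations in another thread. The following message-passing sketch uses host-side `std::atomic` as a stand-in for scoped CUDA C++ atomics and lets the consumer pick between the two acquire-load shapes listed in the table. The function name `message_passing` and its `use_fence` parameter are hypothetical; the PTX comments restate table rows.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Producer publishes `value`; consumer waits for the flag and reads it back.
// Returns the payload the consumer observed.
int message_passing(int value, bool use_fence) {
    int payload = 0;             // plain, non-atomic data
    std::atomic<int> flag{0};    // publication flag
    int seen = -1;

    std::thread producer([&] {
        payload = value;
        flag.store(1, std::memory_order_release);  // st.release.<scope>;
    });
    std::thread consumer([&] {
        if (use_fence) {
            // Shape of the row: ld.relaxed.<scope>; fence.acquire.<scope>;
            while (flag.load(std::memory_order_relaxed) == 0) {
                std::this_thread::yield();
            }
            std::atomic_thread_fence(std::memory_order_acquire);
        } else {
            // Shape of the row: ld.acquire.<scope>;
            while (flag.load(std::memory_order_acquire) == 0) {
                std::this_thread::yield();
            }
        }
        seen = payload;  // ordered after the producer's write to payload
    });
    producer.join();
    consumer.join();

    assert(seen == value);  // guaranteed by the release/acquire pairing
    return seen;
}
```

Calling `message_passing(42, false)` and `message_passing(42, true)` both return 42; the two consumer paths differ only in where the acquire ordering is attached.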
The PTX ABI for C++ acquire-release atomic operations is the following:
| C or C++ or CUDA C++ API | PTX ABI ISA mapping |
|---|---|
| atomic_thread_fence(memory_order_acq_rel, thread_scope_<scope>) | fence.acq_rel.<scope>; |
| atomic_<rmw op>(memory_order_acq_rel, thread_scope_<scope>) | atom.acq_rel.<scope>.<rmw op>; OR fence.release.<scope>; atom.acquire.<scope>.<rmw op>; |
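A common use of an acquire-release read-modify-write is dropping a reference count: the decrement must release the caller's earlier writes and, for the thread that drops the last reference, acquire everyone else's. The sketch below shows this with host-side `std::atomic`; `RefCount`, `retain`, and `release_ref` are hypothetical names, and the PTX comment restates the table row for a scoped CUDA C++ atomic.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical reference count; starts with one owner.
struct RefCount {
    std::atomic<int> count{1};
};

// Taking another reference needs no ordering of its own.
void retain(RefCount& rc) {
    rc.count.fetch_add(1, std::memory_order_relaxed);
}

// Drops one reference. Returns true when the caller held the last one and
// may therefore destroy the shared object.
bool release_ref(RefCount& rc) {
    // Table row: atom.acq_rel.<scope>.<rmw op>;
    //        OR  fence.release.<scope>; atom.acquire.<scope>.<rmw op>;
    int previous = rc.count.fetch_sub(1, std::memory_order_acq_rel);
    assert(previous > 0 && "released more references than were held");
    return previous == 1;
}
```

For example, after one `retain`, the first `release_ref` returns false and the second returns true.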
The PTX ABI for C++ relaxed atomic operations is the following:
| C or C++ or CUDA C++ API | PTX ABI ISA mapping |
|---|---|
| atomic_load(memory_order_relaxed, thread_scope_<scope>) | ld.relaxed.<scope>; |
| atomic_store(memory_order_relaxed, thread_scope_<scope>) | st.relaxed.<scope>; |
| atomic_<rmw op>(memory_order_relaxed, thread_scope_<scope>) | atom.relaxed.<scope>.<rmw op>; |
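Relaxed operations guarantee atomicity but impose no ordering on surrounding memory accesses, which is enough for a plain event counter. The sketch below uses host-side `std::atomic` and `std::thread` as a stand-in for scoped CUDA C++ atomics; `count_events` is a hypothetical name and the PTX comments restate the table rows.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each worker increments a shared counter; returns the final total.
int count_events(int num_threads, int events_per_thread) {
    assert(num_threads >= 0 && events_per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, events_per_thread] {
            for (int i = 0; i < events_per_thread; ++i) {
                // Table row: atom.relaxed.<scope>.<rmw op>;
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    // join() orders every increment before the final read below.
    for (auto& w : workers) {
        w.join();
    }
    // Table row: ld.relaxed.<scope>;
    return counter.load(std::memory_order_relaxed);
}
```

No increment is lost even though nothing is ordered between workers, so `count_events(4, 1000)` returns 4000.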