Memory Operations
This section describes the operations for performing scatter and gather style memory accesses. A gather load or scatter store operation receives a tile of pointers which may reference discontiguous regions of memory. Each pointer is loaded or stored by the operation simultaneously.
Example
A sparse collection of elements in the \(4 \times 4\) array is loaded into a dense tile.
namespace ct = ::cuda::tiles;
int x[4][4] {
{0, 1, 2, 3},
{4, 5, 6, 7},
{8, 9, 10, 11},
{12, 13, 14, 15}
};
using i32x2x2 = ct::tile<int, ct::shape<2, 2>>;
using ptr_2x2 = ct::tile<int*, ct::shape<2, 2>>;
// [2, 11, 4, 13]
i32x2x2 idx = (2 + 9 * ct::iota<i32x2x2>()) % 16;
// [p + 2, p + 11, p + 4, p + 13]
ptr_2x2 ptrs = &x[0][0] + idx;
// [*(p + 2), *(p + 11), *(p + 4), *(p + 13)]
i32x2x2 r = ct::load(ptrs);
cuda::tiles::loadable_tile
-
template<typename T>
concept loadable_tile = ct::pointer_tile<T> && /* atomic constraint */
-
Indicates whether a load may occur on a pointer tile object of type \(T\).
The atomic constraint validates that:
The element type of \(T\) is not a pointer to (possibly cv-qualified)
void.The pointee type of the element type of \(T\) is not volatile qualified.
cuda::tiles::storeable_tile
-
template<typename T>
concept storeable_tile = ct::loadable_tile<T> && /* atomic constraint */
-
Indicates whether a store may occur on a pointer tile object of type \(T\).
The atomic constraint validates that the pointee type of the element type of \(T\) is not const qualified.
cuda::tiles::tile_load_t
-
template<ct::loadable_tile T>
using tile_load_t = ct::tile_with_element_t<T, remove-cv-t<remove-pointer-t<ct::tile_element_t<T>>>>
-
Yields the result type when loading from a pointer tile object of type \(T\).
Load Operations
-
template<
ct::loadable_tile Tile
>
__tile__ ct::tile_load_t<Tile> load(Tile ptrs) noexcept;
-
template<
ct::loadable_tile Tile,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Tile>>
__tile__ ct::tile_load_t<Tile> load_masked(Tile ptrs, Mask mask) noexcept;
-
template<
ct::loadable_tile Tile,
ct::bool_tile_convertible Mask,
ct::tile_like Padding
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Tile>> && ct::broadcastable_to<Padding, ct::tile_shape_t<Tile>> && /* atomic constraint*/
__tile__ ct::tile_load_t<Tile> load_masked(Tile ptrs, Mask mask, Padding padding) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::loadable_tile Tile
>
requires ct::read_memory_order<Order>
__tile__ ct::tile_load_t<Tile> atomic_load(Tile ptrs, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::loadable_tile Tile,
ct::bool_tile_convertible Mask
>
requires (ct::broadcastable_to<Mask, ct::tile_shape_t<Tile>> && ct::read_memory_order<Order>)
__tile__ ct::tile_load_t<Tile> atomic_load_masked(Tile ptrs, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::loadable_tile Tile,
ct::bool_tile_convertible Mask,
ct::tile_like Padding
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Tile>> && ct::broadcastable_to<Padding, ct::tile_shape_t<Tile>> && ct::read_memory_order<Order> && /* atomic constraint */
__tile__ ct::tile_load_t<Tile> atomic_load_masked(Tile ptrs, Mask mask, Padding padding, ct::memory_order_constant<Order> order = {}, ct::thread_scope_constant<Scope> scope = {}) noexcept;
-
Loads the values specified by the pointers in
ptrsinto a tile. A load may be masked bymaskin which case the corresponding result value is either unspecified or supplied by thepaddingargument.When present, the
maskargument undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.When present, the
paddingargument is broadcast converted to the shape ofptrsand then tile converted to the type ct::tile_load_t<Tile>.Let \(m\) and \(p\) denote the converted
maskandpadding.The result is a tile \(a\) whose value at index \(J\) is determined as follows:
If the mask \(m(J)\) is
trueormaskis absent, the value is the result of dereferencing the pointer \(\text{ptrs}(J)\). If this results in undefined behavior for any index \(J\), the behavior of the operation as a whole is undefined.If \(m(J)\) is
falseandpaddingis absent, the value is unspecified.Otherwise, \(m(J)\) is
falseandpaddingis specified. The result is the padding value \(p(J)\).
An invocation generates a read memory operation on each location specified by
ptrsfor which the corresponding value in \(m\) istrueor otherwise not present.For the
atomic_loadandatomic_load_maskedoverloads, the generated memory operations are strong and have a memory order and thread scope specified byOrderandScoperespectively.The latency optimization hint may appertain to direct call expressions of the above load APIs.
The atomic constraint validates that the element type of
paddingis scalar convertible to the element type of ct::tile_load_t<Tile>.Example
The following example shows a masked load with both atomic and non-atomic variants. The pointers for which
maskisfalseare not loaded and their values in the result are taken from thepadding.namespace ct = ::cuda::tiles; int data[4] = { 2, 7, 5, 8 }; bool maskData[4] = {true, false, false, true}; int padData[4] = {-7, -3, -22, -100}; using i32x4 = ct::tile<int, ct::shape<4>>; auto mask = ct::load(&maskData[0] + ct::iota<i32x4>()); auto padding = ct::load(&padData[0] + ct::iota<i32x4>()); auto ptrs = &data[0] + ct::iota<i32x4>(); // Non-atomic auto r0 = ct::load_masked(ptrs, mask, padding); // Atomic auto r1 = ct::atomic_load_masked(ptrs, mask, padding, ct::memory_order_relaxed_t{}, ct::thread_scope_device_t{});
\[\begin{pmatrix} 2 & 7 & 5 & 8 \end{pmatrix} \rightarrow \begin{pmatrix} 2 & -3 & -22 & 8 \end{pmatrix}\]
Store Operations
-
template<
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Value
>
__tile__ void store(Ptrs ptrs, Value value) noexcept;
-
template<
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Value,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>>
__tile__ void store_masked(Ptrs ptrs, Value value, Mask mask) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Value
>
requires ct::write_memory_order<Order>
__tile__ void atomic_store(Ptrs ptrs, Value value, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Value,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && ct::write_memory_order<Order>
__tile__ void atomic_store_masked(Ptrs ptrs, Value value, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Stores the elements specified by
valueto the memory locations of the corresponding elements ofptrs. A store may be inhibited by specifyingmaskvalues for the corresponding elements inptrs.The
valueargument undergoes tile conversion to the type tile_load_t<Ptrs>.When present,
maskundergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.Let \(v\) and \(m\) denote the converted
valueandmaskarguments respectively and let \(J\) be an index intoptrs.If
maskis absent or \(m(J)\) istrue, the value \(v(J)\) is stored at the memory location specified by \(\text{ptrs}(J)\). Otherwise, no store occurs for that index.An invocation generates a write memory operation for each location specified by
ptrswhose corresponding element in \(m\) is notfalse. For theatomic_storeandatomic_store_maskedvariants, the generated memory operations are strong and have memory order and thread scope specified byOrderandScoperespectively.The latency optimization hint may appertain to direct call expressions of the above store APIs.
Note
Multiple memory operations may be generated on a single memory location if the same pointer value is present more than once in
ptrs. This results in undefined behavior for the non-atomic overloads.Example
namespace ct = ::cuda::tiles; using i32x4 = ct::tile<int, ct::shape<4>>; int data[4] = { 0, 1, 2, 3 }; bool maskData[4] = {true, false, false, true}; using i32x4 = ct::tile<int, ct::shape<4>>; auto mask = ct::load(&maskData[0] + ct::iota<i32x4>()); auto value = ct::full<i32x4>(-1); auto ptrs = &data[0] + ct::iota<i32x4>(); // Non-atomic ct::store_masked(ptrs, value, mask); // Atomic ct::atomic_store_masked(ptrs, value, mask, ct::memory_order_relaxed_t{}, ct::thread_scope_device_t{});
\[\begin{pmatrix} 0 & 1 & 2 & 3 \end{pmatrix} \rightarrow \begin{pmatrix} -1 & 1 & 2 & -1 \end{pmatrix}\]
cuda::tiles::atomic_compare_exchange
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Cmp,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Val
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_compare_exchange(Ptrs ptrs, Cmp cmp, Val val, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Cmp,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Val,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_compare_exchange_masked(Ptrs ptrs, Cmp cmps, Val vals, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs an elementwise atomic compare exchange of the memory locations of
ptrs.The values
cmpsandvalsundergo tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis specified, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(c\), \(v\), and \(m\) denote corresponding elements of the converted operands
ptrs,cmp,val, andmaskrespectively.When \(m\) is
trueormaskis not specified, the value representations of \(c\) and the object at location \(p\) are compared. If they are bitwise identical, \(v\) is stored to the address \(p\). The corresponding element of the result is the value under \(p\) that was used for the comparison check.Otherwise, \(m\) is
falseand no read, comparison, or store occurs. The corresponding element of the result is the comparison value \(c\).A call to this API generates a strong read-write memory operation for each memory location \(p\) for which the mask \(m\) is
trueor not specified. The value read by the memory operation is the value under \(p\) that was used in the comparison check. The value written by the memory operation is \(v\) if the comparison succeeds or the value under \(p\) that was used for the comparison check otherwise.The memory order and scope of these memory operations are specified by
OrderandScoperespectively.The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is one of
A 32 bit or 64 bit integral scalar
A
floatordouble
Note
A read-write memory operation is generated for a given element even if the comparison failed. In this scenario, the value written is the value that was read for the comparison check.
A caller may test if the compare and exchange succeeded by performing a bitwise comparison of the return value with the converted
cmp. Equality comparison may behave differently than bitwise comparison for floating point values.
cuda::tiles::atomic_and
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_and(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_and_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs elementwise atomic bitwise AND on the memory locations of
ptrs.The
valuesoperand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis present, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands
ptrs,values, andmaskwhen present.If \(m\) is
trueor not present, \(p\) is loaded to produce a value \(k\). The bitwise AND between \(v\) and \(k\) is computed as if by invoking ct::operator&(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\)Otherwise, \(m\) is
falseand no read, computation, or write is performed. The corresponding element in the return value is unspecified.A call to this API generates a strong read-write memory operation for each address of
ptrswhich is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined byOrderandScoperespectively.The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.
cuda::tiles::atomic_or
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_or(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_or_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs elementwise atomic bitwise OR on the memory locations of
ptrs.The
valuesoperand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis present, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands
ptrs,values, andmaskwhen present.If \(m\) is
trueor not present, \(p\) is loaded to produce a value \(k\). The bitwise OR between \(v\) and \(k\) is computed as if by invoking ct::operator|(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\)Otherwise, \(m\) is
falseand no read, computation, or write is performed. The corresponding element in the return value is unspecified.A call to this API generates a strong read-write memory operation for each address of
ptrswhich is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined byOrderandScoperespectively.The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.
cuda::tiles::atomic_xor
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_xor(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_xor_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs elementwise atomic bitwise XOR on the memory locations of
ptrs.The
valuesoperand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis present, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands
ptrs,values, andmaskwhen present.If \(m\) is
trueor not present, \(p\) is loaded to produce a value \(k\). The bitwise XOR between \(v\) and \(k\) is computed as if by invoking ct::operator^(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\)Otherwise, \(m\) is
falseand no read, computation, or write is performed. The corresponding element in the return value is unspecified.A call to this API generates a strong read-write memory operation for each address of
ptrswhich is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined byOrderandScoperespectively.The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.
cuda::tiles::atomic_max
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_max(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_max_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs elementwise atomic maximum on the memory locations of
ptrs.The
valuesoperand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis present, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands
ptrs,values, andmaskwhen present.If \(m\) is
trueor not present, \(p\) is loaded to produce a value \(k\). The maximum between \(v\) and \(k\) is computed as if by invoking ct::max(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\).Otherwise, \(m\) is
falseand no read, computation, or write is performed. The corresponding element in the return value is unspecified.A call to this API generates a strong read-write memory operation for each address of
ptrswhich is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined byOrderandScoperespectively.The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.
cuda::tiles::atomic_min
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_min(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_min_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs elementwise atomic minimum on the memory locations of
ptrs.The
valuesoperand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis present, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands
ptrs,values, andmaskwhen present.If \(m\) is
trueor not present, \(p\) is loaded to produce a value \(k\). The minimum between \(v\) and \(k\) is computed as if by invoking ct::min(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\).Otherwise, \(m\) is
falseand no read, computation, or write is performed. The corresponding element in the return value is unspecified.A call to this API generates a strong read-write memory operation for each address of
ptrswhich is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined byOrderandScoperespectively.The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.
cuda::tiles::atomic_add
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_add(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_add_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs elementwise atomic addition on the memory locations of
ptrs.The
valuesoperand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis present, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands
ptrs,values, andmaskwhen present.If \(m\) is
trueor not present, \(p\) is loaded to produce a value \(k\). The sum between \(v\) and \(k\) is computed as if by invokingIf
vis a integral scalar: ct::add(k, v). If this produces undefined behavior, the behavior of the operation as a whole is undefined.-
If
vis a basic floating point scalar:ct::add<ct::rounding_mode::round_ties_to_even, SubMode>(k, v)
where the value of
SubModeis not specified.
The result of the computation is stored to \(p\). The corresponding element in the return value is \(k\).
Otherwise, \(m\) is
falseand no read, computation, or write is performed. The corresponding element in the return value is unspecified.A call to this API generates a strong read-write memory operation for each address of
ptrswhich is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined byOrderandScoperespectively.The atomic constraint validates that element type of ct::tile_load_t<Ptrs> is one of
A \(32\) or \(64\) bit integral scalar type
double,float, or__half
cuda::tiles::atomic_sub
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_sub(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_sub_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs elementwise atomic subtraction on the memory locations of
ptrs.The
valuesoperand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis present, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands
ptrs,values, andmaskwhen present.If \(m\) is
trueor not present, \(p\) is loaded to produce a value \(k\). The difference between \(v\) and \(k\) is computed as if by invoking-
If
vis a integral scalar: ct::add(k, -v). If this produces undefined behavior, the behavior of the operation as a whole is undefined.Note
The behavior is undefined if \(v\) is a signed integral type and overflow occurs in the unary negation even if the difference between \(k\) and \(v\) is representable in the target integer type.
-
If
vis a basic floating point scalar:ct::sub<ct::rounding_mode::round_ties_to_even, SubMode>(k, v)
where the value of
SubModeis not specified.
The result of the computation is stored to \(p\). The corresponding element in the return value is \(k\).
Otherwise, \(m\) is
falseand no read, computation, or write is performed. The corresponding element in the return value is unspecified.A call to this API generates a strong read-write memory operation for each address of
ptrswhich is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined byOrderandScoperespectively.The atomic constraint validates that element type of ct::tile_load_t<Ptrs> is one of
A \(32\) or \(64\) bit integral scalar type
double,float, or__half
-
cuda::tiles::atomic_xchg
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_xchg(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_xchg_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
-
Performs elementwise atomic exchange on the memory locations of
ptrs.The
valuesoperand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. Ifmaskis present, it undergoes bool tile conversion followed by broadcast conversion to the shape ofptrs.Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands
ptrs,values, andmaskwhen present.If \(m\) is
trueor not present, \(p\) is loaded to produce a value \(k\) and \(v\) is subsequently stored to \(p\). The corresponding element in the return value is \(k\).Otherwise, \(m\) is
falseand no read, computation, or write is performed. The corresponding element in the return value is unspecified.A call to this API generates a strong read-write memory operation for each address of
ptrswhich is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined byOrderandScoperespectively.The atomic constraint validates that element type of ct::tile_load_t<Ptrs> is one of
A \(32\) or \(64\) bit integral scalar type
double, orfloat