Memory Operations

This section describes the operations for performing scatter and gather style memory accesses. A gather load or scatter store operation receives a tile of pointers which may reference discontiguous regions of memory. Each pointer is loaded or stored by the operation simultaneously.

Example

A sparse collection of elements in the \(4 \times 4\) array is loaded into a dense tile.

namespace ct = ::cuda::tiles;

int x[4][4] {
  {0,  1,  2,  3},
  {4,  5,  6,  7},
  {8,  9,  10, 11},
  {12, 13, 14, 15}
};

using i32x2x2 = ct::tile<int, ct::shape<2, 2>>;
using ptr_2x2 = ct::tile<int*, ct::shape<2, 2>>;

// [2, 11, 4, 13]
i32x2x2 idx = (2 + 9 * ct::iota<i32x2x2>()) % 16;

// [p + 2, p + 11, p + 4, p + 13]
ptr_2x2 ptrs = &x[0][0] + idx;

// [*(p + 2), *(p + 11), *(p + 4), *(p + 13)]
i32x2x2 r = ct::load(ptrs);
\[\begin{split}\begin{pmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \\ 8 & 9 & 10 & 11 \\ 12 & 13 & 14 & 15 \end{pmatrix} \rightarrow \begin{pmatrix} 2 & 11 \\ 4 & 13 \end{pmatrix}\end{split}\]

cuda::tiles::loadable_tile

template<typename T>
concept loadable_tile = ct::pointer_tile<T> && /* atomic constraint */

Indicates whether a load may occur on a pointer tile object of type \(T\).

The atomic constraint validates that:

  1. The element type of \(T\) is not a pointer to (possibly cv-qualified) void.

  2. The pointee type of the element type of \(T\) is not volatile qualified.

cuda::tiles::storeable_tile

template<typename T>
concept storeable_tile = ct::loadable_tile<T> && /* atomic constraint */

Indicates whether a store may occur on a pointer tile object of type \(T\).

The atomic constraint validates that the pointee type of the element type of \(T\) is not const qualified.

cuda::tiles::tile_load_t

template<ct::loadable_tile T>
using tile_load_t = ct::tile_with_element_t<T, remove-cv-t<remove-pointer-t<ct::tile_element_t<T>>>>

Yields the result type when loading from a pointer tile object of type \(T\).

Load Operations

template<
ct::loadable_tile Tile
>
__tile__ ct::tile_load_t<Tile> load(Tile ptrs) noexcept;
template<
ct::loadable_tile Tile,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Tile>>
__tile__ ct::tile_load_t<Tile> load_masked(Tile ptrs, Mask mask) noexcept;
template<
ct::loadable_tile Tile,
ct::bool_tile_convertible Mask,
ct::tile_like Padding
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Tile>> && ct::broadcastable_to<Padding, ct::tile_shape_t<Tile>> && /* atomic constraint*/
__tile__ ct::tile_load_t<Tile> load_masked(Tile ptrs, Mask mask, Padding padding) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::loadable_tile Tile
>
requires ct::read_memory_order<Order>
__tile__ ct::tile_load_t<Tile> atomic_load(Tile ptrs, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::loadable_tile Tile,
ct::bool_tile_convertible Mask
>
requires (ct::broadcastable_to<Mask, ct::tile_shape_t<Tile>> && ct::read_memory_order<Order>)
__tile__ ct::tile_load_t<Tile> atomic_load_masked(Tile ptrs, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::loadable_tile Tile,
ct::bool_tile_convertible Mask,
ct::tile_like Padding
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Tile>> && ct::broadcastable_to<Padding, ct::tile_shape_t<Tile>> && ct::read_memory_order<Order> && /* atomic constraint */
__tile__ ct::tile_load_t<Tile> atomic_load_masked(Tile ptrs, Mask mask, Padding padding, ct::memory_order_constant<Order> order = {}, ct::thread_scope_constant<Scope> scope = {}) noexcept;

Loads the values specified by the pointers in ptrs into a tile. A load may be masked by mask in which case the corresponding result value is either unspecified or supplied by the padding argument.

When present, the mask argument undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

When present, the padding argument is broadcast converted to the shape of ptrs and then tile converted to the type ct::tile_load_t<Tile>.

Let \(m\) and \(p\) denote the converted mask and padding.

The result is a tile \(a\) whose value at index \(J\) is determined as follows:

  1. If the mask \(m(J)\) is true or mask is absent, the value is the result of dereferencing the pointer \(\text{ptrs}(J)\). If this results in undefined behavior for any index \(J\), the behavior of the operation as a whole is undefined.

  2. If \(m(J)\) is false and padding is absent, the value is unspecified.

  3. Otherwise, \(m(J)\) is false and padding is specified. The result is the padding value \(p(J)\).

An invocation generates a read memory operation on each location specified by ptrs for which the corresponding value in \(m\) is true or otherwise not present.

For the atomic_load and atomic_load_masked overloads, the generated memory operations are strong and have a memory order and thread scope specified by Order and Scope respectively.

The latency optimization hint may appertain to direct call expressions of the above load APIs.

The atomic constraint validates that the element type of padding is scalar convertible to the element type of ct::tile_load_t<Tile>.

Example

The following example shows a masked load with both atomic and non-atomic variants. The pointers for which mask is false are not loaded and their values in the result are taken from the padding.

namespace ct = ::cuda::tiles;

int data[4] = { 2, 7, 5, 8 };
bool maskData[4] = {true, false, false, true};
int padData[4] = {-7, -3, -22, -100};

using i32x4 = ct::tile<int, ct::shape<4>>;

auto mask = ct::load(&maskData[0] + ct::iota<i32x4>());
auto padding = ct::load(&padData[0] + ct::iota<i32x4>());
auto ptrs = &data[0] + ct::iota<i32x4>();

// Non-atomic
auto r0 = ct::load_masked(ptrs, mask, padding);

// Atomic
auto r1 = ct::atomic_load_masked(ptrs, mask, padding,
                                 ct::memory_order_relaxed_t{},
                                 ct::thread_scope_device_t{});
\[\begin{pmatrix} 2 & 7 & 5 & 8 \end{pmatrix} \rightarrow \begin{pmatrix} 2 & -3 & -22 & 8 \end{pmatrix}\]

Store Operations

template<
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Value
>
__tile__ void store(Ptrs ptrs, Value value) noexcept;
template<
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Value,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>>
__tile__ void store_masked(Ptrs ptrs, Value value, Mask mask) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Value
>
requires ct::write_memory_order<Order>
__tile__ void atomic_store(Ptrs ptrs, Value value, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Value,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && ct::write_memory_order<Order>
__tile__ void atomic_store_masked(Ptrs ptrs, Value value, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Stores the elements specified by value to the memory locations of the corresponding elements of ptrs. A store may be inhibited by specifying mask values for the corresponding elements in ptrs.

The value argument undergoes tile conversion to the type tile_load_t<Ptrs>.

When present, mask undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(v\) and \(m\) denote the converted value and mask arguments respectively and let \(J\) be an index into ptrs.

If mask is absent or \(m(J)\) is true, the value \(v(J)\) is stored at the memory location specified by \(\text{ptrs}(J)\). Otherwise, no store occurs for that index.

An invocation generates a write memory operation for each location specified by ptrs whose corresponding element in \(m\) is not false. For the atomic_store and atomic_store_masked variants, the generated memory operations are strong and have memory order and thread scope specified by Order and Scope respectively.

The latency optimization hint may appertain to direct call expressions of the above store APIs.

Note

Multiple memory operations may be generated on a single memory location if the same pointer value is present more than once in ptrs. This results in undefined behavior for the non-atomic overloads.

Example

namespace ct = ::cuda::tiles;
using i32x4 = ct::tile<int, ct::shape<4>>;

int data[4] = { 0, 1, 2, 3 };
bool maskData[4] = {true, false, false, true};

using i32x4 = ct::tile<int, ct::shape<4>>;

auto mask = ct::load(&maskData[0] + ct::iota<i32x4>());
auto value = ct::full<i32x4>(-1);
auto ptrs = &data[0] + ct::iota<i32x4>();

// Non-atomic
ct::store_masked(ptrs, value, mask);

// Atomic
ct::atomic_store_masked(ptrs, value, mask,
                        ct::memory_order_relaxed_t{},
                        ct::thread_scope_device_t{});
\[\begin{pmatrix} 0 & 1 & 2 & 3 \end{pmatrix} \rightarrow \begin{pmatrix} -1 & 1 & 2 & -1 \end{pmatrix}\]

cuda::tiles::atomic_compare_exchange

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Cmp,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Val
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_compare_exchange(Ptrs ptrs, Cmp cmp, Val val, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Cmp,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Val,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_compare_exchange_masked(Ptrs ptrs, Cmp cmps, Val vals, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs an elementwise atomic compare exchange of the memory locations of ptrs.

The values cmps and vals undergo tile conversion to the type ct::tile_load_t<Ptrs>. If mask is specified, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(c\), \(v\), and \(m\) denote corresponding elements of the converted operands ptrs, cmp, val, and mask respectively.

When \(m\) is true or mask is not specified, the value representations of \(c\) and the object at location \(p\) are compared. If they are bitwise identical, \(v\) is stored to the address \(p\). The corresponding element of the result is the value under \(p\) that was used for the comparison check.

Otherwise, \(m\) is false and no read, comparison, or store occurs. The corresponding element of the result is the comparison value \(c\).

A call to this API generates a strong read-write memory operation for each memory location \(p\) for which the mask \(m\) is true or not specified. The value read by the memory operation is the value under \(p\) that was used in the comparison check. The value written by the memory operation is \(v\) if the comparison succeeds or the value under \(p\) that was used for the comparison check otherwise.

The memory order and scope of these memory operations are specified by Order and Scope respectively.

The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is one of

  1. A 32 bit or 64 bit integral scalar

  2. A float or double

Note

A read-write memory operation is generated for a given element even if the comparison failed. In this scenario, the value written is the value that was read for the comparison check.

A caller may test if the compare and exchange succeeded by performing a bitwise comparison of the return value with the converted cmp. Equality comparison may behave differently than bitwise comparison for floating point values.

cuda::tiles::atomic_and

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_and(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_and_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs elementwise atomic bitwise AND on the memory locations of ptrs.

The values operand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. If mask is present, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands ptrs, values, and mask when present.

If \(m\) is true or not present, \(p\) is loaded to produce a value \(k\). The bitwise AND between \(v\) and \(k\) is computed as if by invoking ct::operator&(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\)

Otherwise, \(m\) is false and no read, computation, or write is performed. The corresponding element in the return value is unspecified.

A call to this API generates a strong read-write memory operation for each address of ptrs which is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined by Order and Scope respectively.

The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.

cuda::tiles::atomic_or

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_or(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_or_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs elementwise atomic bitwise OR on the memory locations of ptrs.

The values operand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. If mask is present, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands ptrs, values, and mask when present.

If \(m\) is true or not present, \(p\) is loaded to produce a value \(k\). The bitwise OR between \(v\) and \(k\) is computed as if by invoking ct::operator|(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\)

Otherwise, \(m\) is false and no read, computation, or write is performed. The corresponding element in the return value is unspecified.

A call to this API generates a strong read-write memory operation for each address of ptrs which is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined by Order and Scope respectively.

The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.

cuda::tiles::atomic_xor

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_xor(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_xor_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs elementwise atomic bitwise XOR on the memory locations of ptrs.

The values operand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. If mask is present, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands ptrs, values, and mask when present.

If \(m\) is true or not present, \(p\) is loaded to produce a value \(k\). The bitwise XOR between \(v\) and \(k\) is computed as if by invoking ct::operator^(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\)

Otherwise, \(m\) is false and no read, computation, or write is performed. The corresponding element in the return value is unspecified.

A call to this API generates a strong read-write memory operation for each address of ptrs which is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined by Order and Scope respectively.

The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.

cuda::tiles::atomic_max

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_max(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_max_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs elementwise atomic maximum on the memory locations of ptrs.

The values operand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. If mask is present, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands ptrs, values, and mask when present.

If \(m\) is true or not present, \(p\) is loaded to produce a value \(k\). The maximum between \(v\) and \(k\) is computed as if by invoking ct::max(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\).

Otherwise, \(m\) is false and no read, computation, or write is performed. The corresponding element in the return value is unspecified.

A call to this API generates a strong read-write memory operation for each address of ptrs which is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined by Order and Scope respectively.

The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.

cuda::tiles::atomic_min

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_min(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_min_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs elementwise atomic minimum on the memory locations of ptrs.

The values operand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. If mask is present, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands ptrs, values, and mask when present.

If \(m\) is true or not present, \(p\) is loaded to produce a value \(k\). The minimum between \(v\) and \(k\) is computed as if by invoking ct::min(k, v) and the result is stored to \(p\). The corresponding element in the return value is \(k\).

Otherwise, \(m\) is false and no read, computation, or write is performed. The corresponding element in the return value is unspecified.

A call to this API generates a strong read-write memory operation for each address of ptrs which is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined by Order and Scope respectively.

The atomic constraint validates that the element type of ct::tile_load_t<Ptrs> is a \(32\) or \(64\) bit integral scalar type.

cuda::tiles::atomic_add

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_add(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_add_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs elementwise atomic addition on the memory locations of ptrs.

The values operand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. If mask is present, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands ptrs, values, and mask when present.

If \(m\) is true or not present, \(p\) is loaded to produce a value \(k\). The sum between \(v\) and \(k\) is computed as if by invoking

  1. If v is a integral scalar: ct::add(k, v). If this produces undefined behavior, the behavior of the operation as a whole is undefined.

  2. If v is a basic floating point scalar:

    ct::add<ct::rounding_mode::round_ties_to_even, SubMode>(k, v)

    where the value of SubMode is not specified.

The result of the computation is stored to \(p\). The corresponding element in the return value is \(k\).

Otherwise, \(m\) is false and no read, computation, or write is performed. The corresponding element in the return value is unspecified.

A call to this API generates a strong read-write memory operation for each address of ptrs which is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined by Order and Scope respectively.

The atomic constraint validates that element type of ct::tile_load_t<Ptrs> is one of

  1. A \(32\) or \(64\) bit integral scalar type

  2. double, float, or __half

cuda::tiles::atomic_sub

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_sub(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_sub_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs elementwise atomic subtraction on the memory locations of ptrs.

The values operand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. If mask is present, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands ptrs, values, and mask when present.

If \(m\) is true or not present, \(p\) is loaded to produce a value \(k\). The difference between \(v\) and \(k\) is computed as if by invoking

  1. If v is a integral scalar: ct::add(k, -v). If this produces undefined behavior, the behavior of the operation as a whole is undefined.

    Note

    The behavior is undefined if \(v\) is a signed integral type and overflow occurs in the unary negation even if the difference between \(k\) and \(v\) is representable in the target integer type.

  2. If v is a basic floating point scalar:

    ct::sub<ct::rounding_mode::round_ties_to_even, SubMode>(k, v)

    where the value of SubMode is not specified.

The result of the computation is stored to \(p\). The corresponding element in the return value is \(k\).

Otherwise, \(m\) is false and no read, computation, or write is performed. The corresponding element in the return value is unspecified.

A call to this API generates a strong read-write memory operation for each address of ptrs which is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined by Order and Scope respectively.

The atomic constraint validates that element type of ct::tile_load_t<Ptrs> is one of

  1. A \(32\) or \(64\) bit integral scalar type

  2. double, float, or __half

cuda::tiles::atomic_xchg

template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values
>
requires /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_xchg(Ptrs ptrs, Values values, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::storeable_tile Ptrs,
ct::non_narrowing_tile_convertible_to<ct::tile_load_t<Ptrs>> Values,
ct::bool_tile_convertible Mask
>
requires ct::broadcastable_to<Mask, ct::tile_shape_t<Ptrs>> && /* atomic constraint */
__tile__ ct::tile_load_t<Ptrs> atomic_xchg_masked(Ptrs ptrs, Values values, Mask mask, ct::memory_order_constant<Order> = {}, ct::thread_scope_constant<Scope> = {}) noexcept;

Performs elementwise atomic exchange on the memory locations of ptrs.

The values operand undergoes tile conversion to the type ct::tile_load_t<Ptrs>. If mask is present, it undergoes bool tile conversion followed by broadcast conversion to the shape of ptrs.

Let \(p\), \(v\), and \(m\) be corresponding elements of the converted operands ptrs, values, and mask when present.

If \(m\) is true or not present, \(p\) is loaded to produce a value \(k\) and \(v\) is subsequently stored to \(p\). The corresponding element in the return value is \(k\).

Otherwise, \(m\) is false and no read, computation, or write is performed. The corresponding element in the return value is unspecified.

A call to this API generates a strong read-write memory operation for each address of ptrs which is not masked. The value read is the value used for the computation and the value written is the result of the computation. The memory order and thread scope of the operation is determined by Order and Scope respectively.

The atomic constraint validates that element type of ct::tile_load_t<Ptrs> is one of

  1. A \(32\) or \(64\) bit integral scalar type

  2. double, or float