Partition View

A ct::partition_view is a wrapper around a tensor span like type that divides it into statically sized tiles which may be loaded and stored from memory.

Example

The following code subdivides the provided \(4 \times 8\) tensor span into partitions of size \(2 \times 2\). The partition identified by index \((1, 2)\) is loaded.

namespace ct = ::cuda::tiles;
using namespace ct::literals;
int x[4][8] = {
  {0, 1, 2, 3, 4, 5, 6, 7},
  {8, 9, 10, 11, 12, 13, 14, 15},
  {16, 17, 18, 19, 20, 21, 22, 23},
  {24, 25, 26, 27, 28, 29, 30, 31}
};

ct::tensor_span t{&x[0][0], ct::extents{4_ic, 8_ic}};
ct::partition_view p{t, ct::shape{2_ic, 2_ic}};
auto r = p.load(1, 2);
\[\begin{split}\left( \begin{array}{cc|cc|cc|cc} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ \hline 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 \\ 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31 \end{array} \right) \rightarrow \begin{pmatrix} 20 & 21 \\ 28 & 29 \end{pmatrix}\end{split}\]

cuda::tiles::partition_view

template<ct::tensor_span_like Span, ct::tile_shape Shape>
requires (Shape::rank() == Span::extents_type::rank())
struct partition_view

All specializations \(T\) of ct::partition_view model std::copyable.

\(T\) satisfies the following constraints if they are satisfied by Span:

  1. std::is_nothrow_move_constructible_v<T>

  2. std::is_nothrow_move_assignable_v<T>

  3. std::is_nothrow_swappable_v<T>

Aliases

using span_type = Span

The type of the wrapped tensor span.

using view_shape_type = Shape

The shape of the tile that will be loaded or stored.

using element_type = typename span_type::element_type
using value_type = typename span_type::value_type
using index_type = typename span_type::index_type
using view_tile_type = ct::tile<value_type, view_shape_type>

Exposition Only Members

span_type __span

Exposition only member containing the wrapped tensor span object. This member exists for the purpose of defining the behavior of the implicitly defined special member functions.

Exposition Only Definitions

partition view mapping

The partition view mapping is a function that associates an index of a partition to potential set of indices of the underlying tensor span which are to be loaded or stored.

Let \(S\) denote view_shape_type and let \(N\) be its rank.

Let \(I = (i_0, i_1, \ldots, i_{N-1})\) be a partition index and \(J = (j_0, j_1, \ldots, j_{N-1})\) be an index in the index space of \(S\).

The mapping is a new index

\[p(I, J) = (i_0 \cdot S_0 + j_0, \ldots, i_{N-1} \cdot S_{N-1} + j_{N-1})\]

Note

The result of the mapping might not be an element in the tensor span’s index space.

partition view index space

The index space of the partition view is the set of partition indices which correspond to in bounds or partially out of bound partitions.

Let \(S\) denote view_shape_type, let \(e\) denote span_type::extents_type and let \(N\) be their rank. The index space is the set of non-negative indices \(I = (i_0, i_1, \ldots, i_{N-1})\) satisfying

\[i_k \cdot S_k < e_k \quad 0 \leq k < N\]

Constructor

__tile__ __host__ __device__ partition_view(span_type span, view_shape_type)

Constructs this object by direct-list-initializing __span from span.

Loads

template<
typename ...Idx
>
requires /* atomic constraint */
__tile__ view_tile_type load(Idx...) const noexcept;
template<
ct::view_padding Pad = ct::default_view_padding(),
typename ...Idx
>
requires /* atomic constraint */
__tile__ view_tile_type load_masked(Idx...) const noexcept;
template<
ct::view_padding Pad = ct::default_view_padding(),
typename ...Idx
>
requires /* atomic constraint */
__tile__ view_tile_type load_masked(ct::view_padding_constant<Pad>, Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
typename ...Idx
>
requires ct::read_memory_order<Order> && /* atomic constraint */
__tile__ view_tile_type atomic_load(Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
typename ...Idx
>
requires ct::read_memory_order<Order> && /* atomic constraint */
__tile__ view_tile_type atomic_load(ct::memory_order_constant<Order>, Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
typename ...Idx
>
requires ct::read_memory_order<Order> && /* atomic constraint */
__tile__ view_tile_type atomic_load(ct::memory_order_constant<Order>, ct::thread_scope_constant<Scope>, Idx...) const noexcept;
template<
ct::view_padding Pad = ct::default_view_padding(),
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
typename ...Idx
>
requires ct::read_memory_order<Order> && /* atomic constraint */
__tile__ view_tile_type atomic_load_masked(Idx...) const noexcept;
template<
ct::view_padding Pad = ct::default_view_padding(),
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
typename ...Idx
>
requires ct::read_memory_order<Order> && /* atomic constraint */
__tile__ view_tile_type atomic_load_masked(ct::view_padding_constant<Pad>, Idx...) const noexcept;
template<
ct::view_padding Pad = ct::default_view_padding(),
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
typename ...Idx
>
requires ct::read_memory_order<Order> && /* atomic constraint */
__tile__ view_tile_type atomic_load_masked(ct::view_padding_constant<Pad>, ct::memory_order_constant<Order>, Idx...) const noexcept;
template<
ct::view_padding Pad = ct::default_view_padding(),
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
typename ...Idx
>
requires ct::read_memory_order<Order> && /* atomic constraint */
__tile__ view_tile_type atomic_load_masked(ct::view_padding_constant<Pad>, ct::memory_order_constant<Order>, ct::thread_scope_constant<Scope>, Idx...) const noexcept;

Loads a tile from the partition specified by the indices Idx.

Example

In the following example, a \(4 \times 11\) tensor span is partitioned into \(2 \times 4\) chunks. A partially out of bounds chunk is loaded using NaN padding values.

namespace ct = ::cuda::tiles;
using namespace ct::literals;
float x[4][11] = {
  {0  , 1  , 2  , 3  , 4  , 5  , 6  , 7  , 8  , 9  , 10},
  {11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21},
  {22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32},
  {33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43}
};

ct::tensor_span t{&x[0][0], ct::extents{4_ic, 11_ic}};
ct::partition_view p{t, ct::shape{2_ic, 4_ic}};

auto r = p.load_masked(ct::view_padding_nan_t{}, 0, 2);
\[\begin{split}\left( \begin{array}{cccc|cccc|ccc} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 & 16 & 17 & 18 & 19 & 20 & 21 \\ \hline 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31 & 32 \\ 33 & 34 & 35 & 36 & 37 & 38 & 39 & 40 & 41 & 42 & 43 \end{array} \right) \rightarrow \begin{pmatrix} 8 & 9 & 10 & \text{NaN} \\ 19 & 20 & 21 & \text{NaN} \end{pmatrix}\end{split}\]

Let \(I\) denote the values specified by the pack Idx and let \(J\) be an index in the index space of view_tile_type.

The value returned by the load is a tile object \(a\) satisfying

\[a(J) = t(p(I, J))\]

where \(p\) is the partition view mapping of this object and \(t\) is the tensor span function of __span.

If the value \(p(I, J)\) is not in index space of __span, the behavior depends on the selected overload:

  • For the non-masked overloads, the behavior is undefined.

  • For masked overloads, the value of \(a(J)\) is the view padding specified by Pad.

The behavior is undefined if any of the following hold:

  1. a value of the Idx pack is not representable in index_type

  2. \(I\) is outside the partition view’s index space

  3. The tensor span function is not injective

An invocation generates a read memory operation at the location \(t(p(I, J))\) for each \(J\) in the index space of __span which is not masked.

For the atomic overloads, the memory operations are strong and have thread scope specified by Scope and memory order specified by Order

The latency and allow_tma optimization hints may appertain to direct call expressions of the above load APIs.

The atomic constraint requires that

  1. The size of the Idx pack matches the rank of view_shape_type.

  2. Each element of the Idx pack scalar convertible to index_type

  3. If Pad is present and its value is not zero view padding, then element_type is a basic floating point scalar.

  4. When specified, the values Pad, Order, and Scope are all enumerators of their respective types.

Note

The indices Idx specify which partition should be loaded. A fully out of bounds partition always yields undefined behavior. A partially out of bounds partition yields undefined behavior for the non-masked variants.

Stores

template<
ct::tile_like Value,
typename ...Idx
>
requires /* atomic constraint */ && ct::non_narrowing_tile_convertible_to<Value, view_tile_type> && ct::storeable_tensor_span<span_type>
__tile__ void store(Value a, Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::tile_like Value,
typename ...Idx
>
requires ct::write_memory_order<Order> && ct::non_narrowing_tile_convertible_to<Value, view_tile_type> && ct::storeable_tensor_span<span_type> && /* atomic constraint */
__tile__ void atomic_store(Value a, Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::tile_like Value,
typename ...Idx
>
requires ct::write_memory_order<Order> && ct::non_narrowing_tile_convertible_to<Value, view_tile_type> && ct::storeable_tensor_span<span_type> && /* atomic constraint*/
__tile__ void atomic_store(Value a, ct::memory_order_constant<Order>, Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::tile_like Value,
typename ...Idx
>
requires ct::write_memory_order<Order> && ct::non_narrowing_tile_convertible_to<Value, view_tile_type> && ct::storeable_tensor_span<span_type> && /* atomic constraint */
__tile__ void atomic_store(Value a, ct::memory_order_constant<Order>, ct::thread_scope_constant<Scope>, Idx...) const noexcept;
template<
ct::tile_like Value,
typename ...Idx
>
requires ct::non_narrowing_tile_convertible_to<Value, view_tile_type> && ct::storeable_tensor_span<span_type> && /* atomic constraint */
__tile__ void store_masked(Value a, Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::tile_like Value,
typename ...Idx
>
requires ct::write_memory_order<Order> && ct::non_narrowing_tile_convertible_to<Value, view_tile_type> && ct::storeable_tensor_span<span_type> && /* atomic constraint */
__tile__ void atomic_store_masked(Value a, Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::tile_like Value,
typename ...Idx
>
requires ct::write_memory_order<Order> && ct::non_narrowing_tile_convertible_to<Value, view_tile_type> && ct::storeable_tensor_span<span_type> && /* atomic constraint */
__tile__ void atomic_store_masked(Value a, ct::memory_order_constant<Order>, Idx...) const noexcept;
template<
ct::memory_order Order,
ct::thread_scope Scope = ct::default_thread_scope(),
ct::tile_like Value,
typename ...Idx
>
requires ct::write_memory_order<Order> && ct::non_narrowing_tile_convertible_to<Value, view_tile_type> && ct::storeable_tensor_span<span_type> && /* atomic constraint */
__tile__ void atomic_store_masked(Value a, ct::memory_order_constant<Order>, ct::thread_scope_constant<Scope>, Idx...) const noexcept;

Stores the tile converted operand a to the partition specified by indices Idx.

Example

In the following example, a \(4 \times 8\) tensor span is partitioned into \(2 \times 2\) chunks and the bottom right partition is updated with a new value.

namespace ct = ::cuda::tiles;
using namespace ct::literals;
int x[4][8] = {
  {0, 1, 2, 3, 4, 5, 6, 7},
  {8, 9, 10, 11, 12, 13, 14, 15},
  {16, 17, 18, 19, 20, 21, 22, 23},
  {24, 25, 26, 27, 28, 29, 30, 31}
};

ct::tensor_span t{&x[0][0], ct::extents{4_ic, 8_ic}};
ct::partition_view p{t, ct::shape{2_ic, 2_ic}};

auto a = 100 * ct::iota<ct::tile<int, ct::shape<2, 2>>>();
p.store(a, 1, 3);
\[\begin{split}\begin{pmatrix} 0 & 100 \\ 200 & 300 \end{pmatrix} \rightarrow \left( \begin{array}{cc|cc|cc|cc} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ \hline 16 & 17 & 18 & 19 & 20 & 21 & 0 & 100 \\ 24 & 25 & 26 & 27 & 28 & 29 & 200 & 300 \end{array} \right)\end{split}\]

Let \(I\) denote the values specified by the pack Idx and let \(J\) be an index in the index space of view_tile_type. Let \(a\) denote the value of a after tile conversion to view_tile_type.

The value \(a(J)\) is stored to the memory location

\[t(p(I, J))\]

where \(p\) is the partition view mapping of this object and \(t\) is the tensor span function of __span.

If the value \(p(I, J)\) is not in index space of __span, the behavior depends on the selected overload:

  • For non-masked overloads, the behavior is undefined

  • For masked overloads, no store occurs at that memory location

The behavior is undefined if any of the following hold:

  1. a value of the Idx pack is not representable in index_type

  2. \(I\) is outside the partition view’s index space

  3. The tensor span function is not injective

An invocation generates write memory operations for the values at the addresses \(t(p(I, J))\) for each \(J \in \mathbb{J}\) which is not masked.

For the atomic variants, the memory operations are strong and have thread scope specified by Scope and memory order specified by Order

The latency and allow_tma optimization hints may appertain to direct call expressions of the above store APIs.

The atomic constraint requires that

  1. The size of the Idx pack matches the rank of view_shape_type.

  2. Each element of the Idx pack scalar convertible to index_type

  3. When specified, the values Order, and Scope are all enumerators of their respective types.

span

span_type const &span() const noexcept;

Yields the glvalue __span.