Tile Operations

This section documents various operations for manipulating tile like objects.

cuda::tiles::full

template<ct::tile_like T>
__tile__ T full(ct::tile_element_t<T> x) noexcept;

Yields a tile like object of type \(T\) whose elements all have value x.

Example

The code ct::full<ct::tile<int, ct::shape<2, 2>>>(42) produces the following matrix:

\[\begin{split}\begin{pmatrix} 42 & 42 \\ 42 & 42 \end{pmatrix}\end{split}\]

cuda::tiles::iota

template<ct::integral_tile T>
requires /* atomic constraint */
__tile__ T iota() noexcept;

Yields an integral tile of type \(T\) whose row major arrangement is the sequence \((0, 1, ..., N - 1)\) where \(N\) is the tile size of \(T\).

The atomic constraint validates that \(N - 1\) is representable in the element type of \(T\).

Example

The code ct::iota<ct::tile<int, ct::shape<2, 4>>>() produces the following matrix:

\[\begin{split}\begin{pmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \end{pmatrix}\end{split}\]
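The row major arrangement can be made concrete with a small host-side model. The sketch below is plain standard C++, not the cuda::tiles API; the names iota_row_major and at are hypothetical, and it assumes a rank 2 tile stored in a flat buffer.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side model: a Rows x Cols tile stored in row-major order,
// filled with the sequence 0, 1, ..., Rows * Cols - 1.
template <std::size_t Rows, std::size_t Cols>
std::array<int, Rows * Cols> iota_row_major() {
    std::array<int, Rows * Cols> out{};
    for (std::size_t r = 0; r < Rows; ++r)
        for (std::size_t c = 0; c < Cols; ++c)
            out[r * Cols + c] = static_cast<int>(r * Cols + c);
    return out;
}

// Element at (r, c) of a row-major Rows x Cols buffer.
template <std::size_t Rows, std::size_t Cols>
int at(const std::array<int, Rows * Cols>& t, std::size_t r, std::size_t c) {
    return t[r * Cols + c];
}
```

With this model, at<2, 4>(iota_row_major<2, 4>(), 1, 0) reads the first element of the second row of the matrix above.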

cuda::tiles::ones

template<ct::numeric_tile T>
__tile__ T ones() noexcept;

Yields a numeric tile object of type \(T\) whose elements all have value \(1\), or true when the element type is bool.

Example

The code ct::ones<ct::tile<int, ct::shape<2, 2>>>() produces the following matrix:

\[\begin{split}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\end{split}\]

cuda::tiles::zeros

template<ct::numeric_tile T>
__tile__ T zeros() noexcept;

Yields a numeric tile object of type \(T\) whose elements all have value \(0\), or false when the element type is bool.

When \(T\) is a basic floating point tile, the sign bit of each element of the result shall be clear, so every element is positive zero.

Example

The code ct::zeros<ct::tile<double, ct::shape<2, 2>>>() produces the following matrix:

\[\begin{split}\begin{pmatrix} 0.0 & 0.0 \\ 0.0 & 0.0 \end{pmatrix}\end{split}\]

cuda::tiles::isinf

template<ct::basic_floating_point_tile T>
__tile__ ct::tile_with_element_t<T, bool> isinf(T x);

Performs an elementwise check for infinities in x.

For each element \(a\) of x, the result is true if \(a\) is positive or negative infinity and false otherwise.

Example

The following code checks for infinite values in x:

namespace ct = ::cuda::tiles;
using i32x4 = ct::tile<int, ct::shape<4>>;

float xData[4] = {
  1.0  / 0.0,  // positive inf
  -1.0 / 0.0,  // negative inf
  0.0  / 0.0,  // NaN
  1.0,         // finite
};

auto x = ct::load(&xData[0] + ct::iota<i32x4>());
auto r = ct::isinf(x);
\[\begin{split}\begin{pmatrix} \infty \\ -\infty \\ \text{NaN} \\ 1.0 \end{pmatrix} \rightarrow \begin{pmatrix} \text{true} \\ \text{true} \\ \text{false} \\ \text{false} \end{pmatrix}\end{split}\]

cuda::tiles::isnan

template<ct::basic_floating_point_tile T>
__tile__ ct::tile_with_element_t<T, bool> isnan(T x);

Performs an elementwise check for NaN values in x.

For each element \(a\) of x, the result is true if \(a\) is not a number and false otherwise.

Example

The following code checks for NaN values in x:

namespace ct = ::cuda::tiles;
using i32x4 = ct::tile<int, ct::shape<4>>;

float xData[4] = {
  1.0  / 0.0,  // positive inf
  -1.0 / 0.0,  // negative inf
  0.0  / 0.0,  // NaN
  1.0,         // finite
};

auto x = ct::load(&xData[0] + ct::iota<i32x4>());
auto r = ct::isnan(x);
\[\begin{split}\begin{pmatrix} \infty \\ -\infty \\ \text{NaN} \\ 1.0 \end{pmatrix} \rightarrow \begin{pmatrix} \text{false} \\ \text{false} \\ \text{true} \\ \text{false} \end{pmatrix}\end{split}\]
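The two elementwise checks can be mirrored on the host with the standard library classification functions. This is a sketch in plain C++, not the cuda::tiles API; isinf_each, isnan_each and sample_input are hypothetical names, and the input is built with std::numeric_limits rather than division by zero.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <limits>

// Elementwise infinity check over a fixed-size array.
template <std::size_t N>
std::array<bool, N> isinf_each(const std::array<float, N>& x) {
    std::array<bool, N> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = std::isinf(x[i]);
    return r;
}

// Elementwise NaN check over a fixed-size array.
template <std::size_t N>
std::array<bool, N> isnan_each(const std::array<float, N>& x) {
    std::array<bool, N> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = std::isnan(x[i]);
    return r;
}

// Same four values as the examples: +inf, -inf, NaN, 1.0.
std::array<float, 4> sample_input() {
    using lim = std::numeric_limits<float>;
    return {lim::infinity(), -lim::infinity(), lim::quiet_NaN(), 1.0f};
}
```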

cuda::tiles::reshape

template<ct::tile_shape Shape, ct::tile_like Tile>
requires (ct::tile_size_v<Tile> == ct::shape_size_v<Shape>)
__tile__ ct::tile<ct::tile_element_t<Tile>, remove-cv-t<Shape>> reshape(Tile x, Shape = {}) noexcept;

Reshapes argument x to match the shape Shape.

The result is a ct::tile object of shape Shape whose row major arrangement of elements matches the row major arrangement of x.

Example

The following example reshapes a \(2 \times 4\) matrix into a \(4 \times 2\) matrix:

namespace ct = ::cuda::tiles;
using namespace ct::literals;

auto x = ct::iota<ct::tile<int, ct::shape<2, 4>>>();
auto y = ct::reshape(x, ct::shape{4_ic, 2_ic});
\[\begin{split}\begin{pmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \end{pmatrix} \rightarrow \begin{pmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \\ 6 & 7 \end{pmatrix}\end{split}\]
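Because the row major arrangement is preserved, an element keeps its linear index and only its coordinates change. The following host-side sketch (plain C++, hypothetical helper names, rank 2 only) computes where an element lands after a reshape.

```cpp
#include <cstddef>
#include <utility>

// Row-major linear index of (row, col) in a tile with `cols` columns.
std::size_t flatten(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}

// Coordinates (row, col) of linear index k in a tile with `cols` columns.
std::pair<std::size_t, std::size_t> unflatten(std::size_t k, std::size_t cols) {
    return {k / cols, k % cols};
}

// Where element (row, col) of a tile with old_cols columns lands after a
// reshape to a tile with new_cols columns (same total size assumed).
std::pair<std::size_t, std::size_t> reshape_position(std::size_t row, std::size_t col,
                                                     std::size_t old_cols,
                                                     std::size_t new_cols) {
    return unflatten(flatten(row, col, old_cols), new_cols);
}
```

For the example above, the element 4 at position (1, 0) of the \(2 \times 4\) matrix moves to position (2, 0) of the \(4 \times 2\) matrix.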

cuda::tiles::tile_permutation_t

template<ct::tile_like T, typename Map>
requires /* atomic constraint */
using tile_permutation_t = /* see below */

Yields the type formed by permuting the dimensions of \(T\) according to the ct::dimension_map Map.

If \(T\) has rank less than \(2\), the result type is \(T\). Otherwise, the result \(U\) is a specialization of ct::tile satisfying the following:

  1. The element type and rank of \(U\) matches that of \(T\).

  2. Let \(P_i\) be Map::mapping(i), denoting the permuted dimension at index \(i\). The length of \(U\) at \(i\) is the length of the permuted dimension of \(T\): \(U_i = T_{P_i}\).

The atomic constraint validates that:

  1. Map is a (possibly cv-qualified) specialization of ct::dimension_map.

  2. The rank of \(T\) is equal to Map::rank().

Examples

Example 1

Tile

int

Map

ct::dimension_map<>

Result

int

Example 2

Tile

ct::tile<int, ct::shape<>>

Map

ct::dimension_map<>

Result

ct::tile<int, ct::shape<>>

Example 3

Tile

ct::tile<int, ct::shape<4, 2, 16, 8>>

Map

ct::dimension_map<2, 1, 3, 0>

Result

ct::tile<int, ct::shape<16, 2, 8, 4>>
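The shape rule \(U_i = T_{P_i}\) can be checked with a few lines of host code. This is a sketch in plain C++ that models shapes and maps as vectors of lengths; permuted_shape is a hypothetical name, and the sketch assumes the map has the same rank as the shape.

```cpp
#include <cstddef>
#include <vector>

// Shape obtained by permuting `shape` with `map`: out[i] = shape[map[i]].
// Shapes of rank less than 2 are returned unchanged, as in the text above.
std::vector<std::size_t> permuted_shape(const std::vector<std::size_t>& shape,
                                        const std::vector<std::size_t>& map) {
    if (shape.size() < 2) return shape;
    std::vector<std::size_t> out(shape.size());
    for (std::size_t i = 0; i < shape.size(); ++i) out[i] = shape[map[i]];
    return out;
}
```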

cuda::tiles::permute

template<typename Map, ct::tile_like Tile>
requires /* atomic constraint */
__tile__ ct::tile_permutation_t<Tile, Map> permute(Tile in, Map = {}) noexcept;

Yields the permutation of in according to the zero-based permutation vector described by the provided ct::dimension_map.

The atomic constraint validates that:

  1. Map is a specialization of ct::dimension_map.

  2. The rank of Tile is Map::rank().

Example

The following code permutes a \(4 \times 2 \times 2\) tile to a \(2 \times 4 \times 2\) tile according to the dimension map ct::dimension_map<2, 0, 1>:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto x = ct::iota<ct::tile<int, ct::shape<4, 2, 2>>>();
auto r = ct::permute(x, ct::dimension_map{2_ic, 0_ic, 1_ic});
\[\begin{split}\begin{pmatrix} \begin{pmatrix} 0 & 1 \end{pmatrix} & \begin{pmatrix} 2 & 3 \end{pmatrix} \\ \begin{pmatrix} 4 & 5 \end{pmatrix} & \begin{pmatrix} 6 & 7 \end{pmatrix} \\ \begin{pmatrix} 8 & 9 \end{pmatrix} & \begin{pmatrix} 10 & 11 \end{pmatrix} \\ \begin{pmatrix} 12 & 13 \end{pmatrix} & \begin{pmatrix} 14 & 15 \end{pmatrix} \\ \end{pmatrix} \rightarrow \begin{pmatrix} \begin{pmatrix} 0 & 2 \end{pmatrix} & \begin{pmatrix} 4 & 6 \end{pmatrix} & \begin{pmatrix} 8 & 10 \end{pmatrix} & \begin{pmatrix} 12 & 14 \end{pmatrix} \\ \begin{pmatrix} 1 & 3 \end{pmatrix} & \begin{pmatrix} 5 & 7 \end{pmatrix} & \begin{pmatrix} 9 & 11 \end{pmatrix} & \begin{pmatrix} 13 & 15 \end{pmatrix} \\ \end{pmatrix}\end{split}\]

The behavior of the permutation is described precisely below:

tile projection

Let \(a\) be a tile like object of rank \(N\), shape \(S\) and index space \(I = [0, S_0) \times [0, S_1) \times ... \times [0, S_{N - 1})\).

We may interpret \(a\) as a function \(a : I \rightarrow \mathbb{E}\) describing the association of indices to elements \(\mathbb{E}\) .

The projection of \(a\) along dimension \(0 \leq d < N\) at index \(i \in [0, S_d)\) is a new tile like object of index space \(I_p = [0, S_0) \times [0, S_1) \times ... \times [0, S_{d - 1}) \times [0, 1) \times [0, S_{d + 1}) \times ... \times [0, S_{N - 1})\):

\[\mathrm{Proj}(a, i, d) : I_p \rightarrow \mathbb{E}\]

whose values correspond to \(a\) when dimension \(d\) is fixed at index \(i\):

\[\mathrm{Proj}(a, i, d)(x_0, x_1, \ldots, x_{d-1}, 0, x_{d+1}, \ldots, x_{N-1}) = a(x_0, x_1, \ldots, x_{d-1}, i, x_{d+1}, \ldots, x_{N-1})\]

iterated tile projection

The iterated projection of \(a\) for indices \((i_0, i_1, ..., i_{N-1}) \in I\) at dimensions \((d_0, d_1, ..., d_{N-1})\) is formed by the repeated application of \(\mathrm{Proj}\):

\[\mathrm{Proj}(... \mathrm{Proj}(\mathrm{Proj}(a, i_0, d_0), i_1, d_1), ..., i_{N-1}, d_{N-1})\]

The iterated projection of \(a\) is a singleton tile whose value is the value of \(a\) at index \(i_0\) along dimension \(d_0\), \(i_1\) along \(d_1\), and so on.

tile permutation

Let \(P : [0, N) \rightarrow [0, N)\) be the bijective function describing a permutation of dimensions.

The permutation \(r\) of \(a\) under \(P\) is the unique tile like object whose value at indices \((i_0, i_1, ..., i_{N-1})\) is the value of the iterated projection of \(a\) for indices \((i_0, i_1, ..., i_{N-1})\) at dimensions \((P(0), P(1), ..., P(N-1))\).
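Read operationally, the definition says that result index \(i_k\) selects along source dimension \(P(k)\). A minimal host-side sketch for rank 3, in plain C++ with hypothetical names (Index3, linear, permute3, iota_vec) and a flat row major buffer standing in for the tile:

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

using Index3 = std::array<std::size_t, 3>;

// Row-major linear index of idx within a rank 3 shape.
std::size_t linear(const Index3& idx, const Index3& shape) {
    return (idx[0] * shape[1] + idx[1]) * shape[2] + idx[2];
}

// Reference permutation: r(i0, i1, i2) = a(src) where src[p[k]] = i_k.
std::vector<int> permute3(const std::vector<int>& a, const Index3& shape, const Index3& p) {
    const Index3 out_shape{shape[p[0]], shape[p[1]], shape[p[2]]};
    std::vector<int> r(a.size());
    Index3 i{};
    for (i[0] = 0; i[0] < out_shape[0]; ++i[0])
        for (i[1] = 0; i[1] < out_shape[1]; ++i[1])
            for (i[2] = 0; i[2] < out_shape[2]; ++i[2]) {
                Index3 src{};
                for (std::size_t k = 0; k < 3; ++k) src[p[k]] = i[k];
                r[linear(i, out_shape)] = a[linear(src, shape)];
            }
    return r;
}

// Flat buffer holding 0, 1, ..., n - 1.
std::vector<int> iota_vec(std::size_t n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    return v;
}
```

Applied to a \(4 \times 2 \times 2\) iota buffer with the map (2, 0, 1), this model reproduces the \(2 \times 4 \times 2\) result shown in the ct::permute example.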

cuda::tiles::tile_transpose_t

template<ct::tile_like T>
using tile_transpose_t = /* see below */;

Yields the result type when interchanging the first two dimensions of an object of type \(T\).

If \(T\) has rank less than \(2\), the resulting type is \(T\). Otherwise, the resulting type \(U\) is a specialization of ct::tile satisfying the following:

  1. The element type and rank of \(U\) matches that of \(T\).

  2. The lengths of \(U\) satisfy:

    1. \(U_0 = T_1\)

    2. \(U_1 = T_0\)

    3. \(U_i = T_i\) for each \(2 \leq i < N\) where \(N\) is the rank of \(T\).

Examples

Example 1

Tile

int

Result

int

Example 2

Tile

ct::tile<int, ct::shape<>>

Result

ct::tile<int, ct::shape<>>

Example 3

Tile

ct::tile<int, ct::shape<4, 2, 16, 8>>

Result

ct::tile<int, ct::shape<2, 4, 16, 8>>

cuda::tiles::transpose

template<ct::tile_like Tile>
__tile__ ct::tile_transpose_t<Tile> transpose(Tile in) noexcept;

Interchanges the first two dimensions of in.

If the rank of in is less than \(2\), in is returned unmodified. Otherwise, the resulting tile is produced as if by invoking:

ct::permute(in, ct::dimension_map<1, 0, 2, 3, ..., N-1>{});

where \(N\) is the rank of in.

Example

The following code transposes a \(4 \times 2 \times 2\) tile to a \(2 \times 4 \times 2\) tile:

namespace ct = ::cuda::tiles;
auto x = ct::iota<ct::tile<int, ct::shape<4, 2, 2>>>();
auto r = ct::transpose(x);
\[\begin{split}\begin{pmatrix} \begin{pmatrix} 0 & 1 \end{pmatrix} & \begin{pmatrix} 2 & 3 \end{pmatrix} \\ \begin{pmatrix} 4 & 5 \end{pmatrix} & \begin{pmatrix} 6 & 7 \end{pmatrix} \\ \begin{pmatrix} 8 & 9 \end{pmatrix} & \begin{pmatrix} 10 & 11 \end{pmatrix} \\ \begin{pmatrix} 12 & 13 \end{pmatrix} & \begin{pmatrix} 14 & 15 \end{pmatrix} \\ \end{pmatrix} \rightarrow \begin{pmatrix} \begin{pmatrix} 0 & 1 \end{pmatrix} & \begin{pmatrix} 4 & 5 \end{pmatrix} & \begin{pmatrix} 8 & 9 \end{pmatrix} & \begin{pmatrix} 12 & 13 \end{pmatrix} \\ \begin{pmatrix} 2 & 3 \end{pmatrix} & \begin{pmatrix} 6 & 7 \end{pmatrix} & \begin{pmatrix} 10 & 11 \end{pmatrix} & \begin{pmatrix} 14 & 15 \end{pmatrix} \\ \end{pmatrix}\end{split}\]
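For a rank 3 tile the transpose reduces to \(r(b, a, c) = x(a, b, c)\). The sketch below models this on the host with a flat row major buffer; it is plain C++ rather than the cuda::tiles API, and transpose3 and iota_vec are hypothetical names.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Swap the first two dimensions of an s0 x s1 x s2 row-major buffer.
// The result has shape s1 x s0 x s2 and satisfies r(b, a, c) = x(a, b, c).
std::vector<int> transpose3(const std::vector<int>& x,
                            std::size_t s0, std::size_t s1, std::size_t s2) {
    std::vector<int> r(x.size());
    for (std::size_t a = 0; a < s0; ++a)
        for (std::size_t b = 0; b < s1; ++b)
            for (std::size_t c = 0; c < s2; ++c)
                r[(b * s0 + a) * s2 + c] = x[(a * s1 + b) * s2 + c];
    return r;
}

// Flat buffer holding 0, 1, ..., n - 1.
std::vector<int> iota_vec(std::size_t n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    return v;
}
```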

cuda::tiles::concatenation_compatible

template<typename T, typename U, size_t D>
concept concatenation_compatible = ct::tile_like<T> && ct::tile_like<U> && /* atomic constraint */;

Indicates whether tile like types \(T\) and \(U\) are concatenation compatible along dimension \(D\).

\(T\) and \(U\) are concatenation compatible along \(D\) if:

  1. \(T\) and \(U\) have the same rank and the same element type.

  2. \(0 \leq D < N\) where \(N\) is the rank of \(T\) and \(U\).

  3. The lengths satisfy \(T_i = U_i\) for each \(0 \leq i < N, \quad i \neq D\).

  4. The shape \(S\) formed by \(S_i = T_i\) for \(i \neq D\) and \(S_D = T_D + U_D\) is a tile compatible shape.

Note

Rank 0 tiles are never concatenation compatible.
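The shape conditions can be sketched as a host-side predicate over lengths. This is plain C++ with a hypothetical name (concat_compatible); it models conditions 2 and 3 and the rank half of condition 1 only, and does not model the element type or the tile compatible shape check of condition 4.

```cpp
#include <cstddef>
#include <vector>

// Shape-only model of concatenation compatibility along dimension d.
bool concat_compatible(const std::vector<std::size_t>& t,
                       const std::vector<std::size_t>& u, std::size_t d) {
    if (t.size() != u.size()) return false;  // ranks must match
    if (d >= t.size()) return false;         // d out of range; also rejects rank 0
    for (std::size_t i = 0; i < t.size(); ++i)
        if (i != d && t[i] != u[i]) return false;  // other lengths must agree
    return true;
}
```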

cuda::tiles::concatenation_t

template<ct::tile_like T, ct::tile_like U, size_t Dim>
requires ct::concatenation_compatible<T, U, Dim>
using concatenation_t = /* see below */;

Yields the result of concatenating tile like types \(T\) and \(U\) along dimension \(D\), where \(D\) is the value of the template parameter Dim.

The result type is a specialization of ct::tile whose element type and rank matches that of \(T\) and \(U\) and whose shape \(S\) satisfies:

  1. \(S_i = T_i\) for \(0 \leq i < N, \quad i \neq D\) where \(N\) is the rank of \(T\) and \(U\).

  2. \(S_D = T_D + U_D\)

Examples

Example 1

\(T\)

int

\(U\)

int

Dimension

\(0\)

Result

Incompatible

Example 2

\(T\)

ct::tile<int, ct::shape<2, 4>>

\(U\)

ct::tile<int, ct::shape<2, 4>>

Dimension

\(0\)

Result

ct::tile<int, ct::shape<4, 4>>

Example 3

\(T\)

ct::tile<int, ct::shape<2, 4>>

\(U\)

ct::tile<int, ct::shape<2, 4>>

Dimension

\(1\)

Result

ct::tile<int, ct::shape<2, 8>>

cuda::tiles::cat

template<ct::integral auto D, ct::tile_like T, ct::tile_like U>
requires (D >= 0) && ct::concatenation_compatible<T, U, D>
__tile__ ct::concatenation_t<T, U, D> cat(T x, U y, ct::integral_constant<D> = {}) noexcept;

Concatenates x and y along dimension \(D\).

Let \(N\) be the rank of \(T\) and \(U\). The resulting tile object \(r\) has the values defined below:

\[\begin{split}r(i_0, i_1, ..., i_D, ..., i_{N-1}) = \begin{cases} x(i_0, i_1, ..., i_D, ..., i_{N-1}) & i_D < T_D \\ y(i_0, i_1, ..., i_D - T_D, ..., i_{N-1}) & \text{otherwise} \end{cases}\end{split}\]

Example

The following code concatenates a matrix of zeros with a matrix of ones along the second dimension.

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto x = ct::full<ct::tile<int, ct::shape<4, 2>>>(0);
auto y = ct::full<ct::tile<int, ct::shape<4, 2>>>(1);
auto r = ct::cat(x, y, 1_ic);
\[\begin{split}\begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \end{pmatrix} \quad \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ \end{pmatrix} \rightarrow \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ \end{pmatrix}\end{split}\]
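The piecewise definition of \(r\) can be modeled on the host for the rank 2, \(D = 1\) case. The sketch is plain C++ over flat row major buffers; cat_cols is a hypothetical name and both inputs are assumed to have the same number of rows.

```cpp
#include <cstddef>
#include <vector>

// Concatenate a rows x x_cols buffer with a rows x y_cols buffer along
// dimension 1: r(i, j) = x(i, j) if j < x_cols, else y(i, j - x_cols).
std::vector<int> cat_cols(const std::vector<int>& x, std::size_t x_cols,
                          const std::vector<int>& y, std::size_t y_cols,
                          std::size_t rows) {
    const std::size_t r_cols = x_cols + y_cols;
    std::vector<int> r(rows * r_cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < r_cols; ++j)
            r[i * r_cols + j] = (j < x_cols) ? x[i * x_cols + j]
                                             : y[i * y_cols + (j - x_cols)];
    return r;
}
```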

cuda::tiles::select

template<ct::tile_like T, ct::bool_tile_convertible C>
requires ct::broadcastable_to<C, ct::tile_shape_t<T>>
__tile__ T select(C condition, T lhs, T rhs) noexcept;

Performs an elementwise selection of the values from lhs where condition is true and rhs where condition is false.

condition undergoes bool tile conversion followed by broadcast conversion to match the shape of \(T\).

Let \(a\), \(b\) and \(c\) denote corresponding elements of the converted lhs, rhs and condition arguments. The result of each selection is \(a\) if \(c\) is true and \(b\) otherwise.

Example

In the following example, the elements of the result at even indices are taken from x while the elements at odd indices are taken from y:

namespace ct = ::cuda::tiles;
using i32x4 = ct::tile<int, ct::shape<4>>;
bool cData[4] = {true, false, true, false};

auto c = ct::load(&cData[0] + ct::iota<i32x4>());
auto x = ct::iota<i32x4>();
auto y = -ct::iota<i32x4>();

auto r = ct::select(c, x, y);
\[\begin{split}\begin{pmatrix} \text{true} \\ \text{false} \\ \text{true} \\ \text{false} \end{pmatrix} \quad \begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \end{pmatrix} \quad \begin{pmatrix} 0 \\ -1 \\ -2 \\ -3 \end{pmatrix} \rightarrow \begin{pmatrix} 0 \\ -1 \\ 2 \\ -3 \end{pmatrix}\end{split}\]
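A host-side model of the selection, written in plain C++ with a hypothetical name (select_each). It assumes the condition already has the same shape as the operands, so the broadcast step is not modeled.

```cpp
#include <array>
#include <cstddef>

// Elementwise selection: r[i] = c[i] ? lhs[i] : rhs[i].
template <std::size_t N>
std::array<int, N> select_each(const std::array<bool, N>& c,
                               const std::array<int, N>& lhs,
                               const std::array<int, N>& rhs) {
    std::array<int, N> r{};
    for (std::size_t i = 0; i < N; ++i) r[i] = c[i] ? lhs[i] : rhs[i];
    return r;
}
```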

cuda::tiles::extractable_from

template<typename S, typename T>
concept extractable_from = ct::tile_shape<S> && ct::tile_like<T> && /* see below */

Determines if tile compatible shape \(S\) is an extractable shape of \(T\). \(S\) is extractable from \(T\) if:

  1. \(S\) and \(T\) have the same rank denoted \(N\).

  2. For each \(i\) in \(0 \leq i < N\), \(T_i\) is divisible by \(S_i\).

Examples

Example 1

\(T\)

int

\(S\)

ct::shape<>

Extractable

Yes

Example 2

\(T\)

ct::tile<int, ct::shape<32, 8>>

\(S\)

ct::shape<16, 2>

Extractable

Yes

Example 3

\(T\)

ct::tile<int, ct::shape<32, 8>>

\(S\)

ct::shape<2, 16>

Extractable

No
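The two conditions amount to a rank check followed by a per-dimension divisibility check. A host-side sketch in plain C++, modeling shapes as vectors of lengths (extractable is a hypothetical name, and a scalar is modeled as the empty shape):

```cpp
#include <cstddef>
#include <vector>

// True if shape s is extractable from a tile of shape t:
// same rank, and every t[i] is divisible by s[i].
bool extractable(const std::vector<std::size_t>& s,
                 const std::vector<std::size_t>& t) {
    if (s.size() != t.size()) return false;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] == 0 || t[i] % s[i] != 0) return false;
    return true;
}
```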

cuda::tiles::extract

template<ct::tile_shape S, ct::tile_like T, typename ...Indices>
requires ct::extractable_from<S, T> && /* atomic constraint */
__tile__ ct::tile<ct::tile_element_t<T>, S> extract(T x, S, Indices... idx) noexcept;

Extracts a partition of shape \(S\) from tile like object x.

The elements of x are partitioned into equally sized sections of shape \(S\) and each partition is assigned a zero-based index. The idx argument specifies the index of the partition that will be returned.

Let \(N\) be the rank of \(T\) and let \(i_0, i_1, \ldots, i_{N-1}\) be the values of the idx pack after conversion to size_t. The resulting extraction \(r\) has the following value:

\[r(k_0, k_1, \ldots, k_{N-1}) = x(i_0 \cdot S_0 + k_0, i_1 \cdot S_1 + k_1, \ldots, i_{N-1} \cdot S_{N-1} + k_{N-1})\]

If the index \(I = (i_0 \cdot S_0, \ldots, i_{N-1} \cdot S_{N-1})\) is not in the index space of x, the behavior is undefined.

The atomic constraint validates that:

  1. The size of the parameter pack Indices is \(N\)

  2. For each type \(U\) in the parameter pack Indices, is-convertible-v<U, size_t> holds.

Example

In the following code, the partition in the top-right quadrant of the matrix is extracted:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto x = ct::iota<ct::tile<int, ct::shape<4, 4>>>();
auto r = ct::extract(x, ct::shape{2_ic, 2_ic}, 0, 1);
\[\begin{split}\begin{pmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \\ 8 & 9 & 10 & 11 \\ 12 & 13 & 14 & 15 \end{pmatrix} \rightarrow \begin{pmatrix} 2 & 3 \\ 6 & 7 \end{pmatrix}\end{split}\]
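The extraction formula can be modeled on the host for rank 2. The sketch is plain C++ over a flat row major buffer; extract2 and iota_vec are hypothetical names, and the caller is assumed to pass partition indices that lie inside the source.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Extract partition (i0, i1) of shape s0 x s1 from a row-major buffer with
// x_cols columns: r(k0, k1) = x(i0 * s0 + k0, i1 * s1 + k1).
std::vector<int> extract2(const std::vector<int>& x, std::size_t x_cols,
                          std::size_t s0, std::size_t s1,
                          std::size_t i0, std::size_t i1) {
    std::vector<int> r(s0 * s1);
    for (std::size_t k0 = 0; k0 < s0; ++k0)
        for (std::size_t k1 = 0; k1 < s1; ++k1)
            r[k0 * s1 + k1] = x[(i0 * s0 + k0) * x_cols + (i1 * s1 + k1)];
    return r;
}

// Flat buffer holding 0, 1, ..., n - 1.
std::vector<int> iota_vec(std::size_t n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    return v;
}
```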

cuda::tiles::broadcast

template<ct::tile_shape S, ct::tile_like T>
requires ct::broadcastable_to<T, S>
__tile__ ct::tile<ct::tile_element_t<T>, S> broadcast(T x, S = {}) noexcept;

Yields the broadcast conversion of x to shape \(S\).

Example

The following code broadcasts a \(4 \times 1\) tile to a \(4 \times 4\) tile:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto x = ct::iota<ct::tile<int, ct::shape<4, 1>>>();
auto r = ct::broadcast(x, ct::shape{4_ic, 4_ic});
\[\begin{split}\begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \end{pmatrix} \rightarrow \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 2 & 2 & 2 & 2 \\ 3 & 3 & 3 & 3 \end{pmatrix}\end{split}\]
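Broadcast conversion is defined elsewhere in the documentation; the sketch below only illustrates the behavior seen in this example, under the assumption that a source dimension of length 1 is replicated along the target dimension. It is plain C++ for rank 2, and broadcast2 is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Broadcast an sr x sc row-major buffer to tr x tc. Each source length is
// assumed to equal the target length or to be 1 (replicated).
std::vector<int> broadcast2(const std::vector<int>& x,
                            std::size_t sr, std::size_t sc,
                            std::size_t tr, std::size_t tc) {
    std::vector<int> r(tr * tc);
    for (std::size_t i = 0; i < tr; ++i)
        for (std::size_t j = 0; j < tc; ++j) {
            const std::size_t si = (sr == 1) ? 0 : i;  // replicate along rows
            const std::size_t sj = (sc == 1) ? 0 : j;  // replicate along columns
            r[i * tc + j] = x[si * sc + sj];
        }
    return r;
}
```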

cuda::tiles::element_cast

template<ct::scalar E, ct::tile_like T>
requires ct::tile_convertible_to<T, ct::tile_with_element_t<T, E>>
__tile__ ct::tile_with_element_t<T, E> element_cast(T x) noexcept;

Yields the tile conversion of x to the type ct::tile_with_element_t<T, E>.

Example

The following code converts a tile of integers to a tile of doubles:

namespace ct = ::cuda::tiles;
auto x = ct::iota<ct::tile<int, ct::shape<4, 1>>>();
auto r = ct::element_cast<double>(x);
\[\begin{split}\begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \end{pmatrix} \rightarrow \begin{pmatrix} 0.0 \\ 1.0 \\ 2.0 \\ 3.0 \end{pmatrix}\end{split}\]

cuda::tiles::element_bitcast

template<ct::scalar E, ct::tile_like T>
requires /* atomic constraint */
__tile__ ct::tile_with_element_t<T, E> element_bitcast(T x) noexcept;

Yields the elementwise bitcast of x to the type \(E\).

For each element \(a\) in x, the result of the computation is produced as if by executing std::bit_cast<E>(a) [1].

The atomic constraint validates that the object size of \(E\) is the same as the object size of the element type of \(T\).

Note

The result of this operation depends on the value and object representations of the source and result types, and may result in undefined behavior; see § 22.15.3 of ISO/IEC 14882:2024 for details.

Example

namespace ct = ::cuda::tiles;
auto x = ct::full<ct::tile<unsigned char, ct::shape<4, 1>>>(255);
auto r = ct::element_bitcast<signed char>(x);
\[\begin{split}\begin{pmatrix} 255 \\ 255 \\ 255 \\ 255 \end{pmatrix} \rightarrow \begin{pmatrix} -1 \\ -1 \\ -1 \\ -1 \end{pmatrix}\end{split}\]
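The per-element behavior can be reproduced on the host by copying the object representation. The sketch uses std::memcpy so that it also compiles before C++20; bit_copy and float_bits are hypothetical names, and the float example assumes an IEEE 754 binary32 representation.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reinterpret the object representation of `from` as a value of type To.
// Mirrors the size requirement stated by the atomic constraint above.
template <class To, class From>
To bit_copy(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "object sizes must match");
    static_assert(std::is_trivially_copyable<To>::value &&
                      std::is_trivially_copyable<From>::value,
                  "both types must be trivially copyable");
    To to;
    std::memcpy(&to, &from, sizeof(To));
    return to;
}

// Bit pattern of a float as a 32-bit unsigned integer.
std::uint32_t float_bits(float f) {
    return bit_copy<std::uint32_t>(f);
}
```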

Footnotes

[1] See Function template bit_cast [bit.cast], § 22.15.3 of ISO/IEC 14882:2024.