Tile Operations

This section documents various operations for manipulating tile like objects.

cuda::tiles::full

template<ct::tile_like T> __tile__ T full(ct::tile_element_t<T> x) noexcept;

Yields a tile like object of type \(T\) whose elements all have value x.

Example

The code ct::full<ct::tile<int, ct::shape<2, 2>>>(42) produces the following matrix:

\[\begin{split}\begin{pmatrix} 42 & 42 \\ 42 & 42 \end{pmatrix}\end{split}\]

cuda::tiles::iota

template<ct::integral_tile T> requires /* atomic constraint */ __tile__ T iota() noexcept;

Yields an integral tile of type \(T\) whose row major arrangement is the sequence \((0, 1, ..., N - 1)\) where \(N\) is the tile size of \(T\).

The atomic constraint validates that \(N - 1\) is representable in the element type of \(T\).

Example

The code ct::iota<ct::tile<int, ct::shape<2, 4>>>() produces the following matrix:

\[\begin{split}\begin{pmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \end{pmatrix}\end{split}\]
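The row-major ordering and the representability constraint can be modeled on the host in plain C++. The sketch below is an illustration only, not the cuda::tiles implementation: iota_model and iota_example are hypothetical names, and the model is restricted to rank-2 tiles of int.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical host-side model of ct::iota for a Rows x Cols tile of int.
template <std::size_t Rows, std::size_t Cols>
std::array<std::array<int, Cols>, Rows> iota_model() {
    // Mirrors the documented constraint: N - 1 must fit in the element type.
    static_assert(Rows * Cols >= 1, "tile must have at least one element");
    static_assert(Rows * Cols - 1 <=
                      static_cast<std::size_t>(std::numeric_limits<int>::max()),
                  "N - 1 is not representable in int");
    std::array<std::array<int, Cols>, Rows> out{};
    for (std::size_t r = 0; r < Rows; ++r) {
        for (std::size_t c = 0; c < Cols; ++c) {
            // Row-major position of element (r, c).
            out[r][c] = static_cast<int>(r * Cols + c);
        }
    }
    return out;
}

// Usage: reproduces the corners of the 2 x 4 example above.
inline void iota_example() {
    const auto m = iota_model<2, 4>();
    assert(m[0][0] == 0 && m[0][3] == 3);
    assert(m[1][0] == 4 && m[1][3] == 7);
}
```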

cuda::tiles::ones

template<ct::numeric_tile T> __tile__ T ones() noexcept;

Yields a numeric tile object of type \(T\) whose elements all have value \(1\) or true in the case of bool elements.

Example

The code ct::ones<ct::tile<int, ct::shape<2, 2>>>() produces the following matrix:

\[\begin{split}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\end{split}\]

cuda::tiles::zeros

template<ct::numeric_tile T> __tile__ T zeros() noexcept;

Yields a numeric tile object of type \(T\) whose elements all have value \(0\) or false in the case of bool elements.

When \(T\) is a basic floating point tile, each element of the result is positive zero, that is, its sign bit is clear.

Example

The code ct::zeros<ct::tile<double, ct::shape<2, 2>>>() produces the following matrix:

\[\begin{split}\begin{pmatrix} 0.0 & 0.0 \\ 0.0 & 0.0 \end{pmatrix}\end{split}\]

cuda::tiles::isinf

template<ct::basic_floating_point_tile T> __tile__ ct::tile_with_element_t<T, bool> isinf(T x);

Performs an elementwise check for infinities in x.

For each element \(a\) of x, the result is true if \(a\) is positive or negative infinity and false otherwise.

Example

The following code checks for infinite values in x:

namespace ct = ::cuda::tiles;
using i32x4 = ct::tile<int, ct::shape<4>>;

float xData[4] = {
  1.0  / 0.0,  // positive inf
  -1.0 / 0.0,  // negative inf
  0.0  / 0.0,  // NaN
  1.0,         // finite
};

auto x = ct::load(&xData[0] + ct::iota<i32x4>());
auto r = ct::isinf(x);

\[\begin{split}\begin{pmatrix} \infty \\ -\infty \\ \text{NaN} \\ 1.0 \end{pmatrix} \rightarrow \begin{pmatrix} \text{true} \\ \text{true} \\ \text{false} \\ \text{false} \end{pmatrix}\end{split}\]

cuda::tiles::isnan

template<ct::basic_floating_point_tile T> __tile__ ct::tile_with_element_t<T, bool> isnan(T x);

Performs an elementwise check for NaN values in x.

For each element \(a\) of x, the result is true if \(a\) is not a number and false otherwise.

Example

The following code checks for NaN values in x:

namespace ct = ::cuda::tiles;
using i32x4 = ct::tile<int, ct::shape<4>>;

float xData[4] = {
  1.0  / 0.0,  // positive inf
  -1.0 / 0.0,  // negative inf
  0.0  / 0.0,  // NaN
  1.0,         // finite
};

auto x = ct::load(&xData[0] + ct::iota<i32x4>());
auto r = ct::isnan(x);

\[\begin{split}\begin{pmatrix} \infty \\ -\infty \\ \text{NaN} \\ 1.0 \end{pmatrix} \rightarrow \begin{pmatrix} \text{false} \\ \text{false} \\ \text{true} \\ \text{false} \end{pmatrix}\end{split}\]
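The two elementwise classifications can be compared side by side on the host. This is a minimal sketch in plain C++ that assumes the per-element results agree with std::isinf and std::isnan; isinf_model, isnan_model and classification_example are hypothetical names, not part of cuda::tiles.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical host-side model of ct::isinf on a 1D tile of float.
template <std::size_t N>
std::array<bool, N> isinf_model(const std::array<float, N>& x) {
    std::array<bool, N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        r[i] = std::isinf(x[i]);  // true for +inf and -inf only
    }
    return r;
}

// Hypothetical host-side model of ct::isnan on a 1D tile of float.
template <std::size_t N>
std::array<bool, N> isnan_model(const std::array<float, N>& x) {
    std::array<bool, N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        r[i] = std::isnan(x[i]);  // true for NaN only
    }
    return r;
}

// Usage: the four-element input from the examples above.
inline void classification_example() {
    const float inf = std::numeric_limits<float>::infinity();
    const float qnan = std::numeric_limits<float>::quiet_NaN();
    const std::array<float, 4> x = {{inf, -inf, qnan, 1.0f}};
    const auto is_inf = isinf_model(x);
    const auto is_nan = isnan_model(x);
    assert(is_inf[0] && is_inf[1] && !is_inf[2] && !is_inf[3]);
    assert(!is_nan[0] && !is_nan[1] && is_nan[2] && !is_nan[3]);
}
```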

cuda::tiles::reshape

template<ct::tile_shape Shape, ct::tile_like Tile> requires (ct::tile_size_v<Tile> == ct::shape_size_v<Shape>) __tile__ ct::tile<ct::tile_element_t<Tile>, remove-cv-t<Shape>> reshape(Tile x, Shape = {}) noexcept;

Reshapes argument x to match the shape Shape.

The result is a ct::tile object of shape Shape whose row major arrangement of elements matches the row major arrangement of x.

Example

The following example reshapes a \(2 \times 4\) matrix into a \(4 \times 2\) matrix:

namespace ct = ::cuda::tiles;
using namespace ct::literals;

auto x = ct::iota<ct::tile<int, ct::shape<2, 4>>>();
auto y = ct::reshape(x, ct::shape{4_ic, 2_ic});

\[\begin{split}\begin{pmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \end{pmatrix} \rightarrow \begin{pmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \\ 6 & 7 \end{pmatrix}\end{split}\]
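Because both tiles share one row-major arrangement, a reshape can be modeled by walking a single linear position through both shapes. The sketch below is plain host C++ for rank-2 tiles of int; reshape_model and reshape_example are hypothetical names, not the library's API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of ct::reshape from R1 x C1 to R2 x C2.
template <std::size_t R2, std::size_t C2, std::size_t R1, std::size_t C1>
std::array<std::array<int, C2>, R2>
reshape_model(const std::array<std::array<int, C1>, R1>& x) {
    // Mirrors the requires-clause: both shapes hold the same element count.
    static_assert(R1 * C1 == R2 * C2, "tile sizes must match");
    std::array<std::array<int, C2>, R2> out{};
    for (std::size_t k = 0; k < R1 * C1; ++k) {
        // k is the shared row-major position in the source and the result.
        out[k / C2][k % C2] = x[k / C1][k % C1];
    }
    return out;
}

// Usage: the 2 x 4 to 4 x 2 example above.
inline void reshape_example() {
    const std::array<std::array<int, 4>, 2> x = {{{{0, 1, 2, 3}}, {{4, 5, 6, 7}}}};
    const auto y = reshape_model<4, 2>(x);
    assert(y[0][0] == 0 && y[0][1] == 1);
    assert(y[3][0] == 6 && y[3][1] == 7);
}
```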

cuda::tiles::tile_permutation_t

template<ct::tile_like T, typename Map> requires /* atomic constraint */ using tile_permutation_t = /* see below */;

Yields the type formed by permuting the dimensions of \(T\) according to the ct::dimension_map Map.

If \(T\) has rank less than \(2\), the result type is \(T\). Otherwise, the result \(U\) is a specialization of ct::tile satisfying the following:

The element type and rank of \(U\) match those of \(T\).
Let \(P_i\) be Map::mapping(i), denoting the permuted dimension at index \(i\). The length of \(U\) at \(i\) is the length of the permuted dimension of \(T\): \(U_i = T_{P_i}\).

The atomic constraint validates that:

Map is a (possibly cv-qualified) specialization of ct::dimension_map.
The rank of \(T\) is equal to Map::rank().

Examples

Example 1	Tile	`int`
	Map	`ct::dimension_map<>`
	Result	`int`
Example 2	Tile	`ct::tile<int, ct::shape<>>`
	Map	`ct::dimension_map<>`
	Result	`ct::tile<int, ct::shape<>>`
Example 3	Tile	`ct::tile<int, ct::shape<4, 2, 16, 8>>`
	Map	`ct::dimension_map<2, 1, 3, 0>`
	Result	`ct::tile<int, ct::shape<16, 2, 8, 4>>`
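The shape rule \(U_i = T_{P_i}\) can be checked with a few lines of host code. In this sketch shapes and maps are represented as plain std::array values rather than ct::shape and ct::dimension_map types; permuted_shape is a hypothetical helper.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of the shape computed by tile_permutation_t:
// t holds the lengths of T, p holds the values Map::mapping(i).
template <std::size_t N>
std::array<std::size_t, N> permuted_shape(const std::array<std::size_t, N>& t,
                                          const std::array<std::size_t, N>& p) {
    std::array<std::size_t, N> u{};
    for (std::size_t i = 0; i < N; ++i) {
        assert(p[i] < N);  // every mapped dimension must exist in T
        u[i] = t[p[i]];    // U_i = T_{P_i}
    }
    return u;
}
```

Applied to Example 3, the lengths {4, 2, 16, 8} with the map {2, 1, 3, 0} give {16, 2, 8, 4}.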

cuda::tiles::permute

template<typename Map, ct::tile_like Tile> requires /* atomic constraint */ __tile__ ct::tile_permutation_t<Tile, Map> permute(Tile in, Map = {}) noexcept;

Yields the permutation of in according to the zero-based permutation vector described by the provided ct::dimension_map.

The atomic constraint validates that:

Map is a specialization of ct::dimension_map.
The rank of Tile is Map::rank().

Example

The following code permutes a \(4 \times 2 \times 2\) tile to a \(2 \times 4 \times 2\) tile according to the dimension map ct::dimension_map<2, 0, 1>:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto x = ct::iota<ct::tile<int, ct::shape<4, 2, 2>>>();
auto r = ct::permute(x, ct::dimension_map{2_ic, 0_ic, 1_ic});

\[\begin{split}\begin{pmatrix} \begin{pmatrix} 0 & 1 \end{pmatrix} & \begin{pmatrix} 2 & 3 \end{pmatrix} \\ \begin{pmatrix} 4 & 5 \end{pmatrix} & \begin{pmatrix} 6 & 7 \end{pmatrix} \\ \begin{pmatrix} 8 & 9 \end{pmatrix} & \begin{pmatrix} 10 & 11 \end{pmatrix} \\ \begin{pmatrix} 12 & 13 \end{pmatrix} & \begin{pmatrix} 14 & 15 \end{pmatrix} \\ \end{pmatrix} \rightarrow \begin{pmatrix} \begin{pmatrix} 0 & 2 \end{pmatrix} & \begin{pmatrix} 4 & 6 \end{pmatrix} & \begin{pmatrix} 8 & 10 \end{pmatrix} & \begin{pmatrix} 12 & 14 \end{pmatrix} \\ \begin{pmatrix} 1 & 3 \end{pmatrix} & \begin{pmatrix} 5 & 7 \end{pmatrix} & \begin{pmatrix} 9 & 11 \end{pmatrix} & \begin{pmatrix} 13 & 15 \end{pmatrix} \\ \end{pmatrix}\end{split}\]

The behavior of the permutation is described precisely below:

tile projection

Let \(a\) be a tile like object of rank \(N\), shape \(S\) and index space \(I = [0, S_0) \times [0, S_1) \times ... \times [0, S_{N - 1})\).

We may interpret \(a\) as a function \(a : I \rightarrow \mathbb{E}\) describing the association of indices to elements \(\mathbb{E}\).

The projection of \(a\) along dimension \(0 \leq d < N\) at index \(i \in [0, S_d)\) is a new tile like object of index space \(I_p = [0, S_0) \times [0, S_1) \times ... \times [0, S_{d - 1}) \times [0, 1) \times [0, S_{d + 1}) \times ... \times [0, S_{N - 1})\):

\[\mathrm{Proj}(a, i, d) : I_p \rightarrow \mathbb{E}\]

whose values correspond to \(a\) when dimension \(d\) is fixed at index \(i\):

\[\mathrm{Proj}(a, i, d)(x_0, x_1, \ldots, x_{d-1}, 0, x_{d+1}, \ldots, x_{N-1}) = a(x_0, x_1, \ldots, x_{d-1}, i, x_{d+1}, \ldots, x_{N-1})\]

iterated tile projection

The iterated projection of \(a\) for indices \((i_0, i_1, ..., i_{N-1}) \in I\) at dimensions \((d_0, d_1, ..., d_{N-1})\) is formed by the repeated application of \(\mathrm{Proj}\):

\[\mathrm{Proj}(... \mathrm{Proj}(\mathrm{Proj}(a, i_0, d_0), i_1, d_1), ..., i_{N-1}, d_{N-1})\]

The iterated projection of \(a\) is a singleton tile whose value is the value of \(a\) at index \(i_0\) along dimension \(d_0\), \(i_1\) along \(d_1\), and so on.

tile permutation

Let \(P : [0, N) \rightarrow [0, N)\) be a bijective function describing a permutation of dimensions.

The permutation \(r\) of \(a\) is the unique tile-like object whose value at indices \((i_0, i_1, ..., i_{N-1})\) is the value of the iterated projection of \(a\) for indices \((i_0, i_1, ..., i_{N-1})\) at dimensions \((P(0), P(1), ..., P(N-1))\).
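Read operationally, the definition says that index \(i_k\) of the result is used along dimension \(P(k)\) of the source. The sketch below models this for rank-3 tiles stored as flat row-major vectors; it is plain host C++ for illustration, and Index3, offset3 and permute3_model are hypothetical names rather than cuda::tiles entities.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Index3 = std::array<std::size_t, 3>;

// Row-major offset of index i in a rank-3 tile of shape s.
inline std::size_t offset3(const Index3& s, const Index3& i) {
    return (i[0] * s[1] + i[1]) * s[2] + i[2];
}

// Hypothetical model of a rank-3 permutation: a has shape s and p is the map.
// The result has shape (s[p[0]], s[p[1]], s[p[2]]).
inline std::vector<int> permute3_model(const std::vector<int>& a,
                                       const Index3& s, const Index3& p) {
    assert(a.size() == s[0] * s[1] * s[2]);
    assert(p[0] < 3 && p[1] < 3 && p[2] < 3);
    const Index3 u = {{s[p[0]], s[p[1]], s[p[2]]}};
    std::vector<int> r(a.size());
    Index3 i{};
    for (i[0] = 0; i[0] < u[0]; ++i[0]) {
        for (i[1] = 0; i[1] < u[1]; ++i[1]) {
            for (i[2] = 0; i[2] < u[2]; ++i[2]) {
                Index3 x{};
                // Result index i[k] is applied along source dimension p[k].
                for (std::size_t k = 0; k < 3; ++k) x[p[k]] = i[k];
                r[offset3(u, i)] = a[offset3(s, x)];
            }
        }
    }
    return r;
}
```

With shape {4, 2, 2}, map {2, 0, 1} and source values 0 through 15, the result in row-major order is 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15, matching the permute example above.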

cuda::tiles::tile_transpose_t

template<ct::tile_like T> using tile_transpose_t = /* see below */;

Yields the result type when interchanging the first two dimensions of an object of type \(T\).

If \(T\) has rank less than \(2\), the resulting type is \(T\). Otherwise, the resulting type \(U\) is a specialization of ct::tile satisfying the following:

The element type and rank of \(U\) match those of \(T\).
The lengths of \(U\) satisfy:
1. \(U_0 = T_1\)
2. \(U_1 = T_0\)
3. \(U_i = T_i\) for each \(2 \leq i < N\) where \(N\) is the rank of \(T\).

Examples

Example 1	Tile	`int`
Example 1	Result	`int`
Example 2	Tile	`ct::tile<int, ct::shape<>>`
Example 2	Result	`ct::tile<int, ct::shape<>>`
Example 3	Tile	`ct::tile<int, ct::shape<4, 2, 16, 8>>`
Example 3	Result	`ct::tile<int, ct::shape<2, 4, 16, 8>>`

cuda::tiles::transpose

template<ct::tile_like Tile> __tile__ ct::tile_transpose_t<Tile> transpose(Tile in) noexcept;

Interchanges the first two dimensions of in.

If the rank of in is less than \(2\), in is returned unmodified. Otherwise, the resulting tile is produced as if by invoking:

ct::permute(in, ct::dimension_map<1, 0, 2, 3, ..., N-1>{});

where \(N\) is the rank of in.

Example

The following code transposes a \(4 \times 2 \times 2\) tile to a \(2 \times 4 \times 2\) tile:

namespace ct = ::cuda::tiles;
auto x = ct::iota<ct::tile<int, ct::shape<4, 2, 2>>>();
auto r = ct::transpose(x);

\[\begin{split}\begin{pmatrix} \begin{pmatrix} 0 & 1 \end{pmatrix} & \begin{pmatrix} 2 & 3 \end{pmatrix} \\ \begin{pmatrix} 4 & 5 \end{pmatrix} & \begin{pmatrix} 6 & 7 \end{pmatrix} \\ \begin{pmatrix} 8 & 9 \end{pmatrix} & \begin{pmatrix} 10 & 11 \end{pmatrix} \\ \begin{pmatrix} 12 & 13 \end{pmatrix} & \begin{pmatrix} 14 & 15 \end{pmatrix} \\ \end{pmatrix} \rightarrow \begin{pmatrix} \begin{pmatrix} 0 & 1 \end{pmatrix} & \begin{pmatrix} 4 & 5 \end{pmatrix} & \begin{pmatrix} 8 & 9 \end{pmatrix} & \begin{pmatrix} 12 & 13 \end{pmatrix} \\ \begin{pmatrix} 2 & 3 \end{pmatrix} & \begin{pmatrix} 6 & 7 \end{pmatrix} & \begin{pmatrix} 10 & 11 \end{pmatrix} & \begin{pmatrix} 14 & 15 \end{pmatrix} \\ \end{pmatrix}\end{split}\]
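For a rank-3 tile the transpose reduces to \(r(i_0, i_1, i_2) = x(i_1, i_0, i_2)\). The following host-side sketch applies that rule to a flat row-major vector; transpose3_model is a hypothetical helper written in plain C++, not the library implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of ct::transpose for a rank-3 tile of shape
// s0 x s1 x s2 stored in row-major order. The result has shape s1 x s0 x s2.
inline std::vector<int> transpose3_model(const std::vector<int>& x,
                                         std::size_t s0, std::size_t s1,
                                         std::size_t s2) {
    assert(x.size() == s0 * s1 * s2);
    std::vector<int> r(x.size());
    for (std::size_t i0 = 0; i0 < s1; ++i0) {
        for (std::size_t i1 = 0; i1 < s0; ++i1) {
            for (std::size_t i2 = 0; i2 < s2; ++i2) {
                // r(i0, i1, i2) = x(i1, i0, i2)
                r[(i0 * s0 + i1) * s2 + i2] = x[(i1 * s1 + i0) * s2 + i2];
            }
        }
    }
    return r;
}
```

For the \(4 \times 2 \times 2\) input holding 0 through 15, the row-major result is 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15, which agrees with the matrices above.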

cuda::tiles::concatenation_compatible

template<typename T, typename U, size_t D> concept concatenation_compatible = ct::tile_like<T> && ct::tile_like<U> && /* atomic constraint */;

Indicates whether tile like types \(T\) and \(U\) are concatenation compatible along dimension \(D\).

\(T\) and \(U\) are concatenation compatible along \(D\) if:

\(T\) and \(U\) have the same rank and the same element type.
\(0 \leq D < N\) where \(N\) is the rank of \(T\) and \(U\).
The lengths satisfy \(T_i = U_i\) for each \(0 \leq i < N, \quad i \neq D\).
The shape \(S\) formed by \(S_i = T_i\) for \(i \neq D\) and \(S_D = T_D + U_D\) is a tile compatible shape.

Note

Rank 0 tiles are never concatenation compatible.

cuda::tiles::concatenation_t

template<ct::tile_like T, ct::tile_like U, size_t Dim> requires ct::concatenation_compatible<T, U, Dim> using concatenation_t = /* see below */;

Yields the result type of concatenating tile like types \(T\) and \(U\) along dimension \(D\), where \(D\) is the value of Dim.

The result type is a specialization of ct::tile whose element type and rank match those of \(T\) and \(U\) and whose shape \(S\) satisfies:

\(S_i = T_i\) for \(0 \leq i < N, \quad i \neq D\) where \(N\) is the rank of \(T\) and \(U\).
\(S_D = T_D + U_D\)

Example

Example 1	\(T\)	`int`
	\(U\)	`int`
	Dimension	\(0\)
	Result	Incompatible
Example 2	\(T\)	`ct::tile<int, ct::shape<2, 4>>`
	\(U\)	`ct::tile<int, ct::shape<2, 4>>`
	Dimension	\(0\)
	Result	`ct::tile<int, ct::shape<4, 4>>`
Example 3	\(T\)	`ct::tile<int, ct::shape<2, 4>>`
	\(U\)	`ct::tile<int, ct::shape<2, 4>>`
	Dimension	\(1\)
	Result	`ct::tile<int, ct::shape<2, 8>>`
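The compatibility rules and the resulting shape can be sketched on the host with shapes held as vectors of lengths. This model is an assumption-laden simplification: it ignores the element type requirement and the tile compatible shape requirement, which depend on definitions outside this section, and concatenated_shape is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of concatenation_compatible plus concatenation_t.
// Returns false when t and u cannot be concatenated along d; otherwise
// writes the concatenated shape to out and returns true.
inline bool concatenated_shape(const std::vector<std::size_t>& t,
                               const std::vector<std::size_t>& u,
                               std::size_t d,
                               std::vector<std::size_t>& out) {
    // Same rank and 0 <= d < rank; this also rejects rank 0.
    if (t.size() != u.size() || d >= t.size()) return false;
    for (std::size_t i = 0; i < t.size(); ++i) {
        if (i != d && t[i] != u[i]) return false;  // T_i = U_i for i != D
    }
    out = t;
    out[d] = t[d] + u[d];  // S_D = T_D + U_D
    return true;
}

// Usage: the three table rows above.
inline void concatenated_shape_example() {
    std::vector<std::size_t> s;
    assert(!concatenated_shape({}, {}, 0, s));  // rank 0: incompatible
    assert(concatenated_shape({2, 4}, {2, 4}, 0, s) && s[0] == 4 && s[1] == 4);
    assert(concatenated_shape({2, 4}, {2, 4}, 1, s) && s[0] == 2 && s[1] == 8);
}
```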

cuda::tiles::cat

template<ct::integral auto D, ct::tile_like T, ct::tile_like U> requires (D >= 0) && ct::concatenation_compatible<T, U, D> __tile__ ct::concatenation_t<T, U, D> cat(T x, U y, ct::integral_constant<D> = {}) noexcept;

Concatenates x and y along dimension \(D\).

Let \(N\) be the rank of \(T\) and \(U\). The resulting tile object \(r\) has the values defined below:

\[\begin{split}r(i_0, i_1, ..., i_D, ..., i_{N-1}) = \begin{cases} x(i_0, i_1, ..., i_D, ..., i_{N-1}) & i_D < T_D \\ y(i_0, i_1, ..., i_D - T_D, ..., i_{N-1}) & \text{otherwise} \end{cases}\end{split}\]

Example

The following code concatenates a matrix of zeros with a matrix of ones along the second dimension.

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto x = ct::full<ct::tile<int, ct::shape<4, 2>>>(0);
auto y = ct::full<ct::tile<int, ct::shape<4, 2>>>(1);
auto r = ct::cat(x, y, 1_ic);

\[\begin{split}\begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ \end{pmatrix} \quad \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \\ \end{pmatrix} \rightarrow \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ \end{pmatrix}\end{split}\]
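The piecewise definition of \(r\) translates directly into a loop over the result indices. The sketch below models it for rank-2 tiles on the host; Matrix and cat2_model are hypothetical names, and the code assumes non-empty rectangular inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Hypothetical host-side model of ct::cat for rank-2 tiles (d is 0 or 1).
inline Matrix cat2_model(const Matrix& x, const Matrix& y, std::size_t d) {
    assert(d < 2 && !x.empty() && !y.empty());
    const std::size_t xr = x.size(), xc = x[0].size();
    const std::size_t yr = y.size(), yc = y[0].size();
    // Lengths must agree in the dimension other than d.
    assert(d == 0 ? xc == yc : xr == yr);
    const std::size_t rows = (d == 0) ? xr + yr : xr;
    const std::size_t cols = (d == 1) ? xc + yc : xc;
    Matrix r(rows, std::vector<int>(cols));
    for (std::size_t i0 = 0; i0 < rows; ++i0) {
        for (std::size_t i1 = 0; i1 < cols; ++i1) {
            if (d == 0) {
                // Take from x while i0 < T_0, otherwise from y at i0 - T_0.
                r[i0][i1] = (i0 < xr) ? x[i0][i1] : y[i0 - xr][i1];
            } else {
                // Take from x while i1 < T_1, otherwise from y at i1 - T_1.
                r[i0][i1] = (i1 < xc) ? x[i0][i1] : y[i0][i1 - xc];
            }
        }
    }
    return r;
}
```

Concatenating a \(4 \times 2\) matrix of zeros with a \(4 \times 2\) matrix of ones along dimension 1 yields four rows of 0, 0, 1, 1, as in the example above.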

cuda::tiles::select

template<ct::tile_like T, ct::bool_tile_convertible C> requires ct::broadcastable_to<C, ct::tile_shape_t<T>> __tile__ T select(C condition, T lhs, T rhs) noexcept;

Performs an elementwise selection of the values from lhs where condition is true and rhs where condition is false.

condition undergoes bool tile conversion followed by broadcast conversion to match the shape of \(T\).

Let \(a\), \(b\) and \(c\) denote corresponding elements of the converted lhs, rhs and condition arguments. The result of each selection is \(a\) if \(c\) is true and \(b\) otherwise.

Example

In the following example, the elements of the result at even indices are taken from x while the elements at odd indices are taken from y:

namespace ct = ::cuda::tiles;
using i32x4 = ct::tile<int, ct::shape<4>>;
bool cData[4] = {true, false, true, false};

auto c = ct::load(&cData[0] + ct::iota<i32x4>());
auto x = ct::iota<i32x4>();
auto y = -ct::iota<i32x4>();

auto r = ct::select(c, x, y);

\[\begin{split}\begin{pmatrix} \text{true} \\ \text{false} \\ \text{true} \\ \text{false} \end{pmatrix} \quad \begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \end{pmatrix} \quad \begin{pmatrix} 0 \\ -1 \\ -2 \\ -3 \end{pmatrix} \rightarrow \begin{pmatrix} 0 \\ -1 \\ 2 \\ -3 \end{pmatrix}\end{split}\]
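The per-element rule can be modeled on the host as a conditional inside a loop. This sketch assumes the condition already has the same shape as the operands, so the bool tile conversion and broadcast conversion steps are not modeled; select_model and select_example are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of ct::select on 1D tiles of int.
template <std::size_t N>
std::array<int, N> select_model(const std::array<bool, N>& condition,
                                const std::array<int, N>& lhs,
                                const std::array<int, N>& rhs) {
    std::array<int, N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        r[i] = condition[i] ? lhs[i] : rhs[i];  // a if c is true, b otherwise
    }
    return r;
}

// Usage: the four-element example above.
inline void select_example() {
    const std::array<bool, 4> c = {{true, false, true, false}};
    const std::array<int, 4> x = {{0, 1, 2, 3}};
    const std::array<int, 4> y = {{0, -1, -2, -3}};
    const std::array<int, 4> expected = {{0, -1, 2, -3}};
    assert(select_model(c, x, y) == expected);
}
```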

cuda::tiles::extractable_from

template<typename S, typename T> concept extractable_from = ct::tile_shape<S> && ct::tile_like<T> && /* see below */;

Determines if tile compatible shape \(S\) is an extractable shape of \(T\). \(S\) is extractable from \(T\) if:

\(S\) and \(T\) have the same rank denoted \(N\).
For each \(i\) in \(0 \leq i < N\), \(T_i\) is divisible by \(S_i\).

Example

Example 1	\(T\)	`int`
	\(S\)	`ct::shape<>`
	Extractable	Yes
Example 2	\(T\)	`ct::tile<int, ct::shape<32, 8>>`
	\(S\)	`ct::shape<16, 2>`
	Extractable	Yes
Example 3	\(T\)	`ct::tile<int, ct::shape<32, 8>>`
	\(S\)	`ct::shape<2, 16>`
	Extractable	No
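The two conditions amount to a rank comparison followed by a per-dimension divisibility test. The host-side sketch below represents shapes as vectors of lengths; extractable_model is a hypothetical helper, and treating a zero length in \(S\) as not extractable is an assumption of this model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of extractable_from: s is the candidate shape and
// t holds the lengths of the tile like type T.
inline bool extractable_model(const std::vector<std::size_t>& s,
                              const std::vector<std::size_t>& t) {
    if (s.size() != t.size()) return false;  // ranks must match
    for (std::size_t i = 0; i < s.size(); ++i) {
        // T_i must be divisible by S_i (a zero S_i is rejected here).
        if (s[i] == 0 || t[i] % s[i] != 0) return false;
    }
    return true;
}

// Usage: the three table rows above.
inline void extractable_example() {
    assert(extractable_model({}, {}));             // rank 0
    assert(extractable_model({16, 2}, {32, 8}));   // 32 % 16 == 0, 8 % 2 == 0
    assert(!extractable_model({2, 16}, {32, 8}));  // 8 % 16 != 0
}
```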

cuda::tiles::extract

template<ct::tile_shape S, ct::tile_like T, typename ...Indices> requires ct::extractable_from<S, T> && /* atomic constraint */ __tile__ ct::tile<ct::tile_element_t<T>, S> extract(T x, S, Indices... idx) noexcept;

Extracts a partition of shape \(S\) from tile like object x.

The elements of x are partitioned into equally sized sections of shape \(S\) and each partition is assigned a zero-based index. The idx argument specifies the index of the partition that will be returned.

Let \(N\) be the rank of \(T\) and let \(i_0, i_1, \ldots, i_{N-1}\) be the values of the idx pack after conversion to size_t. The resulting extraction \(r\) has the following value:

\[r(k_0, k_1, \ldots, k_{N-1}) = x(i_0 \cdot S_0 + k_0, i_1 \cdot S_1 + k_1, \ldots, i_{N-1} \cdot S_{N-1} + k_{N-1})\]

If the index \(I = (i_0 \cdot S_0, \ldots, i_{N-1} \cdot S_{N-1})\) is not in the index space of x, the behavior is undefined.

The atomic constraint validates that:

The size of the parameter pack Indices is \(N\).
For each type \(U\) in the parameter pack Indices, is-convertible-v<U, size_t> holds.

Example

In the following code, the partition in the top-right quadrant of the matrix is extracted:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto x = ct::iota<ct::tile<int, ct::shape<4, 4>>>();
auto r = ct::extract(x, ct::shape{2_ic, 2_ic}, 0, 1);

\[\begin{split}\begin{pmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \\ 8 & 9 & 10 & 11 \\ 12 & 13 & 14 & 15 \end{pmatrix} \rightarrow \begin{pmatrix} 2 & 3 \\ 6 & 7 \end{pmatrix}\end{split}\]
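The extraction formula can be modeled for rank-2 tiles with two nested loops over the partition. In this host-side sketch Matrix and extract2_model are hypothetical names; the out-of-range case, which the library leaves undefined, is turned into an assertion here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Hypothetical host-side model of ct::extract for rank-2 tiles.
// (s0, s1) is the partition shape and (i0, i1) the partition index.
inline Matrix extract2_model(const Matrix& x, std::size_t s0, std::size_t s1,
                             std::size_t i0, std::size_t i1) {
    assert(!x.empty() && s0 > 0 && s1 > 0);
    // The partition shape must be extractable from the shape of x.
    assert(x.size() % s0 == 0 && x[0].size() % s1 == 0);
    // The selected partition must lie inside the index space of x.
    assert((i0 + 1) * s0 <= x.size() && (i1 + 1) * s1 <= x[0].size());
    Matrix r(s0, std::vector<int>(s1));
    for (std::size_t k0 = 0; k0 < s0; ++k0) {
        for (std::size_t k1 = 0; k1 < s1; ++k1) {
            // r(k0, k1) = x(i0 * S0 + k0, i1 * S1 + k1)
            r[k0][k1] = x[i0 * s0 + k0][i1 * s1 + k1];
        }
    }
    return r;
}
```

For the \(4 \times 4\) matrix above, partition shape (2, 2) and index (0, 1) select the rows 2, 3 and 6, 7.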

cuda::tiles::broadcast

template<ct::tile_shape S, ct::tile_like T> requires ct::broadcastable_to<T, S> __tile__ ct::tile<ct::tile_element_t<T>, S> broadcast(T x, S = {}) noexcept;

Yields the broadcast conversion of x to shape \(S\).

Example

The following code broadcasts a \(4 \times 1\) tile to a \(4 \times 4\) tile:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto x = ct::iota<ct::tile<int, ct::shape<4, 1>>>();
auto r = ct::broadcast(x, ct::shape{4_ic, 4_ic});

\[\begin{split}\begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \end{pmatrix} \rightarrow \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 2 & 2 & 2 & 2 \\ 3 & 3 & 3 & 3 \end{pmatrix}\end{split}\]
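The example can be reproduced on the host under the assumption that broadcast conversion replicates a source dimension of length 1 across the target length and leaves matching lengths unchanged. That rule is inferred from the example rather than stated in this section, and Matrix and broadcast2_model are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Hypothetical host-side model of ct::broadcast for rank-2 tiles.
// Assumes each source length is either 1 or equal to the target length.
inline Matrix broadcast2_model(const Matrix& x, std::size_t rows,
                               std::size_t cols) {
    assert(!x.empty() && !x[0].empty());
    const std::size_t xr = x.size(), xc = x[0].size();
    assert((xr == 1 || xr == rows) && (xc == 1 || xc == cols));
    Matrix r(rows, std::vector<int>(cols));
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // A length-1 source dimension always reads index 0.
            r[i][j] = x[xr == 1 ? 0 : i][xc == 1 ? 0 : j];
        }
    }
    return r;
}
```

Broadcasting the \(4 \times 1\) column 0, 1, 2, 3 to \(4 \times 4\) fills row \(i\) with the value \(i\), as shown above.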

cuda::tiles::element_cast

template<ct::scalar E, ct::tile_like T> requires ct::tile_convertible_to<T, ct::tile_with_element_t<T, E>> __tile__ ct::tile_with_element_t<T, E> element_cast(T x) noexcept;

Yields the tile conversion of x to the type ct::tile_with_element_t<T, E>.

Example

The following code converts a tile of integers to a tile of doubles:

namespace ct = ::cuda::tiles;
auto x = ct::iota<ct::tile<int, ct::shape<4, 1>>>();
auto r = ct::element_cast<double>(x);

\[\begin{split}\begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \end{pmatrix} \rightarrow \begin{pmatrix} 0.0 \\ 1.0 \\ 2.0 \\ 3.0 \end{pmatrix}\end{split}\]

cuda::tiles::element_bitcast

template<ct::scalar E, ct::tile_like T> requires /* atomic constraint */ __tile__ ct::tile_with_element_t<T, E> element_bitcast(T x) noexcept;

Yields the elementwise bitcast of x to the type \(E\).

For each element \(a\) in x, the result of the computation is produced as if by executing std::bit_cast<E>(a) [1].

The atomic constraint validates that the object size of \(E\) is the same as the object size of the element type of \(T\).

Note

The result of this operation depends on the value and object representations of the source and result types, and may generate undefined behavior; see § 22.15.3 of ISO/IEC 14882:2024 for details.

Example

namespace ct = ::cuda::tiles;
auto x = ct::full<ct::tile<unsigned char, ct::shape<4, 1>>>(255);
auto r = ct::element_bitcast<signed char>(x);

\[\begin{split}\begin{pmatrix} 255 \\ 255 \\ 255 \\ 255 \end{pmatrix} \rightarrow \begin{pmatrix} -1 \\ -1 \\ -1 \\ -1 \end{pmatrix}\end{split}\]
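The elementwise reinterpretation can be modeled on the host by copying object bytes. The sketch below uses std::memcpy in place of std::bit_cast so that it also compiles before C++20; for trivially copyable types of equal size the two are assumed to produce the same object representation. element_bitcast_model and element_bitcast_example are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical host-side model of ct::element_bitcast on a 1D tile.
template <typename To, typename From, std::size_t N>
std::array<To, N> element_bitcast_model(const std::array<From, N>& x) {
    // Mirrors the documented constraint on matching object sizes.
    static_assert(sizeof(To) == sizeof(From), "element sizes must match");
    static_assert(std::is_trivially_copyable<To>::value &&
                      std::is_trivially_copyable<From>::value,
                  "elements must be trivially copyable");
    std::array<To, N> r{};
    for (std::size_t i = 0; i < N; ++i) {
        // Copy the bytes of each element unchanged into the new type.
        std::memcpy(&r[i], &x[i], sizeof(To));
    }
    return r;
}

// Usage: the example above, assuming two's complement signed char.
inline void element_bitcast_example() {
    const std::array<unsigned char, 4> x = {{255, 255, 255, 255}};
    const auto r = element_bitcast_model<signed char>(x);
    for (std::size_t i = 0; i < 4; ++i) assert(r[i] == -1);
}
```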

Footnotes

1: See Function template bit_cast [bit.cast] § 22.15.3 of ISO/IEC 14882:2024