Reductions and Scans

A reduction on a tile like object along a dimension is the result of repeatedly accumulating the elements of that dimension into a single value according to a binary operation. The resulting tile contains the accumulated result of each vector along the operating dimension. A scan is similar to a reduction except it preserves partial accumulation results in the final tile.

Definitions

In the following definitions, let \(a\) be a tile like object of rank \(N\), shape \(S\), and element type \(E\). Let \(0 \leq d < N\) be a dimension of \(a\) and \(\operatorname{op} : E \times E \rightarrow E\) denote a binary operation with identity element \(\mathrm{id}\) of scalar type \(E\).

tile reduction

A reduction of \(a\) along \(d\) for operation \(\operatorname{op}\) is any tile like object \(r\) satisfying the following constraints.

The shape \(R\) of \(r\) matches \(S\) except at dimension \(d\) where its length is \(1\):

\[\begin{split}R_k = \begin{cases} S_k & k \neq d \\ 1 & k = d \end{cases}\end{split}\]

Let

\[I = (i_0, i_1, ..., i_{d-1}, 1, i_{d+1}, \ldots , i_{N-1})\]

be an element in the index space of \(r\). Consider a collection consisting of the elements of \(a\) along \(d\) at \(I\):

\[V = \left \{ a( i_0, i_1, ..., i_{d-1}, j, i_{d+1}, \ldots , i_{N-1}) \quad | \quad 0 \leq j < S_d \right \}\]

Let \(v = (v_0, v_1, \ldots, v_k)\) be any sequence consisting of the elements of \(V\) and \(0\) or more instances of \(\mathrm{id}\).

The value of \(r\) at \(I\) is any application of \(\operatorname{op}\) to the sequence \(v\):

\[r(I) = v_0 \operatorname{op} v_1 \operatorname{op} \ldots \operatorname{op} v_k\]

where the grouping of the application of \(\operatorname{op}\) and the ordering of the elements of \(v\) is unspecified.

If \(\operatorname{op}\) would incur undefined behavior on any possible grouping or ordering of the elements of \(v\), the computation of the reduction as a whole is undefined behavior.

Note

If \(\operatorname{op}\) is not commutative or not associative or if \(\mathrm{id}\) is not a true identity element, there may be multiple possible reductions of \(a\).

If a particular grouping or ordering of \(v\) is observed at index \(I\) in the result, this does not imply any constraint on the grouping or ordering of \(v\) for any other index of \(r\).

tile scan

A scan of \(a\) along \(d\) for operation \(\operatorname{op}\) is any tile like object \(r\) of shape \(S\) satisfying the following constraints:

Let

\[I = (i_0, i_1, \ldots, i_{d-1}, i_d, i_{d+1}, \ldots, i_{N-1})\]

be an element in the index space of \(r\).

Consider a collection consisting of the elements of \(a\) up to index \(i_d\) along \(d\):

\[V = \left \{ a( i_0, i_1, ..., i_{d-1}, j, i_{d+1}, ..., i_{N-1}) \quad | \quad 0 <= j <= i_d \right \}\]

Let \(v = (v_0, v_1, \ldots, v_k)\) be any sequence consisting of the elements of \(V\) and \(0\) or more instances of \(\mathrm{id}\).

The value of \(r\) at \(I\) is any application of \(\operatorname{op}\) to the sequence \(v\)

\[r(I) = v_0 \operatorname{op} v_1 \operatorname{op} \ldots \operatorname{op} v_k\]

where the grouping of the application of \(\operatorname{op}\) and the ordering of the elements of \(v\) is unspecified.

If \(\operatorname{op}\) would incur undefined behavior on any possible grouping or ordering of the elements of \(v\), the computation of the scan as a whole is undefined behavior.

Note

If \(\operatorname{op}\) is not commutative or not associative or if \(\mathrm{id}\) is not a true identity element, there may be multiple possible scans for \(a\).

If a particular grouping or ordering of \(v\) is observed at index \(I\) in the result, this does not imply any constraint on the grouping or ordering of \(v\) for any other index of \(r\).

cuda::tiles::reduction_result_t

template<ct::tile_like T, size_t D>
requires (D < ct::tile_rank_v<T>)
using reduction_result_t = /* see below */;

Yields the result type of applying a reduction to an object of type \(T\) along dimension \(D\).

cuda::tiles::reduce_max

template<
ct::integral auto D,
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<T, D> reduce_max(T a, ct::integral_constant<D> = {}) noexcept;
template<
ct::integral auto D,
ct::nan_propagation_mode Nan = ct::default_nan_propagation_mode(),
ct::subnormals_rounding_mode SubMode = ct::default_subnormals_rounding_mode(),
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>) && /* atomic constraint */
__tile__ ct::reduction_result_t<T, D> reduce_max(T a, ct::integral_constant<D>, ct::nan_propagation_mode_constant<Nan>, ct::subnormals_rounding_mode_constant<SubMode> = {}) noexcept;

Yields an unspecified reduction of operand a along dimension D according to binary operator \(\operatorname{op}(x, y)\) defined as:

Overload 1: ct::max(x, y)

Overload 2: ct::max<Nan, SubMode>(x, y)

Let \(E\) be the element type of \(T\).

If \(E\) is an integral scalar. the identity element \(\mathrm{id}\) is the minimal representable value of \(E\).

If \(E\) is a basic floating point scalar. the identity element \(\mathrm{id}\) is negative infinity.

The atomic constraint of overload (2) validates that

  1. \(T\) is a basic floating point scalar

  2. If SubMode is round subnormals to zero, then \(T\) is float.

  3. The values NanMode and SubMode are enumerators of their respective types.

Note

The identity element for floating point arguments \(\mathrm{id} = -\infty\) may not be a true identity of \(\operatorname{op}\) depending on the provided numeric flags. For example, when suppress NaN (the default) is used:

\[\operatorname{op}(-\infty, \text{NaN}) = -\infty\]

In this scenario, multiple valid reduction results are possible when any element of the tile is a \(\text{NaN}\).

Example

The following example performs a max reduction:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

float xData[2][4] = {
  {0, 10, 2, 5},
  {-3, 2, 22, 7},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r0 = ct::reduce_max(x, 1_ic); // No flags

auto r1 = ct::reduce_max(x, 1_ic,  // With flags
                         ct::suppress_nan_t{},
                         ct::preserve_subnormals_t{});
\[\begin{split}\begin{pmatrix} 0 & 10 & 2 & 5 \\ -3 & 2 & 22 & 7 \end{pmatrix} \rightarrow \begin{pmatrix} 10 \\ 22 \end{pmatrix}\end{split}\]

cuda::tiles::reduce_min

template<
ct::integral auto D,
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<T, D> reduce_min(T a, ct::integral_constant<D> = {}) noexcept;
template<
ct::integral auto D,
ct::nan_propagation_mode Nan = ct::default_nan_propagation_mode(),
ct::subnormals_rounding_mode SubMode = ct::default_subnormals_rounding_mode(),
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>) && /* atomic constraint */
__tile__ ct::reduction_result_t<T, D> reduce_min(T a, ct::integral_constant<D>, ct::nan_propagation_mode_constant<Nan>, ct::subnormals_rounding_mode_constant<SubMode> = {}) noexcept;

Yields an unspecified reduction of operand a along dimension D according to binary operator \(\operatorname{op}(x, y)\) defined as:

Overload 1: ct::min(x, y)

Overload 2: ct::min<Nan, SubMode>(x, y)

Let \(E\) be the element type of \(T\).

If \(E\) is an integral scalar. the identity element \(\mathrm{id}\) is the maximal representable value of \(E\).

If \(E\) is a basic floating point scalar. the identity element \(\mathrm{id}\) is positive infinity.

The atomic constraint of overload (2) validates that

  1. \(T\) is a basic floating point scalar

  2. If SubMode is round subnormals to zero, then \(T\) is float.

  3. The values NanMode and SubMode are enumerators of their respective types.

Note

The identity element for floating point arguments \(\mathrm{id} = \infty\) may not be a true identity of \(\operatorname{op}\) depending on the provided numeric flags. For example, when suppress NaN (the default) is used:

\[\operatorname{op}(\infty, \text{NaN}) = \infty\]

In this scenario, multiple valid reduction results are possible when any element of the tile is a \(\text{NaN}\).

Example

The following example performs a min reduction:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

float xData[2][4] = {
  {0, 10, 2, 5},
  {-3, 2, 22, 7},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r0 = ct::reduce_min(x, 0_ic); // No flags

auto r1 = ct::reduce_min(x, 0_ic,  // With flags
                         ct::suppress_nan_t{},
                         ct::preserve_subnormals_t{});
\[\begin{split}\begin{pmatrix} 0 & 10 & 2 & 5 \\ -3 & 2 & 22 & 7 \end{pmatrix} \rightarrow \begin{pmatrix} -3 & 2 & 2 & 5 \end{pmatrix}\end{split}\]

cuda::tiles::all_of

template<ct::integral auto D, ct::bool_tile_convertible T>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<ct::tile_with_element_t<T, bool>, D> all_of(T a, ct::integral_constant<D> = {}) noexcept;

Yields the logical AND reduction of the bool tile converted operand a along dimension \(D\). The reduction operator \(\operatorname{op}(x, y)\) is the expression x && y and the identity element is true.

Example

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

bool xData[2][4] = {
  {true, true, false, false},
  {true, false, true, false},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r = ct::all_of(x, 0_ic);
\[\begin{split}\begin{pmatrix} \text{true} & \text{true} & \text{false} & \text{false} \\ \text{true} & \text{false} & \text{true} & \text{false} \end{pmatrix} \rightarrow \begin{pmatrix} \text{true} & \text{false} & \text{false} & \text{false} \end{pmatrix}\end{split}\]

cuda::tiles::any_of

template<ct::integral auto D, ct::bool_tile_convertible T>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<ct::tile_with_element_t<T, bool>, D> any_of(T a, ct::integral_constant<D> = {}) noexcept;

Yields the logical OR reduction of the bool tile converted operand a along dimension \(D\). The reduction operator \(\operatorname{op}(x, y)\) is the expression x || y and the identity element is false.

Example

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

bool xData[2][4] = {
  {true, true, false, false},
  {true, false, true, false},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r = ct::all_of(x, 0_ic);
\[\begin{split}\begin{pmatrix} \text{true} & \text{true} & \text{false} & \text{false} \\ \text{true} & \text{false} & \text{true} & \text{false} \end{pmatrix} \rightarrow \begin{pmatrix} \text{true} & \text{true} & \text{true} & \text{false} \end{pmatrix}\end{split}\]

cuda::tiles::sum

template<
ct::integral auto D,
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<T, D> sum(T a, ct::integral_constant<D> = {}) noexcept;
template<
ct::integral auto D,
ct::rounding_mode Mode = ct::default_rounding_mode(),
ct::subnormals_rounding_mode SubMode = ct::default_subnormals_rounding_mode(),
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>) && /* atomic constraint */
__tile__ ct::reduction_result_t<T, D> sum(T a, ct::integral_constant<D>, ct::rounding_mode_constant<Mode>, ct::subnormals_rounding_mode_constant<SubMode> = {}) noexcept;

Yields an unspecified summation reduction of operand a along dimension \(D\). The binary operation \(\operatorname{op}(x, y)\) is defined as

Overload 1: ct::add(x, y)

Overload 2: ct::add<Mode, SubMode>(x, y)

The identity element \(\mathrm{id}\) is \(0\) for integral scalar elements and positive zero for floating point elements of a.

The atomic constraint of overload (2) validates that:

  1. \(T\) is a basic floating point scalar

  2. Mode is a precise rounding mode

  3. If SubMode is round subnormals to zero, then \(T\) is float.

  4. The values Mode and SubMode are enumerators of their respective types.

Note

The binary operation \(\operatorname{op}\) is not associative for floating point arguments. There may be multiple valid results of the reduction.

Undefined behavior may occur if any ordering or grouping of elements triggers signed integer overflow. See undefined behavior annex for an example.

Example

The following example sums the rows of a tile:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

float xData[2][4] = {
  {3, 2, 1, 4},
  {-3, 2, 1, 5},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r0 = ct::sum(x, 1_ic); // No flags

auto r1 = ct::sum(x, 1_ic,  // With flags
                  ct::round_ties_to_even_t{},
                  ct::preserve_subnormals_t{});
\[\begin{split}\begin{pmatrix} 3 & 2 & 1 & 4 \\ -3 & 2 & 1 & 5 \\ \end{pmatrix} \rightarrow \begin{pmatrix} 10 \\ 5 \end{pmatrix}\end{split}\]

cuda::tiles::prod

template<
ct::integral auto D,
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<T, D> prod(T a, ct::integral_constant<D> = {}) noexcept;
template<
ct::integral auto D,
ct::rounding_mode Mode = ct::default_rounding_mode(),
ct::subnormals_rounding_mode SubMode = ct::default_subnormals_rounding_mode(),
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>) && /* atomic constraint */
__tile__ ct::reduction_result_t<T, D> prod(T a, ct::integral_constant<D>, ct::rounding_mode_constant<Mode>, ct::subnormals_rounding_mode_constant<SubMode> = {}) noexcept;

Yields an unspecified product reduction of operand a along dimension \(D\). The binary operation \(\operatorname{op}(x, y)\) is defined as

Overload 1: ct::mul(x, y)

Overload 2: ct::mul<Mode, SubMode>(x, y)

The identity element \(\mathrm{id}\) is \(1\).

The atomic constraint of overload (2) validates that:

  1. \(T\) is a basic floating point scalar

  2. Mode is a precise rounding mode

  3. If SubMode is round subnormals to zero, then \(T\) is float.

  4. The values Mode and SubMode are enumerators of their respective types.

Note

The binary operation \(\operatorname{op}\) is not associative for floating point arguments. There may be multiple valid results of the reduction.

Undefined behavior may occur if any ordering or grouping of elements triggers signed integer overflow. See the undefined behavior annex for an example.

Example

The following example performs a product along the rows of a tile:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

float xData[2][4] = {
  {3, 2, 1, 4},
  {-3, 2, 1, 5},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r0 = ct::prod(x, 1_ic); // No flags

auto r1 = ct::prod(x, 1_ic,  // With flags
                   ct::round_ties_to_even_t{},
                   ct::preserve_subnormals_t{});
\[\begin{split}\begin{pmatrix} 3 & 2 & 1 & 4 \\ -3 & 2 & 1 & 5 \\ \end{pmatrix} \rightarrow \begin{pmatrix} 24 \\ -30 \end{pmatrix}\end{split}\]

cuda::tiles::reduce_bitand

template<
ct::integral auto D,
ct::integral_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<T, D> reduce_bitand(T a, ct::integral_constant<D> = {}) noexcept;

Yields the unique bitwise AND reduction of the operand a along dimension \(D\).

The binary operator \(\operatorname{op}(x, y)\) is the expression ct::operator&(x, y).

Let \(E\) be the element type of a. If \(E\) is a signed integral scalar, the identity element \(\mathrm{id}\) is \(-1\). If \(E\) is an unsigned integral scalar the identity element is the maximum integer value representable in \(E\).

Example

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

unsigned char xData[2][4] = {
  {0b00001111, 0b10101010},
  {0b01010101, 0b11110000},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r = ct::reduce_bitand(x, 0_ic);
\[\begin{split}\begin{pmatrix} \text{0b00001111} & \text{0b10101010} \\ \text{0b01010101} & \text{0b11110000} \end{pmatrix} \rightarrow \begin{pmatrix} \text{0b00000101} & \text{0b10100000} \end{pmatrix}\end{split}\]

cuda::tiles::reduce_bitor

template<ct::integral auto D, ct::integral_tile T>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<T, D> reduce_bitor(T a, ct::integral_constant<D> = {}) noexcept;

Yields the unique bitwise OR reduction of the operand a along dimension \(D\).

The binary operator \(\operatorname{op}(x, y)\) is the expression ct::operator|(x, y).

The identity element \(\mathrm{id}\) is \(0\).

Example

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

unsigned char xData[2][4] = {
  {0b00001111, 0b10101010},
  {0b01010101, 0b11110000},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r = ct::reduce_bitor(x, 0_ic);
\[\begin{split}\begin{pmatrix} \text{0b00001111} & \text{0b10101010} \\ \text{0b01010101} & \text{0b11110000} \end{pmatrix} \rightarrow \begin{pmatrix} \text{0b01011111} & \text{0b11111010} \end{pmatrix}\end{split}\]

cuda::tiles::reduce_bitxor

template<ct::integral auto D, ct::integral_tile T>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ ct::reduction_result_t<T, D> reduce_bitxor(T a, ct::integral_constant<D> = {}) noexcept;

Yields the unique bitwise XOR reduction of the operand a along dimension \(D\).

The binary operator \(\operatorname{op}(x, y)\) is the expression ct::operator^(x, y).

The identity element \(\mathrm{id}\) is \(0\).

Example

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

unsigned char xData[2][4] = {
  {0b00001111, 0b10101010},
  {0b01010101, 0b11110000},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r = ct::reduce_bitxor(x, 0_ic);
\[\begin{split}\begin{pmatrix} \text{0b00001111} & \text{0b10101010} \\ \text{0b01010101} & \text{0b11110000} \end{pmatrix} \rightarrow \begin{pmatrix} \text{0b01011010} & \text{0b01011010} \end{pmatrix}\end{split}\]

cuda::tiles::partial_sum

template<
ct::integral auto D,
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ T partial_sum(T a, ct::integral_constant<D> = {}) noexcept;
template<
ct::integral auto D,
ct::rounding_mode Mode = ct::default_rounding_mode(),
ct::subnormals_rounding_mode SubMode = ct::default_subnormals_rounding_mode(),
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>) && /* atomic constraint */
__tile__ T partial_sum(T a, ct::integral_constant<D>, ct::rounding_mode_constant<Mode>, ct::subnormals_rounding_mode_constant<SubMode> = {}) noexcept;

Yields an unspecified scan summation of operand a along dimension \(D\). The binary operation \(\operatorname{op}(x, y)\) is defined as

Overload 1: ct::add(x, y)

Overload 2: ct::add<Mode, SubMode>(x, y)

The identity element \(\mathrm{id}\) is \(0\) for integral scalar elements and positive zero for floating point elements of a.

The atomic constraint of overload (2) validates that:

  1. \(T\) is a basic floating point scalar

  2. Mode is a precise rounding mode

  3. If SubMode is round subnormals to zero, then \(T\) is float.

  4. The values Mode and SubMode are enumerators of their respective types.

Note

The binary operation \(\operatorname{op}\) is not associative for floating point arguments. There may be multiple valid results of the scan.

Undefined behavior may occur if any ordering or grouping of elements triggers signed integer overflow. See undefined behavior annex for an example.

Example

The following example performs a partial sum of the rows of a tile:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

float xData[2][4] = {
  {3, 2, 1, 4},
  {-3, 2, 1, 5},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r0 = ct::partial_sum(x, 1_ic); // No flags

auto r1 = ct::partial_sum(x, 1_ic,  // With flags
                          ct::round_ties_to_even_t{},
                          ct::preserve_subnormals_t{});
\[\begin{split}\begin{pmatrix} 3 & 2 & 1 & 4 \\ -3 & 2 & 1 & 5 \\ \end{pmatrix} \rightarrow \begin{pmatrix} 3 & 5 & 6 & 10 \\ -3 & -1 & 0 & 5 \end{pmatrix}\end{split}\]

cuda::tiles::partial_prod

template<
ct::integral auto D,
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>)
__tile__ T partial_prod(T a, ct::integral_constant<D> = {}) noexcept;
template<
ct::integral auto D,
ct::rounding_mode Mode = ct::default_rounding_mode(),
ct::subnormals_rounding_mode SubMode = ct::default_subnormals_rounding_mode(),
ct::arithmetic_tile T
>
requires (D >= 0) && (D < tile_rank_v<T>) && /* atomic constraint */
__tile__ T partial_prod(T a, ct::integral_constant<D>, ct::rounding_mode_constant<Mode>, ct::subnormals_rounding_mode_constant<SubMode> = {}) noexcept;

Yields an unspecified product scan of operand a along dimension \(D\). The binary operation \(\operatorname{op}(x, y)\) is defined as

Overload 1: ct::mul(x, y)

Overload 2: ct::mul<Mode, SubMode>(x, y)

The identity element \(\mathrm{id}\) is \(1\).

The atomic constraint of overload (2) validates that:

  1. \(T\) is a basic floating point scalar

  2. Mode is a precise rounding mode

  3. If SubMode is round subnormals to zero, then \(T\) is float.

  4. The values Mode and SubMode are enumerators of their respective types.

Note

The binary operation \(\operatorname{op}\) is not associative for floating point arguments. There may be multiple valid results of the scan.

Undefined behavior may occur if any ordering or grouping of elements triggers signed integer overflow. See the undefined behavior annex for an example.

Example

The following example performs a product scan along the rows of a tile:

namespace ct = ::cuda::tiles;
using namespace ct::literals;
using i32x2x4 = ct::tile<int, ct::shape<2, 4>>;

float xData[2][4] = {
  {3, 2, 1, 4},
  {-3, 2, 1, 5},
};

auto x = ct::load(&xData[0][0] + ct::iota<i32x2x4>());

auto r0 = ct::partial_prod(x, 1_ic); // No flags

auto r1 = ct::partial_prod(x, 1_ic,  // With flags
                          ct::round_ties_to_even_t{},
                          ct::preserve_subnormals_t{});
\[\begin{split}\begin{pmatrix} 3 & 2 & 1 & 4 \\ -3 & 2 & 1 & 5 \\ \end{pmatrix} \rightarrow \begin{pmatrix} 3 & 6 & 6 & 24 \\ -3 & -6 & -6 & -30 \end{pmatrix}\end{split}\]