General Principles

Header and Namespace

The Tile C++ APIs are available in the cuda_tile.h header. This header does not depend on the C++ standard library headers or on other CUDA Toolkit headers.

Unless otherwise specified, the Tile C++ APIs inhabit the cuda::tiles namespace. The cuda::tiles qualification may be abbreviated as ct in examples and documentation.

Example

Basic kernel returning the sum of two vectors using the cuda_tile.h header and cuda::tiles namespace.

#include "cuda_tile.h"

__tile_global__ void kernel(int* a, int* b, int* c) {
  namespace ct = ::cuda::tiles;

  auto idx = ct::bid().x;

  c[idx] = ct::add(a[idx], b[idx]);
}

The NVCC command line should target sm_80 or higher and should include the -std=c++20 and --enable-tile flags:

nvcc --enable-tile -std=c++20 -arch sm_80 main.cu

Language and Architecture Support

Use the --enable-tile flag to enable support for tile programming in NVCC and NVRTC. While tile kernels can be written with any C++ language standard, the CUDA Tile C++ APIs in cuda_tile.h including the ct::tile type require C++20 or higher.

The minimum supported target architecture for tile programming is sm_80. Tile code generation is implicitly disabled for earlier targets.

Note

NVCC targets sm_75 as the default architecture if no architecture is provided. In this configuration, no tile code will be generated and a tile kernel launch will fail at runtime.

API and ABI Stability

The initial release of CUDA Tile C++ is CUDA 13.3.

Idiomatic usage of the Tile C++ APIs will not be broken between minor CUDA Toolkit versions. The APIs may change between major CUDA Toolkit versions. Certain non-idiomatic uses of the APIs could be broken between minor releases. These scenarios are noted in the API reference where relevant.

The ABI of any type under the cuda::tiles namespace will not change between minor versions of the CUDA Toolkit. The ABI of these types may change between major versions of the CUDA Toolkit. If an ABI change occurs, the ABI version tag for the inline namespace containing the affected type will also be incremented. As a result, code using the new ABI may fail to link with host or device libraries expecting the old ABI. This ensures early detection of some ABI mismatch issues.

Only the APIs which are explicitly documented in this API reference have a stable interface. No guarantee is made about the behavior or availability of entities with a double underscore prefix or entities under the cuda::tiles::detail namespace.

Scalars

The scalars are fundamental data types that are natively supported by the tile programming model. The Tile C++ APIs operate either on scalars directly or on dense arrays of scalars known as tiles.

The following types are known as scalars: 1

Integral Scalars

An integral scalar is a possibly cv-qualified integral 5 type \(T\) whose bit size CHAR_BIT * sizeof(T) is \(8\), \(16\), \(32\), or \(64\).

Example

All of the following types are integral scalars on every platform supported by NVCC:

  • char, unsigned int, long long, char32_t, wchar_t, bool

The extended integer type int128_t which is available on some platforms is not an integral scalar because its bit size is greater than \(64\).

The type std::byte is not an integral scalar because enumerations are not integral 5.

Basic Floating Point Scalars

The following types and cv-qualified variants thereof are known as basic floating point scalars:

double

Models the IEEE 754 2 64-bit type.

float

Models the IEEE 754 2 32-bit type.

__half

Models the IEEE 754 2 16-bit type. Available in the cuda_fp16.h header.

__nv_bfloat16

Models the bfloat16 floating point format. In this format, 1 bit is stored for the sign, 8 bits for the exponent, and 7 bits for the mantissa. Available in the cuda_bf16.h header.

The basic floating point types may be used in the majority of Tile C++ arithmetic and math APIs.

The __nv_bfloat16 is considered to have appropriate IEEE 754 semantics for the purposes of defining the arithmetic and math APIs and associated rounding mode behaviors.

Note

The C++23 extended floating point types 6 are not considered to be basic floating point scalars even when provided by the platform. For example, none of the following types are basic floating point scalars:

  • std::bfloat16_t, std::float16_t, std::float32_t, std::float64_t, std::float128_t

Restricted Floating Point Scalars

The following data types and their cv-qualified variants are known as restricted floating point scalars:

__nv_fp8_e4m3

Models a floating point format with 1 sign bit, 4 exponent bits and 3 mantissa bits. Available in the cuda_fp8.h header.

__nv_fp8_e5m2

Models a floating point format with 1 sign bit, 5 exponent bits and 2 mantissa bits. Available in the cuda_fp8.h header.

__nv_tf32

Models the tensor float 32 floating point format. This format has 1 sign bit, 8 exponent bits, and 10 or more mantissa bits. The total storage size is 32 bits and the alignment is 4 bytes. Available in the cuda_tf32.h header.

The restricted floating point types may be loaded and stored from memory and used in matrix multiplication functions, however their support in arithmetic and math operations is limited.

The __nv_fp8_e4m3 and __nv_fp8_e5m2 types are supported only for sm_90 and later architectures in tile code.

Pointer Scalars

The following types and their cv-qualified variants are known as pointer scalars:

  1. Pointers to numeric scalar types

  2. Pointers to possibly cv-qualified void

Example

Multi-level pointers and pointers to structs, arrays, unions, and functions are not considered to be pointer scalars and may not be used as tile elements.

Pointer Scalars

Not Pointer Scalars

int* const

int**

double const*

int Foo::*

void*

void (*)(int, double)

__half volatile const*

Foo const*

Floating Point Scalars

The basic floating point scalars and restricted floating point scalars are collectively known as floating point scalar types.

Arithmetic Scalars

The integral scalars and basic floating point scalars are collectively known as arithmetic scalar types. The arithmetic scalars can be used in most arithmetic APIs.

Numeric Scalars

The integral scalars and floating point scalars are collectively known as numeric scalar types.

Properties of Integral Scalars

Within this document, the integral scalars are understood to have the following properties which are useful when defining the arithmetic operations:

Numeric Value

The numeric value of an object of integral scalar type is the integer value used for performing arithmetic on that object. For all objects of integral type except bool, the numeric value is the value of the object. For objects of type bool, the numeric value is \(0\) if the object is false and \(1\) if the object is true.

Bitwidth

The bitwidth of an integral scalar type is the number of bits needed to encode the set of values representable by that type. For bool, the bitwidth is \(1\). For all other integral scalar types \(T\), the bitwidth is sizeof(T) * CHAR_BIT.

Base Two Representation

The base two representation of an integer value \(x\) is the unique string of \(N\) bits \(b_{0}b_{1}\ldots b_{N-1}\) such that

\[x \operatorname{mod} 2^N = \sum_{k=0}^{N-1} b_{k} \cdot 2^k\]

where each \(b_i\) has value \(0\) or \(1\) and \(N\) is the bitwidth of \(x\).

Note

The arithmetic operations do not perform integral promotion, so it is necessary to define a numerical and bitwise representation for type bool.

Extents

The Tile C++ APIs use ct::extents to describe the shape of multi-dimensional arrays and tiles. A ct::extents object specifies a length for each dimension of the shape. Information about a dimension’s length may be stored in the object at runtime or may be encoded in the object’s type at compile time.

Example

Three extents objects each describing a shape of \(4 \times 8\). The non-type template parameters of ct::extents determine which dimensions are stored at runtime.

namespace ct = ::cuda::tiles;

// All compile time dimensions
ct::extents<uint32_t, 4, 8> a{};

// One compile time, one runtime dimension
ct::extents<uint32_t, ct::dynamic_extent, 8> b{4};

// All runtime dimensions
ct::extents<uint32_t, ct::dynamic_extent, ct::dynamic_extent> c{4, 8};

When a dimension’s length is stored at runtime, the first template parameter of ct::extents determines the integral type used for storing the dimension value. This type is called the index type and it is also used when performing computations involving the dimension’s length.

Example

Extents objects with various index types.

namespace ct = ::cuda::tiles;

// Uses int16_t as the index type
ct::extents<int16_t, 3, ct::dynamic_extent> a{2};

// Uses uint64_t as the index type
ct::extents<uint64_t, ct::dynamic_extent, ct::dynamic_extent> b{4, 5};

Further information on the ct::extents type can be found in the extents API reference.

Extents Like

In addition to ct::extents, any type which satisfies the requirements of extents like can be used in Tile C++ APIs that expect a shape. This allows Tile C++ to interoperate with extents implementations that are provided by other C++ libraries.

Let \(E\) be a possibly cv-qualified type and \(e\) a glvalue expression of type E const. Let dim be a glvalue expression of type E::rank_type const when that is well-formed.

\(E\) is extents like if \(E\) models the standard library concept std::copyable 22 and the following expressions are well formed and fulfill the indicated requirements:

Expression

Requirements

E::rank_type

A cv-unqualified integral 5 type.

Semantics

Type used for identifying a dimension by its index.

E::index_type

A cv-unqualified integral 5 type.

Semantics

Type used to represent the length of a runtime dimension.

E::rank()

Static member function returning a prvalue of type E::rank_type. Must be a constant expression.

Semantics

Indicates the number of dimensions encoded in the extents type. The return value shall be non-negative and the expression must be equality-preserving. 23

E::rank_dynamic()

Static member function returning a prvalue of type E::rank_type. Must be a constant expression.

Semantics

Indicates the number of dimensions which are known only at runtime. The expression shall be equality-preserving. 23

The return value must be exactly the number of dimensions whose length is as reported as ct::dynamic_extent by E::static_extent(i) for \(0 \leq i < \text{E::rank()}\).

E::static_extent(dim)

Static member function returning a prvalue of type size_t.

Semantics

Yields the statically known length at dimension dim. If the length at dim is dynamic, yields ct::dynamic_extent.

No requirements are placed on the behavior for values of dim that are not in the range \(0 \leq dim < \text{E::rank()}\). If dim is a constant expressions, the expression as a whole must be a constant expression.

The expression must be equality-preserving. 23

e.extent(dim)

Non-static member function returning a prvalue of type E::index_type.

Semantics

Yields the length at dimension dim. The expression must be equality-preserving. 23 The return value shall be non-negative.

For each value \(0 \leq dim < \text{E::rank()}\), one of the following conditions holds:

  1. e.extent(dim) == E::static_extent(dim), or

  2. E::static_extent(dim) == ct::dynamic_extent

No requirements are placed on the behavior for values of dim that are not in the range \(0 \leq dim < \text{E::rank()}\).

The behavior is undefined if E::static_extent(dim) is not representable in the type E::index_type and is not ct::dynamic_extent.

If \(E\) does not fulfill the semantic requirements above, any usage of \(E\) in a Tile C++ API results in undefined behavior.

Note

The size_type member alias is not required to be present. \(E\) need not be equality comparable or default constructible. Comparisons of extents are done via ct::extents_equal().

Properties of Extents

This section describes properties which are common to all extents like types.

Let \(e\) be an object of extents like type \(E\).

extents rank

The rank of \(E\) is given by E::rank() and indicates the number of dimensions encoded in the extents.

Example

The rank of ct::extents<uint32_t, 3, 4, 5> is \(3\).

The rank of ct::extents<uint32_t> is \(0\). A rank \(0\) extents is considered to have exactly one element.

static dimension
dynamic dimension

A static dimension is a dimension of \(E\) which is known at compile time. A dynamic dimension is a dimension of \(E\) known only at runtime. A dimension \(dim\) is dynamic if E::static_extent(dim) yields ct::dynamic_extent. Otherwise it is static.

Example

In the type ct::extents<uint32_t, 4, ct::dynamic_extent, 7>, dimensions \(0\) and \(2\) are static and dimension \(1\) is dynamic.

dynamic rank

The dynamic rank of \(E\) is the number of dynamic dimensions of \(E\). The dynamic rank is given by E::dynamic_rank().

extents shape

The shape of \(e\) is described by the E::static_extent and e.extent member functions. When E::static_extent(dim) returns the special sentinel value ct::dynamic_extent, the length of dimension \(dim\) is known only at runtime and may be retrieved with e.extent(dim). Otherwise, E::static_extent(dim) returns the statically known length of dimension \(dim\).

The shape of \(e\) is the collection of lengths given by:

\[e.\text{extent}(0) \times e.\text{extent}(1) \times \cdots \times e.\text{extent}(E\text{::rank}() - 1)\]

The notation \(e_i\) refers to the length of dimension \(i\) of object \(e\). In the case where all the dimensions are statically known, the notation \(E_i\) may be used to refer to the statically known length of dimension \(i\) given the type \(E\).

Example

The following objects all have shape \(3 \times 4 \times 7\):

  • ct::extents<uint32_t, 3, 4, 7>{}

  • ct::extents<uint32_t, 3, ct::dynamic_extent, 7>{4}

  • ct::extents<uint32_t, ct::dynamic_extent, 4, ct::dynamic_extent>{3, 7}

singleton dimension

A singleton dimension of \(e\) is a dimension of length 1.

Example

In the following object, dimension \(1\) is singleton

  • ct::extents<uint64_t, 4, 1>{}

extents size

The size of \(e\) is the product of its dimensions and represents the number of elements in a multi-dimensional array whose shape is \(e\). If the rank of \(e\) is \(0\) its size is \(1\).

When an \(e\) has no dynamic dimensions, its size is a property of the type \(E\) and may be computed at compile time.

The size is understood to be a mathematical property of \(e\); it is a well formed value even if its computation in finite precision would trigger integer overflow.

When not clear from context, the term object size shall refer to the byte size of the object \(e\).

Example

The size of ct::extents<uint32_t, 4, 8> is \(32\).

The size of ct::extents<uint32_t, 5> is \(5\).

The size of ct::extents<uint32> is \(1\).

extent equivalent

Two objects of extents like type are said to be extent equivalent if they have the same rank and their corresponding extent values are equal, regardless of their index types, rank types, or whether a given dimension is static or dynamic.

Example

The following two objects are extent equivalent despite having different types:

  • ct::extents<int16_t, 4, 8>{}

  • ct::extents<uint32_t, ct::dynamic_extent, 8>{4}

extents index space

Given \(N\) as the rank of \(e\), the index space of \(e\) is the set of integer \(N\)-tuples that form valid indices into a multi-dimensional array described by \(e\):

\[[0, e_0) \times [0, e_1) \times \cdots \times [0, e_{N-1})\]

In the case where \(E\) has only statically known dimensions, the index space of \(E\) refers to the index space of any instance of \(E\).

Example:

The index space of the object ct::extents<2, ct::rank_dynamic>{3} includes the following indices:

  • \((0, 0)\), \((0, 1)\), \((0, 2)\), \((1, 0)\), \((1, 1)\), \((1, 2)\)

shape like

The type \(E\) is said to be shape like if all of its dimensions are statically known.

For convenience, the alias template ct::shape can be used to specify a ct::extents with a uint32_t index type and all static dimensions.

Example

The following expressions produce the same type:

  • ct::shape<4, 5>

  • ct::extents<uint32_t, 4, 5>

shape equivalent

Two shape like types \(T\) and \(U\) are said to be shape equivalent if every instance \(a\) of \(T\) is extent equivalent to every instance \(b\) of \(U\).

Example

The following two types are shape equivalent:

  • ct::extents<uint32_t, 4, 8>

  • ct::extents<uint64_t, 4, 8>

Tiles

The fundamental unit of tile programming is the ct::tile<E, S> data type. This type represents an immutable multi-dimensional array of scalar elements with a compile-time known shape. The template parameter \(E\) specifies the element type and while \(S\) specifies a shape like ct::extents describing the tile shape.

Example

A \(4 \times 8\) tile object with int elements. Note that ct::shape is an alias for ct::extents with a uint32_t index type.

namespace ct = ::cuda::tiles;

ct::tile<int, ct::shape<4, 8>> obj;

For additional information about the ct::tile type, see the tile type reference.

Tile Like Types

The scalar types and the possibly cv-qualified specializations of ct::tile are together known as tile like types. These types implement a common abstraction that can be handled uniformly in Tile APIs.

The Tile C++ APIs generally treat scalars the same way they would handle rank \(0\) tiles. For example, the types int and ct::tile<int, ct::shape<>> behave similarly in Tile C++.

Example

The following types are all tile like types:

  1. double

  2. int const

  3. ct::tile<double, ct::shape<>>

  4. ct::tile<float, ct::shape<4, 8>>

  5. ct::tile<float, ct::shape<1, 1>> const volatile

Properties of Tiles

This section describes the properties common to all tile like types. Let \(T\) be a tile like type.

notation

Let \(a\) be an object of tile like type \(T\) with rank \(N\).

The notation \(T_i\) indicates the length of dimension \(i\) in the shape of \(T\).

For an index \(I = (i_0, i_1, \ldots , i_{N-1})\) in the index space of \(T\), the notation \(a(i_0, i_1, …, i_{N-1})\) and \(a(I)\) denote the element of \(a\) at index \(I\).

tile element type

The element type of \(T\) is

The element type of \(T\) is always a cv-unqualified scalar type.

Example

The element type of ct::tile<double, ct::shape<4>> is double.

The element type of int is int.

tile compatible shape

A shape like type \(S\) with rank \(N\) is said to be tile compatible if all the following hold:

  1. S::index_type is uint32_t

  2. Each dimension length \(S_i\) is a power of \(2\) for \(0 \leq i < N\).

  3. Each dimension length \(S_i\) does not exceed an implementation defined limit for \(0 \leq i < N\).

  4. The size of \(S\) does not exceed an implementation defined limit.

  5. \(S\) is a specialization of ct::extents.

Example

ct::shape<4, 7> is not tile compatible because \(7\) is not a power of two.

ct::shape<0> is not tile compatible because \(0\) is not a power of two.

ct::extents<int16_t, 4, 8> is not tile compatible because its index type is not uint32_t.

ct::shape<> is tile compatible.

tile shape

The shape type of \(T\) is

  • ct::extents<uint32_t> if \(T\) is a scalar.

  • T::shape_type if \(T\) is a specialization of ct::tile.

The shape type of \(T\) is always tile compatible.

Example

The shape type of ct::tile<int, ct::shape<4, 8>> is ct::shape<4, 8>.

The shape type of int is ct::shape<>.

tile rank
tile size
tile index space

The rank, size, and index space of \(T\) refer to the rank, size, and index space of the shape type of \(T\).

row major arrangement

Let \(a\) be an instance of \(T\) and \(N\) be its rank.

The row major arrangement of \(a\) is a sequence of elements \(s\) such that the following statement holds for any index \(I = (i_0, i_1, … i_{N-1})\) in the index space of \(a\):

\[a(I) = s(i_0 \cdot ( T_1 \cdot \ldots \cdot T_{N-1} ) + i_1 \cdot (T_2 \cdot \ldots \cdot T_{N-1}) + \cdots + i_{N-1})\]

Equivalently:

\[a(I) = \sum_{j=0}^{N - 1} i_{j} \cdot \prod_{k=j+1}^{N - 1} T_k\]

Example

Consider a \(4 \times 2\) tile ct::tile<int, ct::shape<4, 2>> representing the elements:

\[\begin{split}\begin{pmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \\ 6 & 7 \\ \end{pmatrix}\end{split}\]

Its row major arrangement is the sequence

\[(0, 1, 2, 3, 4, 5, 6, 7)\]
singleton tile

If \(T\) has exactly one element it is known as a singleton tile.

Example

All the following types are singleton tiles:

  • double

  • ct::tile<double, ct::shape<>>

  • ct::tile<double, ct::shape<1, 1>>

A specialization of ct::tile containing exactly one element is not a scalar but it is a singleton tile.

integral tile
basic floating point tile
restricted floating point tile
floating point tile
arithmetic tile
numeric tile
pointer tile

These terms indicate a tile like type whose element type is a scalar of the appropriate category.

Example

The following types are integral tiles:

  • int

  • ct::tile<int, ct::shape<4>>

The following types are pointer tiles:

  • double*

  • ct::tile<double*, ct::shape<8>>

elementwise operation

Consider a set of \(k\) tile like operands \(a_1, a_2, \ldots, a_k\) of the same type \(T\). An elementwise operation is the application of some operator \(\operatorname{op}(e_1, e_2, \ldots, e_{k})\) to the corresponding elements of each operand to form a new result of type \(T\). The result \(r\) is defined as

\[r(I) = \operatorname{op}(a_1(I), a_2(I), \ldots, a_k(I))\]

for each index \(I\) in the index space of \(T\).

Conversions

Many Tile C++ APIs convert between tile like types before performing a computation. These conversions are broadly categorized into shape conversions, element type conversions, and arithmetic conversions.

Extended Floating Point Types

For the purposes of defining the conversions of this section, the following floating point scalar types are considered to be extended floating point types 6:

The floating point conversion ranks of these types form a partial order such that the each type on the left side of the following table has a lesser conversion rank than the corresponding type on the right:

Lesser Rank

Greater Rank

__nv_fp8_e4m3

__half

__nv_fp8_e4m3

__nv_bfloat16

__nv_fp8_e5m2

__half

__nv_fp8_e5m2

__nv_bfloat16

__half

__nv_tf32

__nv_bfloat16

__nv_tf32

__nv_tf32

float

float

double

Of the floating point types which are supported as scalars, no two types have the same floating point conversion rank and no subrank ordering is defined.

Example

The conversion ranks of __nv_fp8_e4m3 and __nv_fp8_e5m2 are unordered with respect to each other.

The conversion ranks of __half and __nv_bfloat16 are unordered with respect to each other.

The C++23 extended floating point types (for example std::float16_t) are not supported as scalars and no conversion rank or subrank for them is defined with respect to the floating point scalar types.

Element Type Conversions

scalar conversion

The scalar conversions are a set of behaviors for converting an object of one scalar type to another scalar type. A scalar conversion from an object of type \(T\) to an object of type \(U\) is a standard conversion sequence 7 which converts \(T\) to \(U\), except that the following additional behaviors are specified:

  1. Any floating point scalar type may be converted to any other floating point scalar type as part of the standard floating-point conversions 8. The conversion exists regardless of whether the source and destination types are standard or extended floating point types.

    Note

    This rule relaxes the floating-point conversions 8 by allowing conversions to and from extended floating point types of lower or unordered conversion rank.

    A conversion from a floating point scalar to a floating point scalar of a lower or unordered conversion rank is still considered to be a narrowing conversion.

    Example

    A scalar conversion exists from double to __half.

  2. As part of the floating-point conversions 8, the convertFormat 24 operation with the roundTiesToEven 3 rounding mode is used to convert the source to the destination, except that:

    1. If the target type is __nv_fp8_e5m2 or __nv_fp8_e4m3 and the source value is non-finite or lies outside the representable finite range of the target type, the result value is unspecified.

    Note

    This rule specifies an implementation defined behavior of the standard floating-point conversions 8.

  3. As part of the floating-integral conversions 9, if the destination is a floating point scalar the convertToIntegerTiesToEven 25 operation is used to convert the numeric value of the source to the destination.

    If the source integer value is outside the representable finite range of the destination type, then

    1. If the target type is __nv_fp8_e5m2 or __nv_fp8_e4m3, the result is unspecified.

    2. Otherwise, the result is an infinity whose sign matches the sign of the (non-zero) source value.

    Note

    This rule specifies an implementation defined behavior of the standard floating-integral conversions 9 and removes undefined behavior for out of range source values.

For the purposes of this conversion, the restricted floating point scalars and __nv_bfloat16 are considered to be IEEE 754 arithmetic formats of appropriate range and precision.

Note

The precision of __nv_tf32 is not fully specified, so conversions involving __nv_tf32 are not fully specified.

If no standard conversion sequence 7 with the above modifications exists from an object of type \(T\) to the object of type \(U\), no scalar conversion exists for those objects. If the standard conversion sequence from \(T\) to \(U\) would be narrowing, we say the scalar conversion from \(T\) to \(U\) is narrowing.

The above rules only affect certain tile C++ APIs that make use of the scalar conversions. The behavior of core C++ language semantics or builtin arithmetic operations is not affected by the above rules.

Example

There exists a scalar conversion from __nv_bfloat16 to __half. The round nearest even rounding mode is used to select the result value. The conversion is narrowing because the conversion rank of __nv_bfloat16 is unordered with respect to __half.

tile conversion

The tile conversions are a set of behaviors for converting an object \(a\) of tile like type \(T\) to a tile like type \(U\). A tile conversion exists from \(T\) to \(U\) if there exists a scalar conversion from the element type of \(T\) to the element type of \(U\) and the shapes of \(T\) and \(U\) are shape equivalent.

The result of the conversion is a new object of type \(U\) formed by the elementwise application of scalar conversion to the element type of \(U\).

A tile conversion is said to be narrowing if it invokes a narrowing scalar conversion.

Example

There exists a tile conversion from int to ct::tile<int, ct::shape<>>.

There does not exist a tile conversion from int to ct::tile<int, ct::shape<1>>.

bool tile conversion

The bool tile conversion of an object \(a\) of tile like type is the process of converting \(a\) to type ct::tile_with_element_t<T, bool> using tile conversion.

Note

All tile like are bool tile convertible.

The result of bool tile conversion of a scalar is a scalar.

Shape Conversions

A tile may be converted to a tile of a different shape through broadcasts. A broadcast will replicate elements of the tile along a given dimension to expand the tile to the desired shape. There are two styles for this conversion:

  1. One tile can be broadcasted to match a specific shape. This is know as broadcast conversion.

  2. Two tiles can be broadcasted simultaneously so that their shapes match. This is called mutual broadcast conversion.

The following sections describe these conversions along with supporting definitions.

broadcastable to

Let \(S\) and \(B\) be two shape like types with ranks \(N\) and \(M\) respectively. We say that \(S\) is broadcastable to \(B\) if \(N \leq M\) and for each dimension \(0 \leq i < N\) one of the following conditions holds:

  • \(S_i = B_{i + M - N}\) or

  • \(S_i = 1\)

For a tile like type \(T\), we say that \(T\) is broadcastable to \(B\) if its shape is broadcastable to \(B\) and \(B\) is tile compatible.

Example

The shape on the left hand side of the table is broadcastable to the corresponding shape on the right:

Source

Broadcast Target

ct::shape<4, 1>

ct::shape<4, 8>

ct::shape<2>

ct::shape<4, 2>

ct::shape<5, 2>

ct::shape<5, 2>

Each tile like type on the left is broadcastable to the shape on the right:

Source

Broadcast Target

double

ct::shape<4, 8>

ct::tile<float, ct::shape<2>>

ct::shape<8, 2>

ct::tile<float, ct::shape<2, 1, 8>>

ct::shape<2, 16, 8>

broadcast conversion

Let \(a\) be an object of tile like type \(T\) with shape \(S\) and rank \(N\). Let \(B\) be a tile compatible shape of rank \(M \geq N\) such that \(T\) is broadcastable to \(B\).

The broadcast conversion of \(a\) to shape \(B\) replicates the values of \(a\) along its singleton dimensions to match the shape \(B\).

The broadcast conversion of \(a\) to \(B\) is a ct::tile object \(b\) whose element type matches that of \(a\) and whose shape is \(B\). For each index \(I = (i_0, i_1, \cdots, i_{M-1})\) in the index space of \(B\),

\[b(i_0, i_1, \ldots, i_{M-1}) = a(f_0(i_{M-N}), f_1(i_{1 + M - N}), \ldots, f_{N-1}(i_{M - 1}))\]

For each source dimension \(0 \leq k < N\), the function \(f_k\) is defined as

\[\begin{split}\begin{cases} f_k(i) = 0 & \text{if} S_k = 1 \\ f_k(i) = i & \text{otherwise} \end{cases}\end{split}\]

The broadcast conversion of an object is always a specialization of ct::tile even if the source is a scalar type.

Example 1

Consider the following tile of type ct::tile<int, ct::shape<4, 1>>:

\[\begin{split}\begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}\end{split}\]

The broadcast conversion of this object to the shape ct::shape<4, 8> is the tile object ct::tile<int, ct::shape<4, 8>>:

\[\begin{split}\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 2 & 2 & 2 & 2 & 2 & 2 & 2 & 2 \\ 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 \\ 4 & 4 & 4 & 4 & 4 & 4 & 4 & 4 \end{pmatrix}\end{split}\]

Example 2

Consider the following tile of type ct::tile<int, ct::shape<2>>:

\[\begin{pmatrix} 1 & 2 \end{pmatrix}\]

The broadcast conversion of this object to the shape ct::shape<4, 2> is the tile object ct::tile<int, ct::shape<4, 2>>:

\[\begin{split}\begin{pmatrix} 1 & 2 \\ 1 & 2 \\ 1 & 2 \\ 1 & 2 \end{pmatrix}\end{split}\]
shape mutual broadcast compatible

Let \(T\) and \(U\) be two shape like types with ranks \(N\) and \(M\) respectively. Consider the common suffix of their dimensions formed by \(T_{N - 1 - i}\) and \(U_{M - 1 - i}\) for \(0 \leq i < \operatorname{min}(N, M)\).

We say that \(T\) and \(U\) are mutual broadcast compatible if for each \(i\), any of the following hold:

  • \(T_{N - 1 - i} = U_{M - 1 - i}\) or

  • \(T_{N - 1 - i} = 1\) or

  • \(U_{M - 1 - i} = 1\)

Example

The following pairs of types are mutual broadcast compatible:

  • ct::shape<5, 1> and ct::shape<1, 3>

  • ct::shape<2, 4, 1> and ct::shape<4, 3>

  • ct::shape<> and ct::shape<3, 4, 5>

The types ct::shape<4, 2> and ct::shape<5, 2> are not broadcast compatible because the first dimensions are non-singleton and do not match.

mutual broadcast shape

Let \(T\) and \(U\) be broadcast compatible with ranks \(N\) and \(M\) respectively. Without loss of generality, assume \(N \leq M\). The mutual broadcast shape of \(T\) and \(U\) is a cv-unqualified specialization of ct::extents \(B\) satisfying the following conditions:

  • The rank of \(B\) is \(M\)

  • The index type of \(B\) is uint32_t

  • \(B_i = U_i\) for each \(i\) in \(0 \leq i < M - N\)

  • \(B_i = \operatorname{max}(T_{i - (M - N)}, U_i)\) for each \(i\) in \(M - N \leq i < M\).

The broadcast type of \(T\) and \(U\) is a specialization of ct::extents even if neither \(T\) nor \(U\) are specializations of ct::extents.

Example

The third column in the following table is the mutual broadcast shape of the types in the first two columns:

Type 1

Type 2

Broadcast Type

ct::shape<4, 1>

ct::shape<4, 8>

ct::shape<4, 8>

ct::shape<1, 5>

ct::shape<3, 1>

ct::shape<3, 5>

ct::shape<4, 2, 1>

ct::shape<2, 6>

ct::shape<4, 2, 6>

ct::shape<4>

ct::shape<1, 2, 1>

ct::shape<1, 2, 4>

tile mutual broadcast compatible

Let \(T\) and \(U\) denote two tile like types. We say that \(T\) and \(U\) are mutual broadcast compatible if:

  1. The shapes of \(T\) and \(U\) are mutual broadcast compatible, and

  2. The resulting mutual broadcast shape is a tile compatible shape.

The mutual broadcast shape might not be tile compatible if it exceeds the implementation defined maximum size for a tile shape.

Example

The following types are all pairwise broadcast compatible

  1. ct::tile<int, ct::shape<2, 1, 8>>

  2. ct::tile<int, ct::shape<1, 4, 1>>

  3. ct::tile<float, ct::shape<4, 8>>

  4. double

mutual broadcast conversion

Let \(a\) and \(b\) be objects of broadcast compatible tile like types \(T\) and \(U\) respectively. The mutual broadcast conversion of \(a\) and \(b\) is a process for converting \(a\) and \(b\) to a common shape. It proceeds as follows:

  1. Let \(B\) denote the broadcast shape for the shapes of \(T\) and \(U\). Perform broadcast conversion to \(B\) on the objects \(a\) and \(b\) to yield two new objects of ct::tile type. For the remaining stage, let \(a\) and \(b\) denote the result of the preceding conversion.

  2. If either \(T\) or \(U\) are specializations of ct::tile, the process is done and the converted results are \(a\) and \(b\).

  3. Otherwise, \(T\) and \(U\) are both scalar types and the result of the preceding step are two singleton tiles. Form two scalar objects by extracting the single element from \(a\) and \(b\) respectively. The two scalar objects are the result of the conversion.

Note

The result of mutual broadcast conversion between a scalar and a tile is a pair of tiles.

Example

Consider the following two tiles whose types are ct::tile<int, ct::shape<1, 4>> and ct::tile<int, ct::shape<2, 1>> respectively:

\[\begin{split}\begin{pmatrix} 1 & 2 & 3 & 4 \end{pmatrix} \quad \quad \quad \begin{pmatrix} 5 \\ 6 \end{pmatrix}\end{split}\]

The mutual broadcast conversion of these objects are two tiles of type ct::tile<int, ct::shape<2, 4>>:

\[\begin{split}\begin{pmatrix} 1 & 2 & 3 & 4 \\ 1 & 2 & 3 & 4 \end{pmatrix} \quad \quad \begin{pmatrix} 5 & 5 & 5 & 5 \\ 6 & 6 & 6 & 6 \end{pmatrix}\end{split}\]

Arithmetic Conversions

The arithmetic conversions are used for converting among arithmetic tiles in binary operations. In general, an arithmetic conversion will first broadcast the operands to a common shape, then perform tile conversion to make a common element type. Unlike standard C++ arithmetic conversions, these arithmetic conversions do not perform integer promotion.

Two variants of arithmetic conversions are used depending on the operation. For most binary arithmetic operations, the data type of a tile operand is preferred over scalar operands. However for comparison operations, whether or not the operands are tiles or scalars is not considered when determining the common data type to convert to.

The following section describes the behavior of arithmetic conversions.

arithmetic common type

Let \(T\) and \(U\) be arithmetic scalar types.

The arithmetic common type of \(T\) and \(U\) is a type \(C\) to which both types may be scalar converted. This conversion roughly corresponds to the usual C++ arithmetic conversions without integer promotion behavior.

For the purposes of determining \(C\), the non-standard floating point scalar types are considered to be extended floating point types of appropriate conversion ranks as described in extended floating point types.

\(C\) is determined as follows:

  1. If either \(T\) or \(U\) are floating point scalar, \(C\) is the common type determined by the C++ usual arithmetic conversions 10. If \(C\) could not be determined according to these rules, the arithmetic common type does not exist.

  2. Otherwise, both \(T\) and \(U\) are integral scalar types. If \(T\) and \(U\) are the same type, \(C\) is that type.

  3. Otherwise, if one of \(T\) or \(U\) is signed and the other is unsigned, the following rules apply. Let \(S\) be the signed type and \(U\) the unsigned type:

    1. If the conversion rank of \(U\) is greater than the conversion rank of \(S\), \(C\) is \(U\)

    2. Otherwise, if \(S\) can represent all values of \(U\), \(C\) is \(S\).

    3. Otherwise, \(C\) is the unsigned type corresponding to \(S\).

  4. Otherwise, \(T\) and \(U\) are have the same signedness. If one has a greater conversion rank than the other, \(C\) is the type of the greater rank.

  5. Otherwise, \(T\) and \(U\) are distinct integral types of the same signedness and conversion rank. \(C\) is the type with the greater conversion subrank defined as follows:

    1. The subranks of char, char8_t, char16_t, char32_t, and wchar_t are each less than the subrank of their corresponding underlying integer 26 type.

    2. If char has the same underlying type as any of char8_t, char16_t, char32_t, or wchar_t, the subrank of char is less than the subrank of the other type.

Example

The table below shows examples of the arithmetic common type on a typical target system.

Type 1

Type 2

Common Type

int

double

double

__half

float

float

__half

__nv_bfloat16

<illformed>

short

short

short

char16_t

unsigned short

unsigned short

arithmetic comparison conversion
arithmetic tile conversion

The arithmetic tile conversion and arithmetic comparison conversion receive two tile like operands and convert them to a common element type and shape.

Let \(a\) and \(b\) be objects of tile like type \(T\) and \(U\) with element types \(TE\) and \(UE\) respectively. A common element type \(E\) is determined as follows:

  1. For arithmetic comparison conversion, \(E\) is the arithmetic common type of \(TE\) and \(UE\).

  2. For arithmetic tile conversion, \(E\) is the arithmetic common type of \(TE\) and \(UE\) if \(T\) and \(U\) are both scalars or both non-scalars. Otherwise, exactly one of \(T\) or \(U\) is a scalar and \(E\) is the element type of the non-scalar.

If \(E\) could not be determined because the arithmetic common type doesn’t exist, the conversion is ill-formed.

After \(E\) is determined, the conversion proceeds as follows:

  1. Operands \(a\) and \(b\) undergo mutual broadcast conversion to form new objects \(a'\) and \(b'\) of the same shape \(B\). If the mutual broadcast conversion is ill-formed, the arithmetic conversion as a whole is ill-formed.

  2. If \(a\) and \(b\) are both scalars, \(a'\) and \(b'\) undergo tile conversion to \(E\).

  3. Otherwise, \(a'\) and \(b'\) undergo tile conversion to ct::tile<E, B>.

The conversion as a whole is ill-formed if the final tile conversion would be narrowing unless the source element type of the narrowing conversion is integral and the target is floating point.

Example 1

In the following code, arithmetic tile conversion is used to convert the operands prior to performing the addition. The float values are converted to double and both arguments are broadcasted to a common shape:

\[\begin{split}\begin{pmatrix} 2 & 6 \end{pmatrix} - \begin{pmatrix} 4 \\ 1 \end{pmatrix} = \begin{pmatrix} -2 & 2 \\ 1 & 5 \end{pmatrix}\end{split}\]
namespace ct = ::cuda::tiles;
using i32x1x2 = ct::tile<int, ct::shape<1, 2>>;
using i32x2x1 = ct::tile<int, ct::shape<2, 1>>;
using f32x1x2 = ct::tile<float, ct::shape<1, 2>>;
using f64x2x1 = ct::tile<double, ct::shape<2, 1>>;
using f64x2x2 = ct::tile<double, ct::shape<2, 2>>;

float xData[1][2] = {
  {2, 6},
};

double yData[2][1] = {
  {4},
  {1},
};

f32x1x2 x = ct::load(&xData[0][0] + ct::iota<i32x1x2>());
f64x2x1 y = ct::load(&yData[0][0] + ct::iota<i32x2x1>());
f64x2x2 result = x - y;

Example 2

In (1), arithmetic tile conversion prefers the ct::tile argument when selecting a target element type. The double scalar would need to be narrowed to int making the conversion ill-formed.

In (2), arithmetic comparison conversion does not prefer either argument when determining the target element type. The elements of the ct::tile are converted to double and the comparison is well-formed.

namespace ct = ::cuda::tiles;
using i32x8 = ct::tile<int, ct::shape<8>>;

i32x8 x = ct::full<i32x8>(42);
2.0 * x;  // (1) ill-formed
2.0 == x; // (2) OK

Example 3

Example A

\(T\)

ct::tile<int, ct::shape<4, 1>>

\(U\)

ct::tile<float, ct::shape<1, 8>>

Arith. Tile Conversion

ct::tile<float, ct::shape<4, 8>>

Arith. Comp. Conversion

ct::tile<float, ct::shape<4, 8>>

Example B

\(T\)

float

\(U\)

ct::tile<int, ct::shape<4, 8>>

Arith. Tile Conversion

ill-formed: float -> int narrowing

Arith. Comp. Conversion

ct::tile<float, ct::shape<4, 8>>

Example C

\(T\)

__nv_bfloat16

\(U\)

ct::tile<__half, ct::shape<4, 8>>

Arith. Tile Conversion

illformed: __nv_bfloat16 -> __half narrowing

Arith. Comp. Conversion

ill-formed: Unordered conversion ranks

Example D

\(T\)

unsigned int

\(U\)

ct::tile<int, ct::shape<4, 8>>

Arith. Tile Conversion

ill-formed: unsigned int -> int narrowing

Arith. Comp. Conversion

ill-formed: int -> unsigned int narrowing

arithmetic tile promotion

The arithmetic tile promotion of an object \(a\) of tile like type is the elementwise application of the standard integral promotions rules 11 to \(a\). The result is a new object of the same shape but possibly different element type as \(a\). The kind of object is unchanged by the operation: if \(a\) was originally a scalar it remains a scalar. If \(a\) was original a ct::tile it remains a ct::tile.

Tensor Span

CUDA Tile C++ uses tensor span like types to describe multi-dimensional arrays in memory. A tensor span consists of

  1. A pointer to the base location of the array in memory.

  2. A layout mapping describing the array shape and the mapping of logical multi-dimensional indices to linear indices into memory.

  3. An accessor policy which decorates the tensor span with additional information about how elements of the array may be accessed.

For performance, data referenced by a tensor span should not be indexed directly. Instead, the tensor span should be wrapped in a view type that represents a tiling of the data. Tiles may be loaded from this view type directly.

Example

A \(2 \times 4\) tensor span wrapping a pointer to memory. It uses the default accessor policy and the default row-major layout.

namespace ct = ::cuda::tiles;
using namespace ct::literals;

float data[2][4] = {
  {0, 1, 2, 3},
  {4, 5, 6, 7}
};

auto span = ct::tensor_span{&data[0][0], ct::extents{2_ic, 4_ic}};

The ct::tensor_span type implements tensor span like and is similar to the C++23 std::mdspan type.

Layout Mapping

A layout mapping is an object describing the shape of a multi-dimensional array and its in-memory arrangement of elements. The shape is specified using an extents like type consisting of static or dynamic dimension values. The arrangement of elements is specified by a collection of statically or dynamically known stride values for each dimension which indicate the distance in memory between consecutive elements along that dimension. The strides determine a mapping from the multi-dimensional index space of array elements to their linear position in memory.

Let \(M\) denote a possibly cv-qualified type, \(m\) a const glvalue expression of type \(M\) and \(dim\) a const glvalue expression of type M::rank_type, when that is well-formed. \(M\) is a layout mapping type if \(M\) models the C++ standard library concept std::copyable and expressions in the following table are well formed and meet the indicated requirements:

Expression

Requirements

M::extents_type

A cv-unqualified extents like type.

Semantics

Indicates the static components of the multi-dimensional array shape represented by this layout mapping.

M::rank_type

Same type as M::extents_type::rank_type.

M::index_type

Same type as M::extents_type::index_type.

M::layout_type

A cv-unqualified layout policy type such that M::layout_type::mapping<M::extents_type> is the same type as M

Semantics

Indicates the layout policy type which may be used to rebuild this layout mapping given only an extents like type.

M::is_always_strided()

Static member function returning a prvalue of type bool. The expression M::is_always_strided() shall be a constant expression and shall always yield true.

Semantics

Indicates that the mapping describes a strided layout. Currently, only strided layouts are supported in tensor span like types.

ct::layout_mapping_static_stride<M>{}(dim)

Specialization of traits class ct::layout_mapping_static_stride implementing the function call operator. The expression must yield a prvalue of type size_t.

Semantics

Yields the statically known stride for dimension dim. If that stride is dynamic, returns ct::dynamic_extent. The expression shall be equality-preserving. 23

No requirements are placed on the behavior for values of dim that are not in the range \(0 \leq dim < N\) where \(N\) is the rank of M::extents_type. For all other values of \(dim\) which are themselves constant expressions, the expression as a whole must be a constant expression.

m.stride(dim)

Non-static member function returning a prvalue of type M::index_type.

Semantics

Yields the runtime stride for dimension \(dim\). The expression shall be equality-preserving. 23 The return value shall be non-negative.

No requirements are placed on the behavior for values of dim that are not in the range \(0 \leq dim < N\) where \(N\) is the rank of M::extents_type.

The value of the expression shall be ct::layout_mapping_static_stride<M>{}(dim) when dimension \(dim\) corresponds to a static stride. In this case, the behavior is undefined if this value is not representable in M::index_type.

m.extents()

Member function yielding a const glvalue expression of type M::extents_type.

Semantics

Indicates the shape of the array. The expression shall be equality-preserving. 23

Note

A layout mapping need not model equality comparable. Comparison of layout mappings is done by ct::layout_mapping_equal().

If \(M\) does not fulfill the semantic requirements above, any usage of \(M\) in a Tile C++ API results in undefined behavior.

The remainder of this section describes the properties common to all layout mapping types. Let \(m\) be an object of layout mapping type \(M\) and let \(N\) be the rank of M::extents_type.

layout mapping shape
layout mapping rank
layout mapping size
layout mapping index space

The shape, rank, size and index space of \(m\) refers to the shape, rank, size, and index space of the extents like object m.extents().

layout mapping function

The layout mapping function of \(m\) is a function \(m : \mathbb{I} \rightarrow \mathbb{N}\) from the index space \(\mathbb{I}\) of \(m\) to natural numbers representing the location of each array index \(I = (i_0, i_1, \ldots, i_{N-1}) \in \mathbb{I}\) in a linear index space. The layout mapping function is defined as:

\[m(I) = \text{m.stride}(0) \cdot i_0 + \text{m.stride}(1) \cdot i_1 + \ldots + \text{m.stride}(N - 1) \cdot i_{N - 1}\]

The mapping function is said to be injective if for every pair of of distinct indices \(I\) and \(I'\) in the index space of \(m\), their mappings are distinct: \(m(I) \neq m(I')\).

layout mapping equivalent

Objects \(x\) and \(y\) are said to be layout mapping equivalent if all of the following hold:

  1. They are both layout mappings of rank \(N\).

  2. The objects x.extents() and y.extents() are extent equivalent.

  3. For each dimension \(dim\) in the range \(0 \leq dim < N\), the strides x.stride(dim) and y.stride(dim) are the same value irrespective of index type.

layout policy
layout policy mapping type

A layout policy is a factory for producing layout mapping types given an extents like type describing an array shape. Let \(P\) be a possibly cv-qualified type and \(E\) a cv-unqualified extents like type. \(P\) is a layout policy for extents type \(E\) if the expression P::mapping_type<E> is well formed and yields a layout mapping type such that P::mapping_type<E>::extents_type is \(E\).

Accessor Policy

An accessor policy describes how the elements of a multi-dimensional array are accessed. A type \(A\) is an accessor policy if \(A\) models the C++ standard library concept std::copyable and the following expressions are well-formed and have the indicated requirements:

Expression

Requirements

A::element_type

Possibly cv-qualified scalar type.

Semantics

Indicates the element type yielded by an access through \(A\).

A::data_handle_type

Same type as A::element_type*

A::reference

Same type as A::element_type&

ct::enable_contiguous_accessor_policy<A>

Specialization of variable template ct::enable_contiguous_accessor_policy yielding true.

Semantics

Indicates that accessor policy behaves like a simple pointer dereference into a contiguous region of memory. Attests that an access at size_t index \(idx\) from base pointer \(p\) of type A::data_handle_type is a read or write through the glvalue p[idx].

If \(A\) does not fulfill the semantic requirements above, any usage of \(A\) in a Tile C++ API results in undefined behavior.

Tensor Span Like

A tensor span like type represents a handle to a multi-dimensional array in memory. Let \(T\) be a type and \(t\) a const glvalue expression of type \(T\). \(T\) is tensor span like if it models the C++ standard library concept std::copyable and the expressions in the following table are well formed and fulfill the indicated requirements:

Expression

Requirements

T::mapping_type

A cv-unqualified layout mapping type

T::accessor_type

A cv-unqualified accessor policy type

T::extents_type

Same type as T::mapping_type::extents_type

T::layout_type

Same type as T::mapping_type::layout_type

T::index_type

Same type as T::extents_type::index_type

T::rank_type

Same type as T::extents_type::rank_type

T::element_type

Same type as T::accessor_type::element_type

T::value_type

Same type as remove-cv-t<T::element_type>

T::data_handle_type

Same type as T::accessor_type::data_handle_type

T::reference

Same type as T::accessor_type::reference

t.data_handle()

Non static member function yielding a prvalue of type T::data_handle_type.

Semantics

Produces a pointer to the start of a multi-dimensional array. The expression must be equality-preserving. 23

t.mapping()

Non static member function yielding a const lvalue of type T::mapping_type.

Semantics

Produces the mapping object describing the layout of the multi-dimensional array. The expression must be equality-preserving. 23

t.accessor()

Non static member function yielding a const lvalue of type T::accessor_type.

Semantics

Produces the accessor policy describing how elements of the multi-dimensional array are to be accessed.

If \(T\) does not fulfill the semantic requirements above, any usage of \(T\) in a Tile C++ API results in undefined behavior.

The remainder of this section describes properties common to all tensor span like types. Let \(T\) be a tensor span like type and \(t\) an object of that type.

tensor span element type
tensor span value type

The type T::element_type is called the element type of \(T\) and is the type of the elements of the multi-dimensional array referenced by \(t\). A const qualification of T::element_type indicates whether the underlying array may be written through an instance of \(T\). The value type of \(t\) is the element type without cv-qualifiers.

tensor span shape
tensor span rank
tensor span size
tensor span index space

The shape, rank, size, and index space of \(t\) refer to the shape, rank, size, and index space of the layout mapping object t.mapping().

tensor span notation

The notation \(t_k\) designates the length of dimension \(k\) for the shape of t.

tensor span function

Let \(m\) be the layout mapping function of t.mapping() and \(\mathbb{I}\) the index space of \(t\). The tensor span function \(t : \mathbb{I} \rightarrow P\) of \(t\) associates indices in \(\mathbb{I}\) to pointers. If \(p\) is the pointer produced by t.data_handle(), the mapping is defined as follows:

\[t(I) = p + m(I)\]

Numeric Modifiers

The following modifiers are provided as arguments to influence the numerical behavior of certain arithmetic and math APIs.

Rounding Mode

A rounding mode is a policy for selecting a floating point value to represent the result of a mathematical computation. This section describes the rounding mode behaviors while ct::rounding_mode documents the APIs for selecting rounding modes.

The precise rounding modes are rounding modes with well defined IEEE 754 2 semantics. These rounding modes yield an exact result if possible. Otherwise, one of the two consecutive floating point values which bound the infinitely precise result is yielded (subject to appropriate handling for non-finite values). The following precise rounding modes are defined:

Round Ties to Even

Indicates the roundTiesToEven 3 IEEE 754 rounding attribute. The value closest to the infinitely precise result is yielded (subject to certain corner cases which are documented in the IEEE 754 specification).

Round Toward Zero

Indicates the roundTowardZero 3 IEEE 754 rounding attribute. The value closest to but not greater in magnitude to the infinitely precise result is yielded.

Round Toward Negative

Indicates the roundTowardNegative 3 IEEE 754 rounding attribute. The value closest to but not greater than the infinitely precise result is yielded.

Round Toward Positive

Indicates the roundTowardPositive 3 IEEE 754 rounding attribute. The value closest to but not less than the infinitely precise result is yielded.

Note

The above listing is a summary of the rounding mode behavior and does not represent the exact semantics. Consult the IEEE 754 2 specification for the exact behavior of precise rounding modes.

Some operations support imprecise rounding modes. The imprecise rounding modes are not guaranteed to yield floating point values which are adjacent to the infinitely precise result. Maximum error bounds may be documented for the results of operations using imprecise rounding modes. The following imprecise rounding modes are defined:

Round Approximate

Indicates a rounding policy that prioritizes computation speed over precision.

Round Full

Indicates a rounding policy that balances computation speed and precision.

The semantics of the imprecise rounding modes depend on the operation. See the documentation of the relevant operation for details.

Default Rounding Mode

A distinguished rounding mode called the default rounding mode determines the rounding behavior for most floating point APIs when one is not directly provided by the user. The default rounding mode is Round Ties to Even.

Example

In the following code, the Round Toward Negative rounding mode is selected for the addition. The result is 8.0f instead 8.0f + 8 * eps which would be produced by the default behavior.

namespace ct = ::cuda::tiles;
float eps = 0x0.000002p0f;
float result = ct::add(8.0f, 5 * eps, ct::round_toward_negative_t{});

Note

Not all rounding modes are supported by all APIs or operand types. Consult the relevant API for details.

Subnormals Rounding Mode

A subnormals rounding mode is a policy for handling subnormal 4 inputs and results in arithmetic and math APIs. This section describes the subnormals rounding mode behaviors while ct::subnormals_rounding_mode documents the APIs for selecting subnormals rounding modes.

The function \(\operatorname{subround}\) accepts and returns a floating point value and applies a subnormals rounding mode as defined below:

Preserve Subnormals
\[\operatorname{subround}(x) = x\]

Subnormal values are not modified when passed as arguments or returned as results of arithmetic and math APIs

Round Subnormals to Zero
\[\begin{split}\operatorname{subround}(x) = \begin{cases} \\ 0.0 & \text{x is subnormal and has positive sign} \\ -0.0 & \text{x is subnormal and has negative sign} \\ x & \text{otherwise} \\ \end{cases}\end{split}\]

Subnormal operands and results are flushed to sign preserving zero in arithmetic and math APIs.

Default Subnormals Rounding Mode

A distinguished subnormals rounding mode called the default subnormals rounding mode determines the subnormals rounding behavior for operations where the mode is not directly specified by the user. The default subnormals rounding mode is Preserve Subnormals.

Example

In the following code, the Round Subnormals to Zero subnormals rounding mode is selected for the subtraction. The result of the subtraction would normally yield a subnormal value, however, because Round Subnormals to Zero is used, \(0.0\) is returned instead.

namespace ct = ::cuda::tiles;
float result = ct::sub(0x1.1p-126f, 0x1.0p-126f,
                       ct::round_ties_to_even_t{},
                       ct::round_subnormals_to_zero_t{});

Note

Not all subnormals rounding modes are supported by all APIs or operand types. Consult the relevant API for details.

NaN Propagation Mode

A NaN propagation mode determines how maximum and minimum operations should handle NaN inputs. This section describes the NaN propagation mode behaviors while ct::nan_propagation_mode documents the APIs for selecting a NaN propagation mode.

Suppress NaN

In this mode, maximum and minimum operations yield a NaN value only when both of their inputs are NaN. When one input is NaN and the other is a number, the number is returned.

Propagate NaN

In this mode, maximum and minimum operations yield a NaN value when either of their inputs are NaN.

Default NaN Propagation Mode

When a NaN propagation mode is not explicitly specified, the default nan propagation mode is used. The default NaN propagation mode is suppress NaN.

View Padding

Masked load operations on tile views such as ct::partition_view may optionally specify a padding value which determines the elements of the result tile corresponding to out of bounds loads. This section documents the view padding options while ct::view_padding documents the APIs for selecting view padding.

Zero View Padding

Indicates the value \(0\) for loads of integer values and positive zero for loads of floating point values.

Positive Infinity View Padding

Indicates the value \(\infty\).

Negative Infinity View Padding

Indicates the value \(-\infty\).

NaN View Padding

Indicates the value NaN.

Default View Padding

Indicates a default view padding value that is used when one is not explicitly specified. The default view padding is zero view padding.

Memory Model

The purpose of a memory model is to constrain the set of observable values exhibited by memory operations in a multi-threaded program. The Tile C++ memory model is based on the C++ standard memory model with modifications.

Informally, a Tile C++ program behaves as if each tile block is executed by a separate thread except that multiple load and store operations dispatched by a single Tile C++ API may execute simultaneously on different threads. Additionally, the visibility of atomic memory operations are limited to the threads of the specified thread scope.

Example

In the following example, a call to one of the store APIs will dispatch multiple memory operations that write to a single location. In the ct::atomic_store() invocation, no undefined behavior occurs because the writes are atomic and occur in the same thread scope (tile block scope). In the ct::store() invocation, undefined behavior occurs because the writes constitute a data race.

namespace ct = ::cuda::tiles;
using i32x4 = ct::tile<int, ct::shape<4>>;

int x = 0;

auto ptrs = ct::full<ct::tile<int*, ct::shape<4>>>(&x);

// No UB
ct::atomic_store(ptrs, ct::iota<i32x4>(),
                 ct::memory_order_relaxed_t{},
                 ct::thread_scope_block_t{});

// Value of 'x' is either 0, 1, 2, or 3 at this point
printf("%i\n", x);

// UB, due to data race
ct::store(ptrs, ct::iota<i32x4>());

The modifications to the C++ memory model are described below:

Tile Threads

A single thread 12 of execution is generated for each tile block that is launched by a tile kernel. These threads provide the parallel forward progress guarantee 13.

Memory Operation

A memory operation is an evaluation executed by some thread that modifies a memory location 14. Memory operations are classified as either a read, a write, or a read-write. They may be further classified as either weak or strong. Strong memory operations are endowed with a memory order and thread scope.

The following constructs generate one or more memory operations:

  1. A read or write through a glvalue generates one weak read (respectively write) memory operation on the memory location of the glvalue. The memory operation executes in the thread which evaluates the glvalue access.

  2. A call to certain Tile C++ API functions may generate one or more memory operations. The relevant API function specifies:

    1. How many memory operations are generated by a single invocation of the API.

    2. The memory locations accessed by the generated memory operations.

    3. Whether the operations are read, write, or read-write.

    4. Whether the operations are weak or strong.

    5. If the operations are strong, what their memory order and thread scope is.

When a Tile C++ API generates memory operations, each memory operation is evaluated on a separate thread of execution. The beginning of the invocation of the API synchronizes with 15 the beginning of the evaluation of the memory operation in each thread. The end of the evaluation of the memory operation in each thread synchronizes with 15 the end of the API invocation.

Memory Order

The memory order of a strong memory operation corresponds to a C++ memory order 18 and indicates how the memory operation may synchronize with 15 other memory operations. Tile C++ supports the following memory orders:

Relaxed Memory Order

The operation does not imply any synchronization relationship.

Release Memory Order

Certain memory accesses occurring prior to this operation may not be reordered after it. Applicable only to write and read-write operations.

Acquire Memory Order

Certain memory accesses occurring after this operation may not be reordered before it. Applicable only to read and read-write operations.

Acquire Release Memory Order

Certain memory operations may not be reordered before or after this operation. Applicable only to read-write operations.

The relaxed and acquire are known as read memory orders. The relaxed and release are known as write memory orders.

APIs for selecting a memory order are documented in ct::memory_order.

Thread Scope

The thread scope of a strong memory operation \(A\) constrains the strong memory operations which may synchronize with \(A\) as well as the data races that \(A\) may participate in.

Tile Block Scope

The set of threads that execute the memory operations generated by the the Tile C++ API that generated \(A\).

Device Scope

The threads of the GPU device in which \(A\) executes including threads from other SIMT or Tile kernels.

System Scope

The threads of the whole system, including all GPU devices and the CPU host.

APIs for selecting a thread scope are documented in ct::thread_scope.

Default Thread Scope

A distinguished thread scope called the default thread scope specifies the scope used for invocations of Tile C++ atomic memory APIs when a thread scope is not explicitly specified. The default thread scope is system scope.

Strongly Scoped Operations

Two memory operations \(A\) and \(B\) are said to be strongly scoped if they access the same memory location 14, they are both strong memory operations, and \(A\)’s thread scope contains the thread executing \(B\) and \(B\)’s thread scope contains the thread executing \(A\).

Atomic Objects

An object is atomic 20 and has a modification order 19 as specified by the C++ memory model if all memory operations which access that object are strong. The operations which access this object are considered to be atomic operations 21 with the specified memory order 18.

Synchronizes With

Memory operation \(A\) accessing the same atomic object as memory operation \(B\) shall synchronize with 15 \(B\) according to the rules for atomic operations 21 only if \(A\) and \(B\) are strongly scoped.

Data Races

In addition to the situations described in § 6.9.2.2 of [intro.race] ISO/IEC 14882:2024, a data race 16 occurs if two memory operations which are not strongly scoped access the same memory location 14 and neither happens before 17 the other.

Footnotes

1

Within this document the term scalar and scalar type shall refer to this definition and not the term of the same name in the C++ standard.

2(1,2,3,4,5)

See IEEE 754-2019

3(1,2,3,4,5)

See § 4.3 of IEEE 754-2019

4

See § 3.3 of IEEE 754-2019

5(1,2,3,4)

See § 6.8.2.11 [basic.fundamental] of ISO/IEC 14882:2024

6(1,2)

See § 6.8.2.12 [basic.fundamental] of ISO/IEC 14882:2024

7(1,2)

See § 7.3.1.1 [conv.general] of ISO/IEC 14882:2024

8(1,2,3,4)

The standard floating-point conversions are defined in § 7.3.10 [conv.double] of ISO/IEC 14882:2024

9(1,2)

The standard floating-integral conversions are defined in § 7.3.11 [conv.fpint] of ISO/IEC 14882:2024

10

See § 7.4 of [expr.arith.conv] ISO/IEC 14882:2024

11

The integral promotion rules are defined in § 7.3.7 [conv.prom] of ISO/IEC 14882:2024

12

See § 6.9.2.1 [intro.multithread.general] of ISO/IEC 14882:2024

13

See § 6.9.2.3 [forward.progress] of ISO/IEC 14882:2024

14(1,2,3)

See § 6.7.1 [intro.memory] of ISO/IEC 14882:2024

15(1,2,3,4)

See § 6.9.2.2 [intro.race] of ISO/IEC 14882:2024

16

See § 6.9.2.2 [intro.race] of ISO/IEC 14882:2024

17

See § 6.9.2.2 [intro.race] of ISO/IEC 14882:2024

18(1,2)

See § 33.5.4 [atomics.order] of ISO/IEC 14882:2024

19

See § 6.9.2.2 [intro.race] of ISO/IEC 14882:2024

20

See § 6.9.2.2 [intro.race] of ISO/IEC 14882:2024

21(1,2)

See § 33.5.4 [atomics.order] of ISO/IEC 14882:2024

22

See § 18.6 [concepts.object] of ISO/IEC 14882:2024

23(1,2,3,4,5,6,7,8,9)

See § 18.2 [concepts.equality] of ISO/IEC 14882:2024

24

See § 5.4.2 of IEEE 754-2019

25

See § 5.8 of IEEE 754-2019

26

See § 6.8.2 [basic.fundamental] of ISO/IEC 14882:2024