5. Type System

All values and operations in Tile IR are statically typed. This section defines Tile IR’s types, as well as their equivalence, layouts, and other type system details that may be relevant for DSL and compiler authors.

Notably, Tile IR is tensor-valued: all values are tensors. There are two concrete tensor types: tile, a pure tensor value, and view, a structured pointer to a tensor in memory. Element types are also used in the formulation of the type system, but they do not describe a value on their own.

5.1. Element Types

Element types are the native data types supported by Tile IR. By themselves they do not describe a value. Because Tile IR operates over tensors, these types describe the hardware-accelerated primitive values that a tensor can contain. They specify how a sequence of bits is to be interpreted. Each element type has an associated size: the number of bits required to represent it.

Note

Note that this is different from a potential storage size, which is specified by the data layout of the tensor which contains these values.

Element types come in two flavors: general-purpose fundamental types, which come without restrictions, and specialized alternative types, each of which comes with a set of restrictions.

5.1.1. Fundamental Types

Tile IR supports a set of general-purpose integer and floating-point types that are supported by all operations and have no restrictions. They can be contained in tensors of arbitrary rank and shape, and rank-0 values of these types can be treated as scalars.

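Fundamental Types

| Type                  | Sizes (bits)     | Description                                |
|-----------------------|------------------|--------------------------------------------|
| i1, i8, i16, i32, i64 | 1, 8, 16, 32, 64 | signless integer type of specified size    |
| f16, f32, f64         | 16, 32, 64       | IEEE floating-point type of specified size |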

Fundamental types are supported in all arithmetic operations.

Warning

Integer types are signless, i.e., the type does not encode whether the represented value is to be interpreted as signed or unsigned. For operations where this distinction is semantically meaningful, signedness is controlled via flags on each arithmetic operation.
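The effect of signless integers can be illustrated outside Tile IR. The following Python sketch takes one 8-bit pattern and interprets it both ways; the helper names are hypothetical and the signed reading assumes two's complement, which is not stated in this section.

```python
def as_unsigned(bits: int, width: int) -> int:
    """Interpret the low `width` bits as an unsigned integer."""
    return bits & ((1 << width) - 1)


def as_signed(bits: int, width: int) -> int:
    """Interpret the low `width` bits as a signed integer (two's complement assumed)."""
    value = bits & ((1 << width) - 1)
    return value - (1 << width) if value >> (width - 1) else value


pattern = 0xF0  # a single i8 bit pattern; the type itself carries no sign

unsigned_view = as_unsigned(pattern, 8)
signed_view = as_signed(pattern, 8)

# Division is one operation where the chosen interpretation changes the result,
# which is why such operations need a signedness flag.
unsigned_quotient = unsigned_view // 4
signed_quotient = signed_view // 4
```

The same bits yield different quotients depending on the interpretation, so the interpretation has to be supplied by the operation rather than by the type.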

5.1.2. Alternative Types

Tile IR also supports a set of non-standard but hardware-accelerated floating-point types. Due to the nature of these types and of the hardware, each comes with a set of restrictions.

Alternative Types

| Type | Size (bits) | Description                                                                                                             |
|------|-------------|-------------------------------------------------------------------------------------------------------------------------|
| tf32 | 32          | floating-point format with 8 bits for exponent and 10 bits for mantissa. Storage size is 4 bytes with 4-byte alignment |
| bf16 | 16          | floating-point format with 8 bits for exponent and 7 bits for mantissa                                                  |
| e4m3 | 8           | floating-point format with 4 bits for exponent and 3 bits for mantissa                                                  |
| e5m2 | 8           | floating-point format with 5 bits for exponent and 2 bits for mantissa                                                  |

Tensors of these types may be created, manipulated, loaded from global memory, and stored to global memory, but certain computations on them are restricted.
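As an illustration of the bit budgets in the table, the sketch below splits a value into bf16 sign, exponent, and mantissa fields. It relies on the fact that f32 also uses 8 exponent bits, so the top 16 bits of an f32 form a bf16 pattern. The helper names are hypothetical, and plain truncation is used only for simplicity; it is an assumption of this sketch, not the rounding behavior of any Tile IR conversion.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the 32-bit IEEE-754 single-precision pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bf16_fields(x: float):
    """Split the top 16 bits of the f32 pattern into (sign, exponent, mantissa).

    Dropping the low 16 bits truncates the mantissa (round toward zero).
    """
    bits = f32_bits(x) >> 16
    sign = bits >> 15              # 1 bit
    exponent = (bits >> 7) & 0xFF  # 8 bits
    mantissa = bits & 0x7F         # 7 bits
    return sign, exponent, mantissa


sign, exponent, mantissa = bf16_fields(-1.5)
```

For -1.5 the fields are sign 1, biased exponent 127, and mantissa 0b1000000, i.e. 1 + 8 + 7 = 16 bits in total.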

5.1.3. Floating-Point Conversion Semantics

When converting values to a floating-point type (via cuda_tile.ftof or cuda_tile.itof), the behavior for out-of-finite-range values and special values depends on the target type.

f16, f32, f64 (IEEE types) and bf16, tf32 (IEEE-like types): the closest representable value is selected according to the specified rounding mode. This may produce Inf when the source value exceeds the target’s finite range. NaN values are preserved.

e4m3, e5m2 (low-precision float types): use saturation-to-finite (satfinite) semantics, meaning the closest representable finite value is selected according to the specified rounding mode. Inf is never produced, even if the source value was Inf.

Table Floating-Point Conversion: Special Value and Saturation Behavior enumerates the behavior for the different supported floating-point types when converting various “corner case” values.

Floating-Point Conversion: Special Value and Saturation Behavior

| Target Type               | Out-of-Range Finite Source                     | Source is ±Inf | Source is NaN |
|---------------------------|------------------------------------------------|----------------|---------------|
| f16, f32, f64, bf16, tf32 | Nearest representable value (may produce Inf)  | ±Inf           | NaN           |
| e5m2                      | Nearest representable finite value (±MAX_NORM) | ±MAX_NORM      | NaN           |
| e4m3                      | Nearest representable finite value (±MAX_NORM) | ±MAX_NORM      | +MAX_NORM     |

Note

The e4m3 type does not support NaN; NaN inputs are converted to positive MAX_NORM.
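The satfinite rows of the table can be modeled with a small Python sketch. It covers only the corner cases (overflow, ±Inf, NaN) and does not model rounding of in-range values. The function name, the `has_nan` flag, and the constant `E4M3_MAX` are assumptions made for illustration; substitute the actual MAX_NORM of the target format.

```python
import math


def satfinite(x: float, max_norm: float, has_nan: bool) -> float:
    """Clamp x to [-max_norm, +max_norm]; Inf is never produced.

    NaN is preserved when the target can encode it, otherwise it becomes +max_norm.
    """
    if math.isnan(x):
        return x if has_nan else max_norm
    return max(-max_norm, min(max_norm, x))


E5M2_MAX = 57344.0  # 1.75 * 2**15, largest finite value with 5 exponent and 2 mantissa bits
E4M3_MAX = 448.0    # assumed placeholder for the e4m3 MAX_NORM

inf_to_e5m2 = satfinite(math.inf, E5M2_MAX, has_nan=True)
big_negative_to_e5m2 = satfinite(-1.0e9, E5M2_MAX, has_nan=True)
nan_to_e5m2 = satfinite(math.nan, E5M2_MAX, has_nan=True)
nan_to_e4m3 = satfinite(math.nan, E4M3_MAX, has_nan=False)
```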

Note

When an operation specifies both a rounding mode (including the approx and full variants of transcendentals such as cuda_tile.exp and cuda_tile.tanh) and FTZ (flush-to-zero) handling of subnormals, the rounding step is applied first and FTZ is applied to the rounded result.

5.2. Sub-Byte Types

Tile IR supports element types whose storage size is smaller than a byte. Currently this is f4E2M1FN (4-bit floating-point). Because memory is byte-addressable, sub-byte element values must be packed together into bytes for storage and accessed in groups whose total size is an integral number of bytes.

5.2.1. Load and Store Requirements

Loads and stores must refer to contiguous, byte-aligned regions of memory. For 4-bit elements, this means always having contiguous pairs of elements.

For view-based loads and stores, the view must have at least one dimension with unit stride, and the extent of that dimension must be a multiple of 2 (in general, the multiple needed to make the access byte-aligned).

For pointer-based loads and stores, contiguity is inferred from how the pointer tile was constructed. When the pointer tile is built from a base pointer plus statically known offsets, such as the offsets produced by cuda_tile.iota, the compiler can prove that consecutive lanes address consecutive nibbles within the same byte and emits a packed access. If the required contiguity cannot be established statically, the operation is rejected by the compiler.

5.2.2. Packing Order

Sub-byte elements are packed densely with little-endian nibble order: the element at the lower index occupies the lower bits of the byte. For 4-bit elements, elements i and i+1 (where i is a multiple of 2) are packed into a single byte such that element i occupies bits 3...0 and element i+1 occupies bits 7...4.

For example, a tensor_view<2xf4E2M1FN, strides=[1]> holding the values [0.5, 1.5] is stored as the single byte:

Bit position:    7   6   5   4   3   2   1   0
               +---+---+---+---+---+---+---+---+
               | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
               +---+---+---+---+---+---+---+---+
               |     1.5       |     0.5       |
               | upper nibble  | lower nibble  |

where 0.5 has the f4E2M1FN bit pattern 0001 and 1.5 has the bit pattern 0011.

The same packing order applies when a sub-byte tile is materialized from a byte tile via cuda_tile.unpack, or written back via cuda_tile.pack: the lower-indexed element of each pair occupies the lower bits of the corresponding byte.
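The packing order can be checked with a short Python model. `pack_nibbles` and `unpack_nibbles` are hypothetical helpers operating on raw 4-bit codes; they are not the cuda_tile.pack and cuda_tile.unpack operations, only a sketch of the bit layout described above.

```python
def pack_nibbles(codes: list[int]) -> bytes:
    """Pack 4-bit codes: element i goes to bits 3..0, element i+1 to bits 7..4."""
    if len(codes) % 2 != 0:
        raise ValueError("4-bit elements must come in contiguous pairs")
    if any(not 0 <= c <= 0xF for c in codes):
        raise ValueError("each code must fit in 4 bits")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(data: bytes) -> list[int]:
    """Inverse of pack_nibbles: the lower nibble comes first."""
    out = []
    for byte in data:
        out.append(byte & 0xF)
        out.append(byte >> 4)
    return out


# Bit patterns 0001 (0.5) and 0011 (1.5) from the example above.
packed = pack_nibbles([0b0001, 0b0011])
round_trip = unpack_nibbles(packed)
```

The packed byte is 0b00110001, matching the diagram: 1.5 in the upper nibble and 0.5 in the lower nibble.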

5.3. Pointers

Pointers, or values which contain memory addresses, are typed as pointers to a specific pointee type. A pointer points to a location in memory; the data at that location will be interpreted as being of the pointee element type when loaded.

Pointer arithmetic assumes the storage size of the pointee type for offset computations. Pointer types are parameterized by element types, i.e., nested pointer types are not supported. For details about converting between different pointer types, or integers, see cuda_tile.bitcast.

5.4. Tensor Types

A tensor is a multi-dimensional, rectangular array described by a shape and an element type. The shape is a vector that gives the number of elements along each axis of the tensor. The length of that vector is the rank of the tensor, i.e., the number of its dimensions. All tensors in Tile IR have a statically known rank. Tile IR has two kinds of tensor types: tiles and views.

5.4.1. Tile Type

A tile is a tensor with static shape, i.e., the extent across each dimension is known at compile time. See Syntax for the tile<MxNxKxE> assembly syntax.

Note

In Tile IR, all data values to be operated on are expressed as a tile. In particular, even scalar values are represented as a tile of rank zero.

Note

A tile of pointers is typically used to load from or store to a batch of locations. The tile of pointers defines the shape of the values that are loaded or stored, with one pointer element mapping to one scalar value loaded or stored. It does not imply any structure on the locations themselves. For example, two consecutive pointers in the tile may not point to consecutive locations in memory. The same location may even be present multiple times within a single tile of pointers. See Memory Model for a discussion of implications.

5.4.2. Tensor View

It is common that data in global memory follows a strided structure. For example, the widely adopted row-major or column-major layouts are strided. It is beneficial for the compiler to be aware of the strided layout of data in memory. Tile IR features a tensor view type to describe such structure in global memory.

Conceptually, a tensor view type describes an abstracted tensor of pointers. Like a regular tensor, it is described by a shape and the type of the elements it points to. In addition, it has a vector of striding factors that describe the relative positions of the locations its elements point to. If two elements are \(d\) positions apart in dimension \(i\) of the tensor view, the corresponding locations in memory are \(d * stride_i\) elements apart. The compiler can use this information to reason about access patterns and layouts of data in memory.

Values of type tensor view are typically never materialized in memory. Rather, they are stored as a compact description of a base location, shape, and striding factors. From this information, a tensor of pointers corresponding to the full view value can be computed using the following formula:

\[elem_{[i_0, ..., i_n]} = baseptr + \sum_{m=0}^{n} i_m * s_m\]

where the \(s_m\) are the striding factors of the tensor view and \(baseptr\) is the start address in global memory.
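The formula can be evaluated directly. The following Python sketch enumerates the element offsets (relative to the base pointer, in elements) for a small view; `view_offsets` is a hypothetical helper, and the row-major and column-major strides are example inputs chosen here.

```python
from itertools import product


def view_offsets(shape, strides):
    """Map each index tuple to sum(i_m * s_m), the offset from baseptr in elements."""
    return {
        idx: sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(extent) for extent in shape))
    }


# The same 2x3 shape with two different strided layouts.
row_major = view_offsets((2, 3), (3, 1))
col_major = view_offsets((2, 3), (1, 2))
```

With row-major strides, index (0, 1) is one element past the base; with column-major strides it is two elements past the base, since a step in dimension 1 skips a whole column of two rows.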

A tensor view supports dynamic extents in its shape and stride vectors; these are bound at runtime when the view is constructed using a cuda_tile.make_tensor_view operation. A tensor view cannot be directly used to access memory — it first needs to be divided into tiles of static size. See Subview Types for options to do so.

5.4.3. Subview Types

A tensor view is often too large to be loaded as a single tile for processing. Instead, it must first be subdivided into tiles. In Tile IR this is expressed using subview types.

Subviews describe a mapping from an index space to a space of statically sized tiles loaded from a tensor view. They define the index computations performed by cuda_tile.load_view_tko and cuda_tile.store_view_tko when accessing elements of a tensor view.

Tile IR currently provides a single subview for partitioning a view into a grid of non-overlapping tiles but is designed to support additional subview types in the future that support different indexing patterns.

Partition View

partition_view is a subview type that represents a view partitioned into a grid of non-overlapping tiles. The index space in this case is the position of the tile in the grid. Partition views are particularly useful in patterns like matrix multiplication, where a large tensor in global memory is traversed as non-overlapping tiles to form the final result. The partition view structure is created using the cuda_tile.make_partition_view constructor.

Formally, given a tensor view with shape \([S_0, \ldots, S_n]\) and strides \([st_0, \ldots, st_n]\) and a partition view with tile size \([T_0, \ldots, T_n]\), a load or store at position \([I_0, \ldots, I_n]\) accesses, for each in-tile index \([i_0, \ldots, i_n]\) with \(0 \le i_m < T_m\), the element at location

\[location_{[i_0, \ldots, i_n]} = baseptr + \sum_{m=0}^{n} (I_m \cdot T_m + i_m) \cdot st_m\]

The index space of a partition view covers all tiles that would contain at least one element within the bounds of the underlying tensor view. For example, with a tensor view of shape (64, 256) and a partition view of tile shape (128, 128), the index space of the partition will have a shape of (1, 2) (and not, for example, (0, 2)). Formally, the index space shape \([N_0, \ldots, N_n]\) is defined as

\[N_k = ceildiv(S_k, T_k)\]
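A minimal Python sketch of these two computations follows. It assumes that element \(i\) of the tile at grid position \(I\) sits at view coordinate \(I_m \cdot T_m + i_m\) in each dimension; the function names and the row-major strides in the usage lines are illustrative assumptions.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def partition_index_space(shape, tile):
    """N_k = ceildiv(S_k, T_k) for every dimension k."""
    return tuple(ceildiv(s, t) for s, t in zip(shape, tile))


def tile_element_offset(tile_index, elem_index, tile, strides):
    """Offset from baseptr (in elements) of in-tile element elem_index of tile tile_index."""
    return sum(
        (I * T + i) * st
        for I, T, i, st in zip(tile_index, tile, elem_index, strides)
    )


# The (64, 256) view with (128, 128) tiles from the text.
space = partition_index_space((64, 256), (128, 128))

# Element (3, 5) of tile (0, 1), assuming row-major strides (256, 1).
offset = tile_element_offset((0, 1), (3, 5), (128, 128), (256, 1))
```

The index space is (1, 2): the single tile row extends past the 64 valid rows, which is why partially out-of-bounds tiles must be handled by the load and store semantics.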

5.5. Type Equivalence

Tile IR does not provide means to name types. Equivalence of types hence is a purely structural property: Two types are considered equal if they are structurally identical.

Note that some types form a natural subtype relationship. A type with dynamic shapes or strides, such as a view, covers every value that an otherwise identical type with those dynamic extents replaced by static values would cover. Nevertheless, Tile IR treats these types as distinct.
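Structural equivalence can be modeled with value-based equality. In the Python sketch below, `TensorViewType` is a hypothetical stand-in for a view type, and `None` is used (as an assumption of this sketch) to mark a dynamic extent.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorViewType:
    """Hypothetical model of a view type; equality compares fields structurally."""
    shape: Tuple[Optional[int], ...]    # None marks a dynamic extent
    strides: Tuple[Optional[int], ...]  # None marks a dynamic stride
    element: str


static_a = TensorViewType((8, 8), (8, 1), "f32")
static_b = TensorViewType((8, 8), (8, 1), "f32")
dynamic = TensorViewType((None, 8), (8, 1), "f32")

same_structure = static_a == static_b    # structurally identical, hence equal
dynamic_vs_static = dynamic == static_a  # dynamic extent differs, hence distinct
```

Although every value of `static_a` would also fit `dynamic`, the two types compare unequal, mirroring the rule that Tile IR keeps them distinct.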

5.6. Type Reference

5.6.1. cuda_tile.gather_scatter_view

Gather/scatter view type

13.3

Parameters

  • tile_shape - tile shape 13.3

  • tensor_view - tensor view 13.3

  • sparse_dim - sparse dimension 13.3

  • padding_value - padding value 13.3

Description

!cuda_tile.gather_scatter_view represents a view into a tensor_view where one dimension is accessed using a sparse gather/scatter pattern while the remaining dimensions are accessed contiguously.

!cuda_tile.gather_scatter_view has the following specification:

  • Index space rank: as many dimensions as the underlying tensor_view.

  • Tile sizes: as specified by tile_shape.

It consists of:

  • tile_shape: an integer array that describes the shape of the tiles in the view.

  • tensor_view: the type of the tensor_view into which the view is looking.

  • sparse_dim: a non-negative integer that specifies which dimension to gather/scatter over. Must be strictly less than the rank of the tensor_view.

  • padding_value: an optional enum, specifying the value that should be used for out-of-bounds accesses (loads) into the tensor_view.

Supported padding values include:

  • zero: zero

  • neg_zero: negative zero

  • nan: NaN

  • pos_inf: positive infinity

  • neg_inf: negative infinity

Note

Only power-of-two tile dimensions are supported.

When loading or storing via a gather_scatter_view, the index at position sparse_dim must be a 1D tile whose size equals tile_shape[sparse_dim]. Each element of this tile is an independent index into the underlying tensor_view along that dimension. The loaded or stored tile will be the concatenation of the elements in the rows selected by the sparse dimension index. All other indices must be scalar tiles of the same element type, each choosing the offset in the underlying tensor_view along that dimension and selecting a contiguous block of tile_shape[dim] elements.

Examples:

// (1) A 1D gather/scatter view over an 8xf32 tensor_view with a tile
// size of 4 and sparse_dim=0. Since the view is 1D, the only index is
// the gather index: a size-4 1D tile of element indices. The table below
// visualizes a load with gather_indices=[6, 1, 4, 3]. Gathered elements
// are indicated with their index in the single loaded tile. ( ) marks
// elements that are not gathered.
//
//               8
// ←─────────────────────────────→
// ( ) (1) ( ) (3) (2) ( ) (0) ( )
//
// In pseudocode, we can say that loading with gather_indices=[6, 1, 4, 3]
// is equivalent to the following:
//   result = [tensor_view[6], tensor_view[1], tensor_view[4], tensor_view[3]]
//
!gsv_1d= !cuda_tile.gather_scatter_view<
  tile=(4),
  tensor_view<8xf32, strides=[1]>,
  sparse_dim=0
>

// (2) A 2D gather/scatter view over an 8x8xf32 tensor_view with tile
// size 4x4 and sparse_dim=0. The first index (at sparse_dim) is a size-4
// 1-D tile of row indices; the second index is a scalar selecting a
// block of 4 contiguous columns. The table below visualizes a load
// with gather_indices=[5, 1, 7, 3] and col_idx=0. Each number
// indicates the result row, and ( ) marks elements not gathered.
//
//                      8
//       ←─────────────────────────────→
//     ↑ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )    row 0: not gathered
//     │ (1) (1) (1) (1) ( ) ( ) ( ) ( )    row 1: → result row 1
//     │ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )    row 2: not gathered
//   8 │ (3) (3) (3) (3) ( ) ( ) ( ) ( )    row 3: → result row 3
//     │ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )    row 4: not gathered
//     │ (0) (0) (0) (0) ( ) ( ) ( ) ( )    row 5: → result row 0
//     │ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )    row 6: not gathered
//     ↓ (2) (2) (2) (2) ( ) ( ) ( ) ( )    row 7: → result row 2
//
!gsv_2d= !cuda_tile.gather_scatter_view<
  tile=(4x4),
  tensor_view<8x8xf32, strides=[8, 1]>,
  sparse_dim=0
>

// (3) A larger gather/scatter view with zero-padding for out-of-bounds
// accesses. If the index for dimension 1 is in the range [241, 255],
// the out-of-bounds elements will be filled with zero. If no padding value
// were set, the values would be unspecified. Out of bounds sparse dimension
// indices, in dimension 0, will yield a row of zero (the padding value).
// Likewise, the values of the entire row are unspecified if no padding
// value is set.
//
!gsv_2d_padded= !cuda_tile.gather_scatter_view<
  tile=(8x16),
  padding_value = zero,
  tensor_view<128x256xf32, strides=[256, 1]>,
  sparse_dim=0
>

// (4) Gather along the second dimension (sparse_dim=1). Here the
// first index is a scalar selecting which block of 8 rows, and
// the second index is a 1D tile of 16 column indices.
//
!gsv_2d_col= !cuda_tile.gather_scatter_view<
  tile=(8x16),
  tensor_view<128x256xf32, strides=[256, 1]>,
  sparse_dim=1
>

// (5) Scatter (store) using the same view type as example (2).
// Given a 4x4 tile to store and scatter_indices=[5, 1, 7, 3]
// with col_idx=0, row 0 of the tile is written to row 5 of the
// tensor_view, row 1 to row 1, row 2 to row 7, and row 3 to
// row 3. Columns 0-3 are written contiguously. The table below
// shows which tensor_view rows are written [*] and which are
// untouched ( ).
//
//                      8
//       ←─────────────────────────────→
//     ↑ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )    row 0: untouched
//     │ [1] [1] [1] [1] ( ) ( ) ( ) ( )    row 1: ← tile row 1
//     │ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )    row 2: untouched
//   8 │ [3] [3] [3] [3] ( ) ( ) ( ) ( )    row 3: ← tile row 3
//     │ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )    row 4: untouched
//     │ [0] [0] [0] [0] ( ) ( ) ( ) ( )    row 5: ← tile row 0
//     │ ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )    row 6: untouched
//     ↓ [2] [2] [2] [2] ( ) ( ) ( ) ( )    row 7: ← tile row 2
//
// In pseudocode:
//   tensor_view[5, 0:4] = tile[0, :]
//   tensor_view[1, 0:4] = tile[1, :]
//   tensor_view[7, 0:4] = tile[2, :]
//   tensor_view[3, 0:4] = tile[3, :]
//
!gsv_2d_scatter= !cuda_tile.gather_scatter_view<
  tile=(4x4),
  tensor_view<8x8xf32, strides=[8, 1]>,
  sparse_dim=0
>

The gather/scatter view index space shape equals the shape of the underlying tensor_view. In the above examples, !gsv_1d has an index space shape of 8, !gsv_2d has an index space shape of 8x8, and !gsv_2d_padded has an index space shape of 128x256.

Indices accessing the gather_scatter_view on non-sparse dimensions must be in-bounds, though the accessed tile itself may run out-of-bounds. Individual gather/scatter indices (at sparse_dim) may reference rows that are out-of-bounds of the underlying tensor_view. When tiles are partially out of bounds, operations behave as follows:

  • Load operations: If padding_value is set, out-of-bounds elements yield the padding value. If not set, out-of-bounds elements yield unspecified values.

  • Store operations: Out-of-bounds elements are masked during stores.

Note

Gather/Scatter sparse dimension indices do not need to be unique. Repeated sparse index values in a load operation result in the same row being loaded multiple times. Repeated sparse index values in a store operation result in undefined values in the repeated row.
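The load semantics can be modeled in plain Python over nested lists. The helpers below are hypothetical and cover only loads with sparse_dim=0; `None` stands in for the unspecified value returned when no padding_value is set, and the non-sparse index is treated as an element offset, as the index space description above suggests.

```python
def gather_load_1d(data, gather_indices, padding=None):
    """1D gather: result[k] = data[gather_indices[k]], or padding when out of bounds."""
    return [data[i] if 0 <= i < len(data) else padding for i in gather_indices]


def gather_load_rows(data, row_indices, col_offset, tile_cols, padding=None):
    """2D gather with sparse_dim=0: gathered rows, contiguous block of columns."""
    n_rows, n_cols = len(data), len(data[0])
    tile = []
    for r in row_indices:
        row = []
        for c in range(col_offset, col_offset + tile_cols):
            in_bounds = 0 <= r < n_rows and 0 <= c < n_cols
            row.append(data[r][c] if in_bounds else padding)
        tile.append(row)
    return tile


values = [100 + i for i in range(8)]
gathered_1d = gather_load_1d(values, [6, 1, 4, 3])

# matrix[r][c] == 10 * r + c, so each entry reveals its source position.
matrix = [[10 * r + c for c in range(8)] for r in range(8)]
gathered_2d = gather_load_rows(matrix, [5, 1, 7, 3], col_offset=0, tile_cols=4)

# Row index 9 is out of bounds, and columns 8 and 9 do not exist.
padded = gather_load_rows(matrix, [9, 0], col_offset=6, tile_cols=4, padding=0.0)
```

`gathered_2d` reproduces example (2): result row 0 comes from view row 5, result row 1 from view row 1, and so on. In `padded`, the whole first row and the last two columns of the second row are filled with the padding value.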

5.6.2. cuda_tile.partition_view

Partition view type

13.1

Parameters

  • tile_shape - tile shape 13.1

  • tensor_view - tensor view 13.1

  • dim_map - dimension mapping 13.1

  • padding_value - padding value 13.1

Description

!cuda_tile.partition_view represents a view into a tensor_view where tiles are laid out in a grid pattern across the original tensor_view. The grid is aligned with the start of each dimension and there are no gaps or overlaps between tiles.

!cuda_tile.partition_view has the following specification:

  • Index space rank: as many dimensions as the underlying tensor_view.

  • Tile sizes: as specified by tile_shape.

It consists of:

  • tile_shape: a dense integer array that describes the shape of the tiles in the view.

  • tensor_view: the type of the tensor_view into which the view is looking.

  • dim_map: an integer array that specifies for each tile dimension the corresponding dimension in the underlying tensor_view.

  • padding_value: an optional enum, specifying the value that should be used for out-of-bounds accesses (loads) into the tensor_view.

Supported padding values include:

  • zero: zero

  • neg_zero: negative zero

  • nan: NaN

  • pos_inf: positive infinity

  • neg_inf: negative infinity

Note

Only power-of-two tile dimensions are supported.

Examples:

// (1) A view into a 16xf32 tensor_view with a tile size of 2. The table
// below visualizes for each element of the tensor_view the corresponding
// tile, as indicated by its index.
//
//                               16
// ←─────────────────────────────────────────────────────────────→
// (0) (0) (1) (1) (2) (2) (3) (3) (4) (4) (5) (5) (6) (6) (7) (7)
//
!pv_1d= !cuda_tile.partition_view<
  tile=(2),
  tensor_view<16xf32, strides=[1]>
>

// (2) A view into a 64x16xf32 tensor_view with a tile size of 4x2. By
// convention, in the below table, the Y axis corresponds to the first
// tensor_view dimension and the X axis corresponds to the second one.
//
//                                   16
//       ←────────────────────────────────────────────────────────── ...
//     ↑ (0,0) (0,0) (0,1) (0,1) (0,2) (0,2) (0,3) (0,3) (0,4) (0,4) ...
//     │ (0,0) (0,0) (0,1) (0,1) (0,2) (0,2) (0,3) (0,3) (0,4) (0,4) ...
//     │ (0,0) (0,0) (0,1) (0,1) (0,2) (0,2) (0,3) (0,3) (0,4) (0,4) ...
//     │ (0,0) (0,0) (0,1) (0,1) (0,2) (0,2) (0,3) (0,3) (0,4) (0,4) ...
//  64 │ (1,0) (1,0) (1,1) (1,1) (1,2) (1,2) (1,3) (1,3) (1,4) (1,4) ...
//     │ (1,0) (1,0) (1,1) (1,1) (1,2) (1,2) (1,3) (1,3) (1,4) (1,4) ...
//     │ (1,0) (1,0) (1,1) (1,1) (1,2) (1,2) (1,3) (1,3) (1,4) (1,4) ...
//     │ (1,0) (1,0) (1,1) (1,1) (1,2) (1,2) (1,3) (1,3) (1,4) (1,4) ...
//     │ (2,0) (2,0) (2,1) (2,1) (2,2) (2,2) (2,3) (2,3) (2,4) (2,4) ...
//    ...
//
!pv_2d= !cuda_tile.partition_view<
  tile=(4x2),
  tensor_view<64x16xf32, strides=[16, 1]>
>

// (3) A view into a 64x16xf32 tensor_view with a tile size of 4x2. The
// first tile dimension is mapped to the second tensor_view dimension. The
// second tile dimension is mapped to the first tensor_view dimension.
//
//                                   16
//       ←────────────────────────────────────────────────────────── ...
//     ↑ (0,0) (0,0) (0,0) (0,0) (1,0) (1,0) (1,0) (1,0) (2,0) (2,0) ...
//     │ (0,0) (0,0) (0,0) (0,0) (1,0) (1,0) (1,0) (1,0) (2,0) (2,0) ...
//     │ (0,1) (0,1) (0,1) (0,1) (1,1) (1,1) (1,1) (1,1) (2,1) (2,1) ...
//  64 │ (0,1) (0,1) (0,1) (0,1) (1,1) (1,1) (1,1) (1,1) (2,1) (2,1) ...
//     │ (0,2) (0,2) (0,2) (0,2) (1,2) (1,2) (1,2) (1,2) (2,2) (2,2) ...
//     │ (0,2) (0,2) (0,2) (0,2) (1,2) (1,2) (1,2) (1,2) (2,2) (2,2) ...
//    ...
//
!pv_2d_transposed= !cuda_tile.partition_view<
  tile=(4x2),
  tensor_view<64x16xf32, strides=[16, 1]>,
  dim_map=[1, 0]
>

// Note: A load from partition_view with non-default dim_map is
// semantically identical to a load with default dim_map followed by a
// permutation.
//
// %0 = load_view_tko ... %view[%a, %b]
//     : partition_view<tile=(4x2), ..., dim_map=[1, 0]> -> tile<4x2xf32>
//
// Is identical to:
//
// %0 = load_view_tko ... %view[%b, %a]
//     : partition_view<tile=(2x4), ..., dim_map=[0, 1]> -> tile<2x4xf32>
// %1 = permute %0 [1, 0] : tile<2x4xf32> -> tile<4x2xf32>

The partition view index space is determined by the tile_shape, the tensor_view shape and dim_map. In the above examples, !pv_2d has an index space shape of 16x8, whereas !pv_2d_transposed has an index space shape of 4x32.

Indices into the partition view must lie within the index space of the partition view. Otherwise, the behavior is undefined. For example, loading the tile at index (0, 8) from a partition view of type !pv_2d above is invalid, because the maximum index in dimension 1 is 7.
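The index space shapes quoted above can be reproduced with a short Python sketch. It assumes that index space dimension k has extent ceildiv(S, T), where T is tile_shape[k] and S is the extent of the tensor_view dimension selected by dim_map[k]; this reading is consistent with the examples but the function itself is hypothetical.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def partition_view_index_space(view_shape, tile_shape, dim_map=None):
    """Index space shape of a partition view; dim_map defaults to the identity."""
    rank = len(tile_shape)
    if dim_map is None:
        dim_map = list(range(rank))
    return tuple(
        ceildiv(view_shape[dim_map[k]], tile_shape[k]) for k in range(rank)
    )


pv_1d = partition_view_index_space((16,), (2,))
pv_2d = partition_view_index_space((64, 16), (4, 2))
pv_2d_transposed = partition_view_index_space((64, 16), (4, 2), dim_map=[1, 0])

# Valid indices in dimension k run from 0 to extent - 1.
max_index_dim1 = pv_2d[1] - 1
```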

While partition view indices must be in-bounds, the accessed tile itself may run out-of-bounds; that is, it may only partially overlap with the underlying tensor_view. Tiles cannot be fully outside of the underlying tensor_view because that would require the partition view indices to lie outside the partition view index space. When tiles are partially out of bounds, operations behave as follows:

  • Load operations: If padding_value is set, out-of-bounds tile elements yield the padding value. If not set, out-of-bounds elements yield unspecified values.

  • Store operations: Out-of-bounds tile elements are masked during stores.

Example:

// (4) A view into a 8x2xf32 tensor_view with a tile size of 1x4 and NaN
// padding. The right half of the below table consists of padded NaN
// values.
//
//            2
//       ←─────────→
//     ↑ (0,0) (0,0) (0,0) (0,0)
//     │ (1,0) (1,0) (1,0) (1,0)
//   8 │ (2,0) (2,0) (2,0) (2,0)
//     │ (3,0) (3,0) (3,0) (3,0)
//     │ (4,0) (4,0) (4,0) (4,0)
//    ...
//
!pv_2d_padded= !cuda_tile.partition_view<
  tile=(1x4),
  padding_value = nan,
  tensor_view<8x2xf32, strides=[2,1]>
>
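The padded load in example (4) can be modeled in Python over a nested list. `load_partition_tile` is a hypothetical helper that assumes the default dim_map and a 2D view; it is a sketch of the padding semantics, not of how the load is implemented.

```python
import math


def load_partition_tile(data, tile_index, tile_shape, padding):
    """Load the tile at tile_index; out-of-bounds elements yield the padding value."""
    rows, cols = len(data), len(data[0])
    r0 = tile_index[0] * tile_shape[0]
    c0 = tile_index[1] * tile_shape[1]
    return [
        [data[r][c] if r < rows and c < cols else padding
         for c in range(c0, c0 + tile_shape[1])]
        for r in range(r0, r0 + tile_shape[0])
    ]


# An 8x2 view whose element (r, c) holds 2*r + c, loaded with 1x4 tiles and NaN padding.
view = [[float(2 * r + c) for c in range(2)] for r in range(8)]
tile = load_partition_tile(view, (3, 0), (1, 4), math.nan)
```

The loaded tile holds the two valid elements of row 3 followed by two NaN values, matching the right half of the table in example (4).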

5.6.3. cuda_tile.ptr

Pointer type

13.1

Parameters

  • pointeeType - f16 or bf16 or f32 or tf32 or f64 or f8E4M3FN or f8E5M2 or f8E8M0FNU or f4E2M1FN or i1 or i8 or i16 or i32 or i64 13.1

Description

An element pointer type represents a single location in global device memory. Pointer types are typed, i.e., they carry the type they point to. Any numeric type can be used as the pointee type.

5.6.4. cuda_tile.strided_view

Strided view type

13.3

Parameters

  • tile_shape - tile shape 13.3

  • traversal_strides - traversal strides 13.3

  • tensor_view - tensor view 13.3

  • dim_map - dimension mapping 13.3

  • padding_value - padding value 13.3

Description

!cuda_tile.strided_view represents a view into a tensor_view where tiles are laid out in a grid pattern across the original tensor_view. The grid is aligned with the start of each dimension, but the grid pattern’s striding factor is parametric, allowing interleaved or overlapping tiles.

!cuda_tile.strided_view has the following specification:

  • Index space rank: as many dimensions as the underlying tensor_view.

  • Tile sizes: as specified by tile_shape.

It consists of:

  • tile_shape: an integer array that describes the shape of the tiles in the view.

  • traversal_strides: an integer array that describes the traversal strides when traversing the underlying tensor_view. For example, if the shape of tiles is 2x2 and the traversal strides are [3,3], the tensor_view will be traversed with a gap of one element between each tile. Alternatively, if the traversal strides are [1,1], the tensor_view will be traversed with an overlapping sliding window, advancing by one element for each tile.

  • tensor_view: the type of the tensor_view into which the view is looking.

  • dim_map: an integer array that specifies for each tile dimension the corresponding dimension in the underlying tensor_view.

  • padding_value: an optional enum, specifying the value that should be used for out-of-bounds accesses (loads) into the tensor_view.

Supported padding values include:

  • zero: zero

  • neg_zero: negative zero

  • nan: NaN

  • pos_inf: positive infinity

  • neg_inf: negative infinity

Note

Only power-of-two tile dimensions are supported. In contrast, traversal strides can be any strictly positive value.

Examples:

// (1) A view into a 16xf32 tensor_view with a tile size of 2 and a
// traversal stride of 2. This behavior is identical to PartitionView. The
// table below visualizes for each element of the tensor_view the
// corresponding tile, as indicated by its index.
//
//                               16
// ←─────────────────────────────────────────────────────────────→
// (0) (0) (1) (1) (2) (2) (3) (3) (4) (4) (5) (5) (6) (6) (7) (7)
//
!sv_1d_tra2= !cuda_tile.strided_view<
  tile=(2),
  traversal_strides=[2],
  tensor_view<16xf32, strides=[1]>
>

// (2) A view into a 16xf32 tensor_view with a tile size of 2 and a
// traversal stride of 3. This time, one out of three elements is
// skipped, as the stride moves three elements while the tile only
// covers two. Notice that tile 5 is partially out of bounds.
//
//                               16
// ←─────────────────────────────────────────────────────────────→
// (0) (0) ( ) (1) (1) ( ) (2) (2) ( ) (3) (3) ( ) (4) (4) ( ) (5) (5)
//
!sv_1d_tra3= !cuda_tile.strided_view<
  tile=(2),
  traversal_strides=[3],
  tensor_view<16xf32, strides=[1]>
>

// (3) A view into a 8xf32 tensor_view with a tile size of 2 and a
// traversal stride of 1. This time, tiles are overlapping, and the
// same element will be present in multiple tiles. For example,
// element 1 of the tensor view is present both as the second element of
// tile 0 and as the first element of tile 1. Tile 7 is partially out of bounds.
// (Note the space is still 1-dimensional; the second row indicates overlap,
// not a second dimension.)
//
//               8
// ←─────────────────────────────→
// (0) (0) (2) (2) (4) (4) (6) (6)
//     (1) (1) (3) (3) (5) (5) (7) (7)
//
!sv_1d_tra1= !cuda_tile.strided_view<
  tile=(2),
  traversal_strides=[1],
  tensor_view<8xf32, strides=[1]>
>

// (4) A view into a 64x16xf32 tensor_view with a tile size of 4x2 and
// traversal strides of [4,3]. Tiles are adjacent in the first dimension,
// but have a gap of one element in the second dimension. By convention,
// in the below table, the Y axis corresponds to the first tensor_view
// dimension and the X axis corresponds to the second one.
//
//                                   16
//       ←────────────────────────────────────────────────────────── ...
//     ↑ (0,0) (0,0) (   ) (0,1) (0,1) (   ) (0,2) (0,2) (   ) (0,3) ...
//     │ (0,0) (0,0) (   ) (0,1) (0,1) (   ) (0,2) (0,2) (   ) (0,3) ...
//     │ (0,0) (0,0) (   ) (0,1) (0,1) (   ) (0,2) (0,2) (   ) (0,3) ...
//     │ (0,0) (0,0) (   ) (0,1) (0,1) (   ) (0,2) (0,2) (   ) (0,3) ...
//  64 │ (1,0) (1,0) (   ) (1,1) (1,1) (   ) (1,2) (1,2) (   ) (1,3) ...
//     │ (1,0) (1,0) (   ) (1,1) (1,1) (   ) (1,2) (1,2) (   ) (1,3) ...
//     │ (1,0) (1,0) (   ) (1,1) (1,1) (   ) (1,2) (1,2) (   ) (1,3) ...
//     │ (1,0) (1,0) (   ) (1,1) (1,1) (   ) (1,2) (1,2) (   ) (1,3) ...
//     │ (2,0) (2,0) (   ) (2,1) (2,1) (   ) (2,2) (2,2) (   ) (2,3) ...
//    ...
//
!sv_2d= !cuda_tile.strided_view<
  tile=(4x2),
  traversal_strides=[4,3],
  tensor_view<64x16xf32, strides=[16, 1]>
>

// (5) A view into a 64x16xf32 tensor_view with a tile size of 4x2 and
// traversal strides of [4,3], as above, but with dim_map=[1, 0].
// (The first tile dimension is mapped to the second tensor_view dimension
// and the second tile dimension is mapped to the first tensor_view dimension.)
// Tiles are adjacent in the second dimension, but have a gap of one element in
// the first dimension.
//
//                                   16
//       ←────────────────────────────────────────────────────────── ...
//     ↑ (0,0) (0,0) (0,0) (0,0) (1,0) (1,0) (1,0) (1,0) (2,0) (2,0) ...
//     │ (0,0) (0,0) (0,0) (0,0) (1,0) (1,0) (1,0) (1,0) (2,0) (2,0) ...
//     │ (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) ...
//     │ (0,1) (0,1) (0,1) (0,1) (1,1) (1,1) (1,1) (1,1) (2,1) (2,1) ...
//  64 │ (0,1) (0,1) (0,1) (0,1) (1,1) (1,1) (1,1) (1,1) (2,1) (2,1) ...
//     │ (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) ...
//     │ (0,2) (0,2) (0,2) (0,2) (1,2) (1,2) (1,2) (1,2) (2,2) (2,2) ...
//     │ (0,2) (0,2) (0,2) (0,2) (1,2) (1,2) (1,2) (1,2) (2,2) (2,2) ...
//     │ (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) (   ) ...
//    ...
//
!sv_2d_transposed= !cuda_tile.strided_view<
  tile=(4x2),
  traversal_strides=[4,3],
  tensor_view<64x16xf32, strides=[16, 1]>,
  dim_map=[1, 0]
>

// Note: A load from a strided_view with non-default dim_map is
// semantically identical to a load with default dim_map followed by a
// permutation.
//
// %0 = load_view_tko ... %view[%a, %b]
//     : strided_view<tile=(4x2), traversal_strides=[4,2], ...,
//                    dim_map=[1, 0]> -> tile<4x2xf32>
//
// Is identical to:
//
// %0 = load_view_tko ... %view[%b, %a]
//     : strided_view<tile=(2x4), traversal_strides=[2,4], ...,
//                    dim_map=[0, 1]> -> tile<2x4xf32>
// %1 = permute %0 [1, 0] : tile<2x4xf32> -> tile<4x2xf32>
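
The equivalence in the note above can be checked with a small Python model. This is only a sketch under assumptions: `load_tile` and `permute_2d` are hypothetical helpers, tensors are nested lists, all accesses are in-bounds, and tile dimension d is assumed to walk tensor dimension dim_map[d] starting at index[d] * traversal_strides[d].

```python
def load_tile(tensor, tile, strides, dim_map, index):
    """Load one in-bounds 2-D tile; tile dim d walks tensor dim dim_map[d]."""
    rows = []
    for p in range(tile[0]):
        row = []
        for q in range(tile[1]):
            coord = [0, 0]
            coord[dim_map[0]] = index[0] * strides[0] + p
            coord[dim_map[1]] = index[1] * strides[1] + q
            row.append(tensor[coord[0]][coord[1]])
        rows.append(row)
    return rows


def permute_2d(t):
    """Swap the two dimensions of a nested-list tile (permutation [1, 0])."""
    return [list(col) for col in zip(*t)]


# An 8x8 tensor whose element value encodes its own (row, column) position.
tensor = [[10 * r + c for c in range(8)] for r in range(8)]
a, b = 1, 2

# Load through a view with dim_map=[1, 0] ...
direct = load_tile(tensor, (4, 2), (4, 2), (1, 0), (a, b))
# ... versus a default dim_map load with swapped tile, strides and indices,
# followed by a permutation.
via_permute = permute_2d(load_tile(tensor, (2, 4), (2, 4), (0, 1), (b, a)))

assert direct == via_permute
```

Both paths read the same 4x2 block of tensor elements; only the order in which the tile dimensions are walked differs.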

The strided view index space is determined by the traversal_strides, the tensor_view shape, and dim_map. Partial tiles at the edges of the tensor_view are included in the index space. In the above examples, !sv_1d_tra2 has an index space shape of 8, whereas !sv_1d_tra3 has an index space shape of 6. Similarly, !sv_2d has an index space shape of 16x6, and !sv_2d_transposed has an index space shape of 4x22.

Indices accessing the strided view must lie within the index space of the strided view. Otherwise, the behavior is undefined. For example, loading the tile at index (0, 6) from a strided view of type !sv_2d above is invalid, as the maximum index in dimension 1 is 5.
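
The index space shapes quoted above are consistent with a ceiling division of each tensor_view extent by the matching traversal stride. The following Python sketch reproduces them; the helper names are hypothetical, and the formula is inferred from the examples in this section rather than taken from a normative definition.

```python
def index_space_shape(tensor_shape, traversal_strides, dim_map=None):
    """Index space of a strided view, counting partial edge tiles.

    View dimension d is assumed to traverse tensor dimension dim_map[d].
    """
    if dim_map is None:
        dim_map = list(range(len(traversal_strides)))
    return tuple(
        -(-tensor_shape[dim_map[d]] // traversal_strides[d])  # ceiling division
        for d in range(len(traversal_strides))
    )


def in_index_space(index, shape):
    """True if every index component lies inside the index space."""
    return all(0 <= i < n for i, n in zip(index, shape))


sv_2d = index_space_shape((64, 16), (4, 3))
sv_2d_transposed = index_space_shape((64, 16), (4, 3), dim_map=(1, 0))

print(sv_2d)             # (16, 6)
print(sv_2d_transposed)  # (4, 22)
print(in_index_space((0, 6), sv_2d))  # False: the maximum index in dimension 1 is 5
```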

While strided view indices must be in-bounds, the accessed tile itself may run out-of-bounds, i.e., it may only partially overlap with the underlying tensor_view. Tiles cannot be fully outside of the underlying tensor_view because that would require the strided view indices to lie outside of the strided view index space. When a tile is partially out of bounds, operations behave as follows:

  • Load operations: If padding_value is set, out-of-bounds tile elements yield the padding value. If not set, out-of-bounds elements yield unspecified values.

  • Store operations: Out-of-bounds tile elements are masked during stores.

Example:

// (6) A view into an 8x2xf32 tensor_view with a tile size of 1x4 and NaN
// padding. The right half of the below table consists of padded NaN
// values.
//
//            2
//       ←─────────→
//     ↑ (0,0) (0,0) (0,0) (0,0)
//     │ (1,0) (1,0) (1,0) (1,0)
//   8 │ (2,0) (2,0) (2,0) (2,0)
//     │ (3,0) (3,0) (3,0) (3,0)
//     │ (4,0) (4,0) (4,0) (4,0)
//    ...
//
!sv_2d_padded= !cuda_tile.strided_view<
  tile=(1x4),
  traversal_strides=[1,4],
  padding_value = nan,
  tensor_view<8x2xf32, strides=[2,1]>
>
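
The padding and masking rules can be emulated for example (6) with a short Python sketch. The helpers `load_tile_padded` and `store_tile_masked` are hypothetical and model the tensor_view as a nested list. The sketch assumes a padding value is set; without one, the out-of-bounds elements would be unspecified rather than NaN.

```python
import math


def load_tile_padded(tensor, tile, strides, index, padding_value):
    """Load a 2-D tile; elements outside the tensor yield padding_value."""
    n_rows, n_cols = len(tensor), len(tensor[0])
    r0, c0 = index[0] * strides[0], index[1] * strides[1]
    return [
        [
            tensor[r0 + p][c0 + q]
            if (r0 + p < n_rows and c0 + q < n_cols)
            else padding_value
            for q in range(tile[1])
        ]
        for p in range(tile[0])
    ]


def store_tile_masked(tensor, tile_value, strides, index):
    """Store a 2-D tile; elements outside the tensor are silently dropped."""
    n_rows, n_cols = len(tensor), len(tensor[0])
    r0, c0 = index[0] * strides[0], index[1] * strides[1]
    for p, row in enumerate(tile_value):
        for q, value in enumerate(row):
            if r0 + p < n_rows and c0 + q < n_cols:
                tensor[r0 + p][c0 + q] = value


# 8x2 tensor, tile 1x4, traversal strides [1, 4], as in example (6).
tensor = [[float(2 * r + c) for c in range(2)] for r in range(8)]

tile = load_tile_padded(tensor, (1, 4), (1, 4), (3, 0), math.nan)
# tile holds [6.0, 7.0] followed by two NaN padding elements.

store_tile_masked(tensor, [[-1.0, -2.0, -3.0, -4.0]], (1, 4), (3, 0))
# Only the two in-bounds elements are written: tensor[3] is now [-1.0, -2.0].
```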

5.6.5. cuda_tile.tensor_view#

Tensor view type

13.1

Parameters#

  • elementType - f16 or bf16 or f32 or tf32 or f64 or f8E4M3FN or f8E5M2 or f8E8M0FNU or f4E2M1FN or i1 or i8 or i16 or i32 or i64 13.1

  • shape - shape of the tensor view 13.1

  • strides - strides of the tensor view 13.1

Description#

!cuda_tile.tensor_view represents a reference to a tensor in global memory.

It consists of:

  • elementType: the type of the elements in the tensor_view.

  • shape: an integer array that specifies the size of each dimension. Sizes must be strictly positive.

  • strides: an integer array that describes the stride of each dimension. The stride is the number of elements to offset in memory when increasing the corresponding index by one. Strides must be strictly positive.

The shape and the stride can be dynamic on a per-dimension basis. In those cases, their values are printed as ?.

Note

Only power-of-two tile dimensions are supported.

Note

4-bit element types are packed densely in global memory. Tensor views with 4-bit element types must have at least one dimension with a unit stride and such dimensions must have an even number of elements to ensure byte-alignment.

Elements i and i+1 (where i is a multiple of 2) are packed into a byte as follows: element i is stored in bits 3...0 and element i+1 is stored in bits 7...4. This corresponds to a little-endian nibble order.

Example: A tensor_view<2xf4E2M1FN, strides=[1]> with values [0.5, 1.5] is stored as:

Bit position:    7     6     5     4     3     2     1     0
              ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
              │  0  │  0  │  1  │  1  │  0  │  0  │  0  │  1  │
              └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
              ├──── upper nibble ────┤ ├──── lower nibble ────┤
              │         1.5          │ │         0.5          │
              └──────────────────────┘ └──────────────────────┘

Bit pattern of 0.5 in f4E2M1FN: 0001
Bit pattern of 1.5 in f4E2M1FN: 0011
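
The nibble order can be reproduced with a few lines of Python. This is an illustrative sketch only: `pack_nibbles` and `unpack_nibbles` are hypothetical helpers that operate on raw 4-bit codes, not on floating-point values.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes densely: element i in bits 3..0, element i+1 in bits 7..4."""
    if len(codes) % 2 != 0:
        raise ValueError("an even number of 4-bit elements is required")
    out = bytearray()
    for i in range(0, len(codes), 2):
        low, high = codes[i], codes[i + 1]
        out.append((high & 0xF) << 4 | (low & 0xF))
    return bytes(out)


def unpack_nibbles(data):
    """Inverse of pack_nibbles: the lower nibble comes first."""
    codes = []
    for byte in data:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


# 0.5 is 0b0001 and 1.5 is 0b0011 in f4E2M1FN (bit patterns from the example).
packed = pack_nibbles([0b0001, 0b0011])
print(format(packed[0], "08b"))  # 00110001
print(unpack_nibbles(packed))    # [1, 3]
```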

Examples:

// A 512x1024 global memory tensor in row-major (lexicographic) order.
!cuda_tile.tensor_view<512x1024xf16, strides=[1024, 1]>

// A 512x1024 global memory tensor in column-major (colexicographic) order.
!cuda_tile.tensor_view<512x1024xf16, strides=[1, 512]>

// A 512x1024 global memory tensor that enumerates the same memory location
// multiple times.
!cuda_tile.tensor_view<512x1024xf16, strides=[1, 1]>

// A 32x16x32 global memory tensor that is neither row-major nor
// column-major.
!cuda_tile.tensor_view<32x16x32xf16, strides=[512, 1, 16]>

// A ?x? global memory tensor with a unit stride at the last dimension.
!cuda_tile.tensor_view<?x?xf16, strides=[?, 1]>

// A ?x16 global memory tensor with a unit stride at the first dimension.
!cuda_tile.tensor_view<?x16xf32, strides=[1, ?]>
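
The static stride examples above can be checked numerically. The Python sketch below uses two hypothetical helpers and assumes that an element's offset from the base of the tensor_view, measured in elements, is the dot product of its index and the strides, as the definition of strides suggests.

```python
from itertools import product


def element_offset(index, strides):
    """Offset (in elements) of an index: the sum of index[d] * strides[d]."""
    return sum(i * s for i, s in zip(index, strides))


def distinct_offsets(shape, strides):
    """Set of all element offsets reachable through the view."""
    return {
        element_offset(idx, strides)
        for idx in product(*(range(n) for n in shape))
    }


# Row-major vs. column-major 512x1024 layouts.
print(element_offset((2, 3), (1024, 1)))  # 2051
print(element_offset((2, 3), (1, 512)))   # 1538

# strides=[1, 1]: different indices alias the same memory location.
print(element_offset((0, 1), (1, 1)) == element_offset((1, 0), (1, 1)))  # True

# strides=[512, 1, 16] on a 32x16x32 shape: every element has its own offset.
offsets = distinct_offsets((32, 16, 32), (512, 1, 16))
print(len(offsets), max(offsets))  # 16384 16383
```

In this model, the 32x16x32 example is a dense layout with permuted dimensions: all 16384 elements map to distinct offsets in the range 0 to 16383.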

5.6.6. cuda_tile.tile#

Tile type

13.1

Parameters#

  • shape - shape of the tile 13.1

  • elementType - f16 or bf16 or f32 or tf32 or f64 or f8E4M3FN or f8E5M2 or f8E8M0FNU or f4E2M1FN or i1 or i8 or i16 or i32 or i64 or Pointer type or i4 13.1

Description#

A tile is a value type that has a shape and an element type. The shape of the tile must be fully static. All elements of the tile have the same element type. Any numeric type or pointer type can be used as an element type.

Only power-of-two shape dimensions are supported.

Examples:

!cuda_tile.tile<8x4xf32>

!cuda_tile.tile<4x!cuda_tile.ptr<i8>>

5.6.7. cuda_tile.token#

Cuda tile token type

13.1

Parameters#

No parameters.

Description#

Tokens are not runtime values. Their purpose is to explicitly represent ordering constraints between token-ordered operations executed within a tile block.

Within a tile block, if two operations A -> B are connected by a token chain, the compiler guarantees that execution behaves as if all effects of A are visible to B. If two operations are not connected by a token chain, the compiler may arbitrarily reorder them as long as SSA dominance is not violated.

Example:

%val = ... : tile<64xf32>
%tok0 = store_ptr_tko weak %ptr, %val -> token
%data, %tok1 = load_ptr_tko weak %ptr -> tile<64xf32>, token

The above example has a race condition. The store and load operations operate on the same memory location, and the absence of a token dependency means that the effects of the store operation may or may not be visible to the load operation. (Or they may be partly visible.) To remove the race condition, the store and load operations could be connected by a token.

If two operations A -> B are connected by a token chain, but you can prove that the effects of A are independent from the effects of B, then you can remove the apparent token chain because it will still be as-if the effects of A are visible before the effects of B.

Example:

// Assume that %ptr0 and %ptr1 are non-aliasing (non-overlapping) pointers.
%val = ... : tile<64xf32>
%tok1 = store_ptr_tko weak %ptr0, %val token=%tok0 -> token
%data, %tok2 = load_ptr_tko weak %ptr1 token=%tok1 -> tile<64xf32>, token
use(%tok2)

In the above example, %ptr0 and %ptr1 are non-aliasing. Therefore, the effect of the store operation is irrelevant to the load operation. Rewriting the IR as follows preserves the semantics:

%val = ... : tile<64xf32>
%tok1 = store_ptr_tko weak %ptr0, %val token=%tok0 -> token
%data, %tok2 = load_ptr_tko weak %ptr1 token=%tok0 -> tile<64xf32>, token
%tok3 = join_tokens %tok1, %tok2 : token
use(%tok3)

Note: This is only possible because of the non-overlapping pointers, and because the memory operations are both weak, so no external thread could observe the reordering either.
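
As a toy illustration of what "connected by a token chain" means, the three IR snippets in this section can be modeled in Python as small dependency graphs. This is a sketch under assumptions: the `token_ordered` helper and the dictionary encoding are hypothetical, and the check is a plain reachability test, not the compiler's actual analysis.

```python
def token_ordered(ops, first, second):
    """True if `second` depends on `first` through a chain of tokens.

    `ops` maps an operation name to (tokens consumed, token produced).
    """
    producer = {out: name for name, (_, out) in ops.items() if out is not None}
    stack, seen = [second], set()
    while stack:
        op = stack.pop()
        for tok in ops[op][0]:
            pred = producer.get(tok)  # None for tokens defined outside the snippet
            if pred is None or pred in seen:
                continue
            if pred == first:
                return True
            seen.add(pred)
            stack.append(pred)
    return False


# Racy example: the load does not consume the store's token.
racy = {"store": ((), "tok0"), "load": ((), "tok1")}

# Chained example: the load consumes the token produced by the store.
chained = {"store": (("tok0",), "tok1"), "load": (("tok1",), "tok2")}

# Rewritten example: store and load are independent, join_tokens merges both.
rewritten = {
    "store": (("tok0",), "tok1"),
    "load": (("tok0",), "tok2"),
    "join": (("tok1", "tok2"), "tok3"),
}

print(token_ordered(racy, "store", "load"))       # False
print(token_ordered(chained, "store", "load"))    # True
print(token_ordered(rewritten, "store", "load"))  # False
print(token_ordered(rewritten, "store", "join"))  # True
print(token_ordered(rewritten, "load", "join"))   # True
```

In the rewritten form, the store and the load are no longer ordered with respect to each other, but any user of %tok3 is still ordered after both.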