Data Model#

cuTile is an array-based programming model. The fundamental data structure is multidimensional arrays with elements of a single homogeneous type. cuTile Python does not expose pointers, only arrays.

An array-based model was chosen because:

Arrays know their bounds, so accesses can be checked to ensure safety and correctness.
Array-based load/store operations can be efficiently lowered to speed-of-light hardware mechanisms.
Python programmers are used to array-based programming frameworks such as NumPy.
Pointers are not a natural choice for Python.

Within tile code, only the types described in this section are supported.

Global Arrays#

class cuda.tile.Array

A global array (or array) is a container of objects stored in a logical multidimensional space.

Global arrays are always stored in memory. Copying an array does not copy the underlying data.

Global arrays can be used in host code and tile code. They can be kernel parameters.

Any object that implements the DLPack format or the CUDA Array Interface can be used as a global array, for example CuPy arrays and PyTorch tensors.
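As a rough illustration of what "implements the DLPack format or the CUDA Array Interface" means in Python terms, the sketch below checks for the attributes those two protocols define (`__dlpack__` and `__cuda_array_interface__`). The helper `looks_like_global_array` and the class `FakeDeviceBuffer` are hypothetical names used only for this example; how cuTile itself detects compatible objects is not described here.

```python
def looks_like_global_array(obj):
    """Return True if obj advertises DLPack or the CUDA Array Interface.

    This is duck typing on the protocol attributes only; it is an
    assumption for illustration, not cuTile's actual check.
    """
    return hasattr(obj, "__dlpack__") or hasattr(obj, "__cuda_array_interface__")


class FakeDeviceBuffer:
    # Hypothetical stand-in for a GPU array type. The dictionary mimics the
    # general shape of a CUDA Array Interface descriptor and points at no
    # real memory.
    __cuda_array_interface__ = {
        "shape": (4,),
        "typestr": "<f4",
        "data": (0, False),
        "version": 3,
    }


print(looks_like_global_array(FakeDeviceBuffer()))  # True
print(looks_like_global_array([1, 2, 3]))           # False
```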

Tiles#

class cuda.tile.Tile

A tile array (or tile) is an immutable multidimensional collection of values that is local to a block.

The contents of a tile do not necessarily have a representation in memory. Tiles can be created by loading from global arrays or with factory functions. Tiles can also be stored into global arrays.

Tiles shall not be used in host code; they can only be used in tile code. Tiles shall not be kernel parameters.

Each dimension of a tile shall be a power of 2.
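The power-of-2 rule can be checked on the host before launching a kernel. The following is a minimal sketch in plain Python; `is_power_of_two` and `is_valid_tile_shape` are hypothetical helper names, not part of cuTile.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; False for zero, negatives, and other values."""
    return n > 0 and (n & (n - 1)) == 0


def is_valid_tile_shape(shape):
    """True if every dimension of the candidate tile shape is a power of 2."""
    return all(is_power_of_two(dim) for dim in shape)


print(is_valid_tile_shape((2, 4)))    # True
print(is_valid_tile_shape((1, 128)))  # True
print(is_valid_tile_shape((3, 4)))    # False
```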

Element & Tile Space#

[Figure: a 12x16 array indexed with tile shape 2x4, giving a 6x4 tile grid.]

[Figure: the same 12x16 array indexed with tile shape 4x2, giving a 3x8 tile grid.]

The element space of an array is the multidimensional space of elements contained in that array, stored in memory according to a certain layout (row major, column major, etc).

The tile space of an array is the multidimensional space of tiles into that array of a certain tile shape. Given a tile shape S, a tile index (i, j, ...) refers to the elements of the array that belong to the (i+1)-th tile along the first dimension, the (j+1)-th tile along the second dimension, and so on.

When accessing the elements of an array using tile indices, the multidimensional memory layout of the array is used. To access the tile space with a different memory layout, use the order parameter of load/store operations.
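The relationship between tile space and element space can be sketched in plain Python. The helpers `tile_grid` and `tile_element_ranges` below are hypothetical names, and the sketch assumes the default case where tile dimension k maps to array dimension k (no `order` remapping).

```python
def tile_grid(array_shape, tile_shape):
    """Number of tiles along each dimension (ceiling division)."""
    return tuple(-(-n // t) for n, t in zip(array_shape, tile_shape))


def tile_element_ranges(tile_index, tile_shape):
    """Element index ranges covered by one tile, one range per dimension."""
    return tuple(range(i * t, (i + 1) * t) for i, t in zip(tile_index, tile_shape))


# A 12x16 array viewed with two different tile shapes.
print(tile_grid((12, 16), (2, 4)))  # (6, 4)
print(tile_grid((12, 16), (4, 2)))  # (3, 8)

# Tile index (1, 2) with tile shape 2x4 covers rows 2..3 and columns 8..11.
rows, cols = tile_element_ranges((1, 2), (2, 4))
print(list(rows), list(cols))  # [2, 3] [8, 9, 10, 11]
```

The same tile index refers to different elements under a different tile shape, which is why a tile index is only meaningful together with its shape.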

Shape Broadcasting#

Shape broadcasting allows tiles with different shapes to be combined in arithmetic operations. When performing operations between tiles of different shapes, the smaller tile is automatically extended to match the shape of the larger one, following these rules:

Tiles are aligned by their trailing dimensions.
If the corresponding dimensions have the same size or one of them is 1, they are compatible.
If one tile has fewer dimensions, its shape is padded with 1s on the left.

Broadcasting follows the same semantics as NumPy, which makes code more concise and readable while maintaining computational efficiency.
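The three rules above can be written out as a small shape calculation. This is a sketch in plain Python of NumPy-style broadcasting on shapes only; `broadcast_shape` is a hypothetical helper, not a cuTile function.

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of two shapes, or raise ValueError."""
    ndim = max(len(a), len(b))
    # Pad the shorter shape with 1s on the left so trailing dimensions align.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    result = []
    for x, y in zip(a, b):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcast-compatible")
    return tuple(result)


print(broadcast_shape((4, 1), (8,)))   # (4, 8)
print(broadcast_shape((2, 4), (4,)))   # (2, 4)
```

A pair such as (2, 4) and (2,) is rejected, because the trailing dimensions 4 and 2 differ and neither is 1.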

Data Types#

class cuda.tile.DType#

A data type (or dtype) describes the type of the objects of an array, tile, or operation.

Dtypes determine how values are stored in memory and how operations on those values are performed. Dtypes are immutable.

Dtypes can be used in host code and tile code. They can be kernel parameters.

property bitwidth#: The number of bits in an element of the data type.

property name#: The name of the data type.

cuda.tile.bool_#: An 8-bit arithmetic dtype (True or False).

cuda.tile.uint8#: An 8-bit unsigned integer arithmetic dtype whose values exist on the interval [0, +255].

cuda.tile.uint16#: A 16-bit unsigned integer arithmetic dtype whose values exist on the interval [0, +65,535].

cuda.tile.uint32#: A 32-bit unsigned integer arithmetic dtype whose values exist on the interval [0, +4,294,967,295].

cuda.tile.uint64#: A 64-bit unsigned integer arithmetic dtype whose values exist on the interval [0, +18,446,744,073,709,551,615].

cuda.tile.int8#: An 8-bit signed integer arithmetic dtype whose values exist on the interval [−128, +127].

cuda.tile.int16#: A 16-bit signed integer arithmetic dtype whose values exist on the interval [−32,768, +32,767].

cuda.tile.int32#: A 32-bit signed integer arithmetic dtype whose values exist on the interval [−2,147,483,648, +2,147,483,647].

cuda.tile.int64#: A 64-bit signed integer arithmetic dtype whose values exist on the interval [−9,223,372,036,854,775,808, +9,223,372,036,854,775,807].

cuda.tile.float16#: An IEEE 754 half-precision (16-bit) binary floating-point arithmetic dtype (see IEEE 754-2019).

cuda.tile.float32#: An IEEE 754 single-precision (32-bit) binary floating-point arithmetic dtype (see IEEE 754-2019).

cuda.tile.float64#: An IEEE 754 double-precision (64-bit) binary floating-point arithmetic dtype (see IEEE 754-2019).

cuda.tile.bfloat16#: A 16-bit floating-point arithmetic dtype with 1 sign bit, 8 exponent bits, and 7 mantissa bits.

cuda.tile.tfloat32#: A 32-bit tensor floating-point arithmetic dtype with 1 sign bit, 8 exponent bits, and 10 mantissa bits (19-bit representation stored in 32-bit container).

cuda.tile.float8_e4m3fn#: An 8-bit floating-point arithmetic dtype with 1 sign bit, 4 exponent bits, and 3 mantissa bits.

cuda.tile.float8_e5m2#: An 8-bit floating-point arithmetic dtype with 1 sign bit, 5 exponent bits, and 2 mantissa bits.
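The integer intervals listed above follow directly from the bit width, and the bit fields of the reduced-precision floating-point dtypes can be summed to see how many bits each format actually uses. The sketch below is plain Python arithmetic; `int_range` and `FLOAT_FIELDS` are hypothetical names introduced for this example.

```python
def int_range(bitwidth, signed):
    """Inclusive (min, max) for a two's-complement or unsigned integer."""
    if signed:
        return -(1 << (bitwidth - 1)), (1 << (bitwidth - 1)) - 1
    return 0, (1 << bitwidth) - 1


print(int_range(8, signed=False))   # (0, 255)
print(int_range(16, signed=False))  # (0, 65535)
print(int_range(8, signed=True))    # (-128, 127)

# (sign bits, exponent bits, mantissa bits) as listed in this section.
FLOAT_FIELDS = {
    "bfloat16": (1, 8, 7),
    "tfloat32": (1, 8, 10),
    "float8_e4m3fn": (1, 4, 3),
    "float8_e5m2": (1, 5, 2),
}

for name, fields in FLOAT_FIELDS.items():
    print(name, sum(fields))  # bfloat16 16, tfloat32 19, float8_* 8
```

The sum for tfloat32 is 19, which matches the note that its 19-bit representation is stored in a 32-bit container.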

Numeric & Arithmetic Data Types#

A numeric data type represents numbers. An arithmetic data type is a numeric data type that supports general arithmetic operations such as addition, subtraction, multiplication, and division.

Arithmetic Promotion#

Binary operations can be performed on two tile or scalar operands of different numeric dtypes.

When both operands are loosely typed numeric constants, then the result is also a loosely typed constant: for example, 5 + 7 is a loosely typed integral constant 12, and 5 + 3.0 is a loosely typed floating-point constant 8.0.

If any of the operands is not a loosely typed numeric constant, then both are promoted to a common dtype using the following process:

Each operand is classified into one of the three categories: boolean, integral, or floating-point. The categories are ordered as follows: boolean < integral < floating-point.
If either operand is a loosely typed numeric constant, a concrete dtype is picked for it: integral constants are treated as int32, int64, or uint64, depending on the value; floating-point constants are treated as float32.
If one of the two operands has a higher category than the other, then its concrete dtype is chosen as the common dtype.
If both operands are of the same category, but one of them is a loosely typed numeric constant, then the other operand’s dtype is picked as the common dtype.
Otherwise, the common dtype is computed according to the table below.

	b1	u8	u16	u32	u64	i8	i16	i32	i64	f16	f32	f64	bf	tf32	f8e4m3fn	f8e5m2
b1	b1	u8	u16	u32	u64	i8	i16	i32	i64	f16	f32	f64	bf	ERR	ERR	ERR
u8	u8	u8	u16	u32	u64	ERR	ERR	ERR	ERR	f16	f32	f64	bf	ERR	ERR	ERR
u16	u16	u16	u16	u32	u64	ERR	ERR	ERR	ERR	f16	f32	f64	bf	ERR	ERR	ERR
u32	u32	u32	u32	u32	u64	ERR	ERR	ERR	ERR	f16	f32	f64	bf	ERR	ERR	ERR
u64	u64	u64	u64	u64	u64	ERR	ERR	ERR	ERR	f16	f32	f64	bf	ERR	ERR	ERR
i8	i8	ERR	ERR	ERR	ERR	i8	i16	i32	i64	f16	f32	f64	bf	ERR	ERR	ERR
i16	i16	ERR	ERR	ERR	ERR	i16	i16	i32	i64	f16	f32	f64	bf	ERR	ERR	ERR
i32	i32	ERR	ERR	ERR	ERR	i32	i32	i32	i64	f16	f32	f64	bf	ERR	ERR	ERR
i64	i64	ERR	ERR	ERR	ERR	i64	i64	i64	i64	f16	f32	f64	bf	ERR	ERR	ERR
f16	f16	f16	f16	f16	f16	f16	f16	f16	f16	f16	f32	f64	ERR	ERR	ERR	ERR
f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f32	f64	f32	ERR	ERR	ERR
f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	ERR	ERR	ERR
bf	bf	bf	bf	bf	bf	bf	bf	bf	bf	ERR	f32	f64	bf	ERR	ERR	ERR
tf32	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	tf32	ERR	ERR
f8e4m3fn	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	f8e4m3fn	ERR
f8e5m2	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	ERR	f8e5m2

Legend:

b1: bool_
u8: uint8
u16: uint16
u32: uint32
u64: uint64
i8: int8
i16: int16
i32: int32
i64: int64
f16: float16
f32: float32
f64: float64
bf: bfloat16
tf32: tfloat32
f8e4m3fn: float8_e4m3fn
f8e5m2: float8_e5m2
ERR: Implicit promotion between these types is not supported
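The promotion of two concrete dtypes can be condensed into a few rules: a higher category wins, integers of the same signedness promote to the wider type, mixed signedness is an error, float16 and bfloat16 do not mix, and tfloat32 and the float8 types only combine with themselves. The sketch below encodes that reading in plain Python using the abbreviations from the legend. `common_dtype` is a hypothetical helper, not a cuTile function, and loosely typed constants are deliberately left out.

```python
BOOL, INTEGRAL, FLOATING = 0, 1, 2

UNSIGNED = ["u8", "u16", "u32", "u64"]   # ordered by width
SIGNED = ["i8", "i16", "i32", "i64"]     # ordered by width
FLOAT_WIDTH = {"f16": 16, "bf": 16, "f32": 32, "f64": 64}
RESTRICTED = {"tf32", "f8e4m3fn", "f8e5m2"}  # only promote with themselves


def category(dtype):
    if dtype == "b1":
        return BOOL
    if dtype in UNSIGNED or dtype in SIGNED:
        return INTEGRAL
    return FLOATING


def common_dtype(a, b):
    """Common dtype of two concrete dtypes, or TypeError where the table says ERR."""
    if a == b:
        return a
    if a in RESTRICTED or b in RESTRICTED:
        raise TypeError(f"no implicit promotion between {a} and {b}")
    cat_a, cat_b = category(a), category(b)
    if cat_a != cat_b:
        # The operand in the higher category determines the result.
        return a if cat_a > cat_b else b
    if cat_a == INTEGRAL:
        for family in (UNSIGNED, SIGNED):
            if a in family and b in family:
                return family[max(family.index(a), family.index(b))]
        raise TypeError(f"no implicit promotion between {a} and {b}")
    # Both floating-point: float16 and bfloat16 have no common 16-bit type.
    if {a, b} == {"f16", "bf"}:
        raise TypeError(f"no implicit promotion between {a} and {b}")
    return a if FLOAT_WIDTH[a] > FLOAT_WIDTH[b] else b


print(common_dtype("b1", "i8"))    # i8
print(common_dtype("i64", "f16"))  # f16
print(common_dtype("bf", "f32"))   # f32
```

Note that an integral operand combined with a floating-point operand takes the floating-point dtype regardless of width, so int64 with float16 yields float16.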

Scalars#

A scalar is a single immutable value of a specific data type. A scalar and a 0D tile can be used interchangeably in a tile kernel. Scalars can also be kernel parameters.

Typing of a scalar has the following rules:

Constant scalars are loosely typed by default, for example, a literal 2 or a constant property like Tile.ndim, Tile.shape, or Array.ndim.
Array.shape and Array.stride are not constant by default and have the default integer type int32. Using int32 by default makes kernels more performant at the cost of limiting the maximum representable shape. This limitation will be lifted in the near future.

Tuples#

Tuples can be used in tile code. They cannot be kernel parameters.

Rounding Modes#

class cuda.tile.RoundingMode#

Rounding mode for floating-point operations.

RN = 'nearest_even'#: Round to nearest (ties to even).

RZ = 'zero'#: Round towards zero (truncate).

RM = 'negative_inf'#: Round towards negative infinity.

RP = 'positive_inf'#: Round towards positive infinity.

FULL = 'full'#: Full precision rounding mode.

APPROX = 'approx'#: Approximate rounding mode.

RZI = 'nearest_int_to_zero'#: Round towards zero to the nearest integer.
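The four directed modes (RN, RZ, RM, RP) correspond to standard rounding directions that Python's `decimal` module also offers. The sketch below uses that module purely as an analogy: it rounds to an integer so the differences are easy to see, whereas in tile code the rounding mode applies to the floating-point result of an operation. `ANALOGUES` and `round_to_integer` are hypothetical names, and FULL, APPROX, and RZI are not modeled.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN, ROUND_FLOOR, ROUND_CEILING

# Assumed correspondence between the mode strings above and decimal's modes.
ANALOGUES = {
    "nearest_even": ROUND_HALF_EVEN,  # RN
    "zero": ROUND_DOWN,               # RZ (truncate)
    "negative_inf": ROUND_FLOOR,      # RM
    "positive_inf": ROUND_CEILING,    # RP
}


def round_to_integer(value, mode):
    """Round a decimal string to an integer using the analogous mode."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ANALOGUES[mode]))


for mode in ANALOGUES:
    print(mode, round_to_integer("2.5", mode), round_to_integer("-2.5", mode))
# nearest_even  2 -2
# zero          2 -2
# negative_inf  2 -3
# positive_inf  3 -2
```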

Padding Modes#

class cuda.tile.PaddingMode#

Padding mode for load operation.

UNDETERMINED = 'undetermined'#: The padding value is not determined.

ZERO = 'zero'#: The padding value is zero.

NEG_ZERO = 'neg_zero'#: The padding value is negative zero.

NAN = 'nan'#: The padding value is NaN.

POS_INF = 'pos_inf'#: The padding value is positive infinity.

NEG_INF = 'neg_inf'#: The padding value is negative infinity.
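To make the effect of a padding mode concrete, the sketch below emulates a 2D tile load in plain Python, assuming that the padding value fills the positions of a tile that fall outside the array bounds. `PAD_VALUES` and `load_tile_2d` are hypothetical names for this illustration and do not reflect the cuTile load signature; the 'undetermined' mode is omitted because its value is unspecified.

```python
import math

# Assumed mapping from padding mode strings to fill values.
PAD_VALUES = {
    "zero": 0.0,
    "neg_zero": -0.0,
    "nan": math.nan,
    "pos_inf": math.inf,
    "neg_inf": -math.inf,
}


def load_tile_2d(array, tile_index, tile_shape, padding_mode="zero"):
    """Emulate loading one tile from a list-of-lists array with padding."""
    pad = PAD_VALUES[padding_mode]
    rows, cols = len(array), len(array[0])
    r0 = tile_index[0] * tile_shape[0]
    c0 = tile_index[1] * tile_shape[1]
    return [
        [array[r][c] if r < rows and c < cols else pad
         for c in range(c0, c0 + tile_shape[1])]
        for r in range(r0, r0 + tile_shape[0])
    ]


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

# Tile (1, 1) of shape 2x2 starts at element (2, 2); only that element is in bounds.
print(load_tile_2d(a, (1, 1), (2, 2), "neg_inf"))  # [[9.0, -inf], [-inf, -inf]]
```

Choosing the padding value to be the identity of a later reduction (for example negative infinity before a maximum, or zero before a sum) keeps the padded positions from affecting the result.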