cutlass.cute#

class cutlass.cute.ComposedLayout(*args: Any, **kwargs: Any)#

Bases: Value

ComposedLayout represents the functional composition of layouts in CuTe.

A ComposedLayout is formed by the composition of three components: inner o offset o outer, where:

  • inner: The inner layout or swizzle that is applied last

  • offset: An integer tuple representing a coordinate offset

  • outer: The outer layout that is applied first

ComposedLayout implements the functional composition operation where: R(c) := (inner o offset o outer)(c) := inner(offset + outer(c))

This composition allows for complex transformations of coordinates and indices, enabling operations like tiling, partitioning, and reshaping of data.

Variables:
  • inner – The inner layout or swizzle component

  • offset – The coordinate offset applied between inner and outer layouts

  • outer – The outer layout component

  • max_alignment – The maximum alignment of the composed layout

Examples:

# Create a composed layout with inner layout, offset, and outer layout

# inner layout: (4, 8):(1, 4)
inner_layout = make_layout((4, 8))

offset = (0, 0)

# outer layout: (2, 2):(1@0, 1@1)
outer_layout = make_layout((2, 2), stride=(1 * E(0), 1 * E(1)))

# composed layout: (inner o offset o outer)
composed = make_composed_layout(inner_layout, offset, outer_layout)

# Accessing components of the composed layout
inner = composed.inner
offset = composed.offset
outer = composed.outer

# map coordinate (1, 2) to linear index
#  - outer(1, 2) = (1, 2)
#  - offset + outer(1, 2) = (1, 2)
#  - inner(1, 2) = 1 * 1 + 2 * 4 = 9
idx = crd2idx((1, 2), composed)

# Composition is used in many tiling operations
# For example, in logical_product, raked_product, and blocked_product
__init__(*args: Any, **kwargs: Any) None#
class cutlass.cute.Swizzle(*args: Any, **kwargs: Any)#

Bases: Value

Swizzle is a transformation that permutes the elements of a layout.

Swizzles are used to rearrange data elements to improve memory access patterns and computational efficiency.

Swizzle is defined by three parameters:

  • MBase: The number of least-significant bits to keep constant

  • BBits: The number of bits in the mask

  • SShift: The distance to shift the mask

The mask is applied to the least-significant bits of the layout.

0bxxxxxxxxxxxxxxxYYYxxxxxxxZZZxxxx
                              ^--^ MBase is the number of least-sig bits to keep constant
                 ^-^       ^-^     BBits is the number of bits in the mask
                   ^---------^     SShift is the distance to shift the YYY mask
                                      (pos shifts YYY to the right, neg shifts YYY to the left)

e.g. Given
0bxxxxxxxxxxxxxxxxYYxxxxxxxxxZZxxx
the result is
0bxxxxxxxxxxxxxxxxYYxxxxxxxxxAAxxx where AA = ZZ xor YY
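
As an illustrative sketch, the bit manipulation above can be expressed in plain Python (the BBits/MBase/SShift values below are arbitrary example parameters, not library defaults):

# Sketch of the swizzle bit arithmetic for a positive shift
def apply_swizzle(offset, bbits=2, mbase=3, sshift=3):
    # The YY mask covers bbits bits located sshift bits above the ZZ mask,
    # which starts at bit mbase
    yy_mask = ((1 << bbits) - 1) << (mbase + sshift)
    # XOR the YY bits into the ZZ position: AA = ZZ xor YY
    return offset ^ ((offset & yy_mask) >> sshift)
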
cutlass.cute.E(mode: int | List[int]) ScaledBasis#

Create a unit ScaledBasis element with the specified mode.

This function creates a ScaledBasis with value 1 and the given mode. The mode represents the coordinate axis or dimension in the layout.

Parameters:

mode (Union[int, List[int]]) – The mode (dimension) for the basis element, either a single integer or a list of integers

Returns:

A ScaledBasis with value 1 and the specified mode

Return type:

ScaledBasis

Raises:

TypeError – If mode is not an integer or a list

Examples:

# Create a basis element for the first dimension (mode 0)
e0 = E(0)

# Create a basis element for the second dimension (mode 1)
e1 = E(1)

# Create a basis element for a hierarchical dimension
e_hier = E([0, 1])
class cutlass.cute.Atom(op: Op, trait: Trait)#

Bases: ABC

Atom base class.

An Atom is the composition of

  • a MMA or Copy Operation;

  • an internal MMA or Copy Trait.

An Operation is a pure Python class that is used to model a specific MMA or Copy instruction. The Trait wraps the underlying IR Value and provides access to the metadata of the instruction encoded using CuTe Layouts. When the Trait can be constructed straightforwardly from an Operation, the make_mma_atom or make_copy_atom API should be used. There are cases where constructing the metadata is not trivial and requires more information, for example to determine the number of bytes copied per TMA instruction (“the TMA vector length”). In such cases, dedicated helper functions are provided with an appropriate API such that the Atom is constructed internally in an optimal fashion for the user.

__init__(
op: Op,
trait: Trait,
) None#
property op: Op#
property type#
set(modifier, value, *, loc=None, ip=None) None#

Sets runtime fields of the Atom.

Some Atoms have runtime state, for example a tcgen05 MMA Atom

tiled_mma = cute.make_tiled_mma(some_tcgen05_mma_op)
tiled_mma.set(cute.nvgpu.tcgen05.Field.ACCUMULATE, True)

The set method provides a way for the user to modify such runtime state. Modifiable fields are provided by arch-specific enumerations, for example tcgen05.Field. The Atom instance internally validates both the field and the value provided by the user.

_unpack(*, loc=None, ip=None, **kwargs) cutlass._mlir.ir.Value#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.MmaAtom(op: Op, trait: Trait)#

Bases: Atom

The MMA Atom class.

property thr_id: cutlass.cute.typing.Layout#
property shape_mnk: cutlass.cute.typing.Shape#
property tv_layout_A: cutlass.cute.typing.Layout#
property tv_layout_B: cutlass.cute.typing.Layout#
property tv_layout_C: cutlass.cute.typing.Layout#
make_fragment_A(input, *, loc=None, ip=None)#
make_fragment_B(input, *, loc=None, ip=None)#
make_fragment_C(input, *, loc=None, ip=None)#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.CopyAtom(op: Op, trait: Trait)#

Bases: Atom

The Copy Atom class.

property value_type: Type[cutlass.cute.typing.Numeric]#
property thr_id: cutlass.cute.typing.Layout#
property layout_src_tv: cutlass.cute.typing.Layout#
property layout_dst_tv: cutlass.cute.typing.Layout#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.TiledCopy(op: Op, trait: Trait)#

Bases: CopyAtom

The tiled Copy class.

property layout_tv_tiled: cutlass.cute.typing.Layout#
property tiler_mn: cutlass.cute.typing.Tile#
property layout_src_tv_tiled: cutlass.cute.typing.Layout#
property layout_dst_tv_tiled: cutlass.cute.typing.Layout#
property size: int#
get_slice(
thr_idx: int | cutlass.cute.typing.Int32,
) ThrCopy#
retile(src, *, loc=None, ip=None)#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.TiledMma(op: Op, trait: Trait)#

Bases: MmaAtom

The tiled MMA class.

property tv_layout_A_tiled: cutlass.cute.typing.Layout#
property tv_layout_B_tiled: cutlass.cute.typing.Layout#
property tv_layout_C_tiled: cutlass.cute.typing.Layout#
property permutation_mnk: cutlass.cute.typing.Tile#
property thr_layout_vmnk: cutlass.cute.typing.Layout#
property size: int#
get_tile_size(mode_idx: int) cutlass.cute.typing.Shape#
get_slice(
thr_idx: int | cutlass.cute.typing.Int32,
) ThrMma#
_partition_shape(operand_id, shape, *, loc=None, ip=None)#
partition_shape_A(shape_mk, *, loc=None, ip=None)#
partition_shape_B(shape_nk, *, loc=None, ip=None)#
partition_shape_C(shape_mn, *, loc=None, ip=None)#
_thrfrg(
operand_id,
input: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#
_thrfrg(
operand_id,
input: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor
_thrfrg_A(
input: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor#
_thrfrg_B(
input: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor#
_thrfrg_C(
input: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.TensorSSA(*args: Any, **kwargs: Any)#

Bases: ArithValue

A class representing thread-local data from a CuTe Tensor, with value semantics and immutability.

Parameters:
  • value (ir.Value) – Flattened vector as an ir.Value holding the logical data of the SSA tensor

  • shape (Shape) – The nested shape in CuTe of the vector

  • dtype (Type[Numeric]) – Data type of the tensor elements

Variables:
  • _shape – The nested shape in CuTe of the vector

  • _dtype – Data type of the tensor elements

Raises:

ValueError – If shape is not static

__init__(
value,
shape: cutlass.cute.typing.Shape,
dtype: Type[cutlass.cute.typing.Numeric],
)#

Initialize a new TensorSSA object.

Parameters:
  • value (ir.Value) – Flattened vector as an ir.Value holding the logical data of the SSA tensor

  • shape (Shape) – The nested shape in CuTe of the vector

  • dtype (Type[Numeric]) – Data type of the tensor elements

Raises:

ValueError – If shape is not static

property dtype: Type[cutlass.cute.typing.Numeric]#
property element_type: Type[cutlass.cute.typing.Numeric]#
property shape#
_apply_op(
op,
other: TensorSSA,
flip,
*,
loc,
ip,
) TensorSSA#
_apply_op(
op,
other: cutlass.cutlass_dsl.cutlass_arith.ArithValue,
flip,
*,
loc,
ip,
) TensorSSA
_apply_op(
op,
other: int | float | bool,
flip,
*,
loc,
ip,
) TensorSSA
_flatten_shape_and_coord(crd, *, loc=None, ip=None)#
_build_result(res_vect, res_shp, *, loc=None, ip=None)#
to(
dtype: Type[cutlass.cute.typing.Numeric],
*,
loc=None,
ip=None,
)#

Convert the tensor to a different numeric type.

Parameters:

dtype (Type[Numeric]) – The target numeric type to cast to.

Returns:

A new tensor with the same shape but with elements cast to the target type.

Return type:

TensorSSA

Raises:
  • TypeError – If dtype is not a subclass of Numeric.

  • NotImplementedError – If dtype is an unsigned integer type.

ir_value(*, loc=None, ip=None)#
reduce(
op,
init_val,
reduction_profile: cutlass.cute.typing.Coord,
*,
loc=None,
ip=None,
)#

Perform reduce on selected modes with given predefined reduction op.

Parameters:
  • op (operator) – The reduction operator to use (operator.add or operator.mul)

  • init_val (numeric) – The initial value for the reduction

  • reduction_profile (Coord) – Specifies which dimensions to reduce. Dimensions marked with ‘_’ are kept.

Returns:

The reduced tensor

Return type:

TensorSSA

Examples

reduce(f32 o (4,))

=> f32

reduce(f32 o (4, 5))

=> f32

reduce(f32 o (4, (5, 4)), reduction_profile=(_, 1))

=> f32 o (4,)

reduce(f32 o (4, (5, 4)), reduction_profile=(_, (_, 1)))

=> f32 o (4, (5,))

cutlass.cute.assume(src, divby=None, *, loc=None, ip=None)#
cutlass.cute.is_integer(a) bool#

Check if an object is a static or dynamic integer

cutlass.cute.is_int_tuple(a) bool#
cutlass.cute.is_static(
x: cutlass._mlir.ir.Type | cutlass._mlir.ir.Value | cutlass.cute.typing.XTuple,
) bool#

Check if a value is statically known at compile time.

In CuTe, static values are those whose values are known at compile time, as opposed to dynamic values which are only known at runtime.

Parameters:

x (Union[ir.Type, ir.Value, XTuple]) – The value to check

Returns:

True if the value is static, False otherwise

Return type:

bool

Raises:

TypeError – If an unsupported type is provided

cutlass.cute.size(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Shape | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
mode: List[int] = [],
*,
loc=None,
ip=None,
) cutlass.cute.typing.Int#

Return size of domain of layout or tensor.

Computes the size (number of elements) in the domain of a layout or tensor. For layouts, this corresponds to the shape of the coordinate space. See Section 3.2 in the CuTe Whitepaper for more details on layout domains.

Parameters:
  • a (IntTuple, Shape, Layout, ComposedLayout or Tensor) – The input object whose size to compute

  • mode (list of int, optional) – List of mode(s) for size calculation. If empty, computes total size, defaults to []

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

Static size of layout or tensor if static, otherwise a Value

Return type:

int or Value

Raises:

ValueError – If mode contains non-integer elements
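
Examples:

# Illustrative sketch: total size of a hierarchical layout
layout = make_layout((4, (3, 2)))
n = size(layout)              # 24 = 4 * 3 * 2

# Size of the second mode only
n1 = size(layout, mode=[1])   # 6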

cutlass.cute.has_underscore(a: cutlass.cute.typing.XTuple) bool#
cutlass.cute.slice_(src, coord: cutlass.cute.typing.Coord, *, loc=None, ip=None)#

Perform a slice operation on a source object using the given coordinate.

This function implements CuTe’s slicing operation which extracts a subset of elements from a source object (tensor, layout, etc.) based on a coordinate pattern. The slice operation preserves the structure of the source while selecting specific elements.

Parameters:
  • src (Union[Tensor, Layout, IntTuple, Value]) – Source object to be sliced (tensor, layout, tuple, etc.)

  • coord (Coord) – Coordinate pattern specifying which elements to select

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new object containing the sliced elements

Return type:

Union[Tensor, Layout, IntTuple, tuple]

Raises:

ValueError – If the coordinate pattern is incompatible with source

Examples:

# Layout slicing
layout = make_layout((4,4))

# Select 1st index of first mode and keep all elements in second mode
sub_layout = slice_(layout, (1, None))
# Basic tensor slicing
tensor = make_tensor(...)           # Create a 2D tensor

# Select 1st index of first mode and keep all elements in second mode
sliced = slice_(tensor, (1, None))
# Select 2nd index of second mode and keep all elements in first mode
sliced = slice_(tensor, (None, 2))

Note

  • None represents keeping all elements in that mode

  • Slicing preserves the layout/structure of the original object

  • Can be used for:
      - Extracting sub-tensors/sub-layouts
      - Creating views into data
      - Selecting specific patterns of elements

cutlass.cute.depth(
a: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
) int#

Returns the depth (nesting level) of a tuple, layout, or tensor.

The depth of a tuple is the maximum depth of its elements plus 1. For an empty tuple, the depth is 1. For layouts and tensors, the depth is determined by the depth of their shape. For non-tuple values (e.g., integers), the depth is considered 0.

Parameters:

a (Union[XTuple, Layout, ComposedLayout, Tensor, Any]) – The object whose depth is to be determined

Returns:

The depth of the input object

Return type:

int

Example:

>>> depth(1)
0
>>> depth((1, 2))
1
>>> depth(((1, 2), (3, 4)))
2
cutlass.cute.rank(
a: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
) int#

Returns the rank (dimensionality) of a tuple, layout, or tensor.

The rank of a tuple is its length. For layouts and tensors, the rank is determined by the rank of their shape. For non-tuple values (e.g., integers), the rank is considered 1 for convenience.

Parameters:

a (Union[XTuple, Layout, ComposedLayout, Tensor, Any]) – The object whose rank is to be determined

Returns:

The rank of the input object

Return type:

int

This function is used in layout algebra to determine the dimensionality of tensors and layouts for operations like slicing and evaluation.
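
Example:

>>> rank((2, 3))
2
>>> rank(((1, 2), 3))
2
>>> rank(4)
1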

cutlass.cute.printf(*args, loc=None, ip=None) None#

Print a value or a list of values.

Parameters:
  • args (list) – List of values to print

  • loc (source location, optional) – Source location where it’s called, defaults to None

  • ip (insertion pointer, optional) – Insertion pointer, defaults to None

Raises:

ValueError – If no arguments are provided or if an unsupported argument type is passed

cutlass.cute.print_tensor(
tensor: cutlass.cute.typing.Tensor,
*,
verbose: bool = False,
loc=None,
ip=None,
)#

Print content of the tensor in human readable format.

tensor(raw_ptr<@…, Float32, generic, align(4)> o (8,5):(5,1), data=
       [[-0.4326, -0.5434,  0.1238,  0.7132,  0.8042],
        [-0.8462,  0.9871,  0.4389,  0.7298,  0.6948],
        [ 0.3426,  0.5856,  0.1541,  0.2923,  0.6976],
        [-0.1649,  0.8811,  0.1788,  0.1404,  0.2568],
        [-0.2944,  0.8593,  0.4171,  0.8998,  0.1766],
        [ 0.8814,  0.7919,  0.7390,  0.4566,  0.1576],
        [ 0.9159,  0.7577,  0.6918,  0.0754,  0.0591],
        [ 0.6551,  0.1626,  0.1189,  0.0292,  0.8655]])

cutlass.cute.pretty_str(arg) str#

Constructs a concise readable pretty string.

cutlass.cute.make_layout(
shape: cutlass.cute.typing.Shape,
*,
stride: cutlass.cute.typing.Stride | None = None,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#

Create a CuTe Layout object from shape and optional stride information.

A Layout in CuTe represents the mapping between logical and physical coordinates of a tensor. This function creates a Layout object that defines how tensor elements are arranged in memory.

Parameters:
  • shape (Shape) – Shape of the layout defining the size of each mode

  • stride (Union[Stride, None]) – Optional stride values for each mode, defaults to None

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new Layout object with the specified shape and stride

Return type:

Layout

Examples:

# Create a 2D compact left-most layout with shape (4,4)
layout = make_layout((4,4))                     # compact left-most layout

# Create a left-most layout with custom strides
layout = make_layout((4,4), stride=(1,4))       # left-most layout with strides (1,4)

# Create a layout for a 3D tensor
layout = make_layout((32,16,8))                 # left-most layout

# Create a layout with custom strides
layout = make_layout((2,2,2), stride=(4,1,2))   # layout with strides (4,1,2)

Note

  • If stride is not provided, a default compact left-most stride is computed based on the shape

  • The resulting layout maps logical coordinates to physical memory locations

  • The layout object can be used for tensor creation and memory access patterns

  • Strides can be used to implement:
      - Row-major vs column-major layouts
      - Padding and alignment
      - Blocked/tiled memory arrangements
      - Interleaved data formats

  • Stride is a keyword-only argument to improve readability, e.g.:
      - make_layout((3,4), (1,4)) can be confused with make_layout(((3,4), (1,4)))
      - make_layout((3,4), stride=(1,4)) is more readable

cutlass.cute.recast_layout(
new_type_bits,
old_type_bits,
src_layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.make_identity_layout(
shape: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#

Create an identity layout with the given shape.

An identity layout maps logical coordinates directly to themselves without any transformation. This is equivalent to a layout with stride (1@0,1@1,…,1@(N-1)).

Parameters:
  • shape (Shape) – The shape of the layout

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new identity Layout object with the specified shape

Return type:

Layout

Examples:

# Create a 2D identity layout with shape (4,4)
layout = make_identity_layout((4,4))     # stride=(1@0,1@1)

# Create a 3D identity layout
layout = make_identity_layout((32,16,8)) # stride=(1@0,1@1,1@2)

Note

  • An identity layout is a special case where each coordinate maps to itself

  • Useful for direct coordinate mapping without any transformation

cutlass.cute.make_ordered_layout(
shape: cutlass.cute.typing.Shape,
order: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#

Create a layout with a specific ordering of dimensions.

This function creates a layout where the dimensions are ordered according to the specified order parameter, allowing for custom dimension ordering in the layout.

Parameters:
  • shape (Shape) – The shape of the layout

  • order (Shape) – The ordering of dimensions

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new Layout object with the specified shape and dimension ordering

Return type:

Layout

Examples:

# Create a row-major layout
layout = make_ordered_layout((4,4), order=(1,0))

# Create a column-major layout
layout = make_ordered_layout((4,4), order=(0,1))         # stride=(1,4)

# Create a layout with custom dimension ordering for a 3D tensor
layout = make_ordered_layout((32,16,8), order=(2,0,1))   # stride=(128,1,16)

Note

  • The order parameter specifies the ordering of dimensions from fastest-varying to slowest-varying

  • For a 2D tensor, (0,1) creates a column-major layout, while (1,0) creates a row-major layout

  • The length of order must match the rank of the shape

cutlass.cute.make_composed_layout(
inner,
offset: cutlass.cute.typing.IntTuple,
outer: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) cutlass._mlir.ir.register_value_caster#

Create a composed layout by composing an inner transformation with an outer layout.

As described in the CuTe whitepaper, a composed layout applies a sequence of transformations to coordinates. The composition is defined as (inner ∘ offset ∘ outer), where the operations are applied from right to left.

Parameters:
  • inner (Union[Layout, Swizzle]) – The inner transformation (can be a Layout or Swizzle)

  • offset (IntTuple) – An integral offset applied between transformations

  • outer (Layout) – The outer (right-most) layout that is applied first

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new ComposedLayout representing the composition

Return type:

ComposedLayout

Examples:

# Create a basic layout
inner = make_layout(...)
outer = make_layout((4,4), stride=(E(0), E(1)))

# Create a composed layout with an offset
composed = make_composed_layout(inner, (2,0), outer)

Note

  • The composition applies transformations in the order: outer → offset → inner

  • The stride divisibility condition must be satisfied for valid composition

  • Certain compositions (like Swizzle with scaled basis) are invalid and will raise errors

  • Composed layouts inherit many properties from the outer layout

cutlass.cute.make_layout_tv(
thr_layout: cutlass.cute.typing.Layout,
val_layout: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) Tuple[cutlass.cute.typing.Shape, cutlass.cute.typing.Layout]#

Create a TV layout and logical tile size given separate thread and value layouts. A TV partitioner is inferred from the inputs. The input thread layout must be compact.

Parameters:
  • thr_layout (Layout) – the mn -> tid mapping; must be compact

  • val_layout (Layout) – the mn -> vid mapping

  • loc (source location for MLIR, optional)

  • ip (insertion point, optional)

Returns:

  • layout_mn – logical tile size

  • layout_tv – thread-value layout (tid, vid) -> mn
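
Example (a minimal sketch; the concrete layouts are illustrative):

# 32 threads arranged over a 4x8 tile, each owning a 2x1 block of values
thr_layout = make_layout((4, 8))
val_layout = make_layout((2, 1))
tiler_mn, layout_tv = make_layout_tv(thr_layout, val_layout)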

cutlass.cute.make_layout_image_mask(
lay: cutlass.cute.typing.Layout,
coord: cutlass.cute.typing.Coord,
mode: int,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Int16#

Makes a 16-bit integer mask of the image of a layout sliced at a given mode, accounting for the offset induced by the input coordinate on the other modes.

cutlass.cute.make_ptr(
dtype: Type[cutlass.cute.typing.Numeric] | None,
value,
mem_space: cutlass.cute.typing.AddressSpace = cutlass.cute.typing.AddressSpace.generic,
*,
assumed_align=None,
loc=None,
ip=None,
) cutlass.cute.typing.Pointer#
cutlass.cute.make_tensor(
iterator,
layout: cutlass.cute.typing.Shape | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#

Creates a tensor by composing an engine (iterator/pointer) with a layout.

A tensor is defined as T = E ∘ L, where E is an engine (array, pointer, or counting iterator) and L is a layout that maps logical coordinates to physical offsets. The tensor evaluates coordinates by applying the layout mapping and dereferencing the engine at the resulting offset.

Parameters:
  • iterator (Union[Pointer, IntTuple]) – Engine component (pointer, iterator, or counting iterator) that provides data access capabilities

  • layout (Union[Shape, Layout, ComposedLayout]) – Layout component that defines the mapping from logical coordinates to physical offsets

  • loc (Optional[Location]) – Source location for MLIR operation tracking, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for MLIR operation, defaults to None

Returns:

A tensor object representing the composition E ∘ L

Return type:

Tensor

Raises:

ValueError – If iterator type is not supported

Examples:

# Create a tensor with row-major layout
layout = make_layout((64, 128), stride=(128, 1))
tensor = make_tensor(ptr, layout)

# Create a tensor with hierarchical layout
layout = make_layout(((128, 8), (1, 4, 1)), stride=((32, 1), (0, 8, 4096)))
tensor = make_tensor(smem_ptr, layout)

# Create a counting tensor
layout = make_layout(2, stride=16 * E(0))
tensor = make_tensor(5, layout)

Notes

  • The engine (iterator) must support random access operations

  • Common engine types include raw pointers, arrays, and random-access iterators

  • The layout defines both the shape (logical dimensions) and stride (physical mapping)

  • Supports both direct coordinate evaluation T(c) and partial evaluation (slicing)

cutlass.cute.make_identity_tensor(
shape: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#

Creates an identity tensor with the given shape.

An identity tensor maps each coordinate to itself, effectively creating a counting sequence within the shape’s bounds. This is useful for generating coordinate indices or creating reference tensors for layout transformations.

Parameters:
  • shape (Shape) – The shape defining the tensor’s dimensions. Can be a simple integer sequence or a hierarchical structure ((m,n),(p,q))

  • loc (Optional[Location]) – Source location for MLIR operation tracking, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for MLIR operation, defaults to None

Returns:

A tensor that maps each coordinate to itself

Return type:

Tensor

Examples:

# Create a simple 1D counting tensor
tensor = make_identity_tensor(6)  # [0,1,2,3,4,5]

# Create a 2D counting tensor
tensor = make_identity_tensor((3,2))  # [(0,0),(1,0),(2,0),(0,1),(1,1),(2,1)]

# Create hierarchical counting tensor
tensor = make_identity_tensor(((2,1),3))
# [((0,0),0),((1,0),0),((0,0),1),((1,0),1),((0,0),2),((1,0),2)]

Notes

  • The shape parameter follows CuTe’s IntTuple concept

  • Coordinates are ordered colexicographically

  • Useful for generating reference coordinates in layout transformations

cutlass.cute.make_fragment(
layout_or_shape: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape,
dtype: Type[cutlass.cute.typing.Numeric],
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
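
Example (a minimal sketch):

# Allocate a register-backed tensor for a (4, 2) fragment of Float32
frg = make_fragment((4, 2), cutlass.Float32)
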
cutlass.cute.make_fragment_like(src, dtype=None, *, loc=None, ip=None)#

Create a tensor on the stack with a compact layout matching the shape of the source.

Given a tensor, this function creates a fragment tensor with a compact layout of the same shape; given a layout, it creates a new layout with the same shape. The strides of the new layout follow the order induced by the source’s strides, with special handling of the 0th mode: it is always stride-1 and generated in column-major order (LayoutLeft).

Parameters:
  • src (Union[Layout, ComposedLayout, Tensor]) – The source layout or tensor whose shape will be matched

  • dtype (Type[Numeric], optional) – The element type for the fragment tensor, defaults to None

  • loc (Location, optional) – Source location for MLIR operations, defaults to None

  • ip (InsertionPoint, optional) – Insertion point for MLIR operations, defaults to None

Returns:

A new layout or fragment tensor with matching shape

Return type:

Union[Layout, Tensor]

Examples

Creating a rmem tensor from a tensor:

smem_tensor = cute.make_tensor(smem_ptr, layout)
frag_tensor = cute.make_fragment_like(smem_tensor, cutlass.Float32)
# frag_tensor will be a register-backed tensor with the same shape

Creating a fragment with a different element type:

tensor = cute.make_tensor(gmem_ptr, layout)
bool_frag = cute.make_fragment_like(tensor, cutlass.Boolean)
# bool_frag will be a register-backed tensor with Boolean elements

Notes

  • When used with a Tensor, if a type is provided, it will create a new fragment tensor with that element type.

  • For layouts with ScaledBasis strides, the function creates a fragment from the shape only.

  • This function is commonly used in GEMM and other tensor operations to create register storage for intermediate results.

cutlass.cute.recast_ptr(
ptr: cutlass.cute.typing.Pointer,
swizzle_=None,
dtype: Type[cutlass.cute.typing.Numeric] | None = None,
loc=None,
ip=None,
) cutlass.cute.typing.Pointer#
cutlass.cute.recast_tensor(
src: cutlass.cute.typing.Tensor,
dtype: Type[cutlass.cute.typing.Numeric],
swizzle_=None,
*,
loc=None,
ip=None,
)#
cutlass.cute.get(input, mode: List[int], *, loc=None, ip=None)#

Extract a specific element or sub-layout from a layout or tuple.

This function recursively traverses the input according to the mode indices, extracting the element at the specified path. For layouts, this operation corresponds to extracting a specific sub-layout.

Parameters:
  • input (Layout, ComposedLayout, tuple) – The input layout or tuple to extract from

  • mode (List[int]) – Indices specifying the path to traverse for extraction

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The extracted element or sub-layout

Return type:

Layout, ComposedLayout, or element type

Raises:
  • ValueError – If any index in mode is out of range

  • TypeError – If mode contains non-integer elements or if input has unsupported type

Examples:

For a layout like ((4,8),2):((16,1),8), get with mode=[0,1] would extract the sub-layout 8:1, i.e. the second component of the first mode.

cutlass.cute.select(input, mode: List[int], *, loc=None, ip=None)#

Select modes from input.

Parameters:
  • input (Layout, ComposedLayout, tuple) – Input to select from

  • mode (List[int]) – Indices specifying which dimensions or elements to select

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

A new instance with selected dimensions/elements

Return type:

Layout, ComposedLayout, tuple

Raises:
  • ValueError – If any index in mode is out of range

  • TypeError – If the input type is invalid

cutlass.cute.front(input, *, loc=None, ip=None)#

Recursively get the first element of input.

This function traverses a hierarchical structure (like a layout or tensor) and returns the first element at the deepest level. It’s particularly useful for accessing the first stride value in a layout to determine properties like majorness.

Parameters:
  • input (Union[Tensor, Layout, Stride]) – The hierarchical structure to traverse

  • loc (source location, optional) – Source location where it’s called, defaults to None

  • ip (insertion pointer, optional) – Insertion pointer for IR generation, defaults to None

Returns:

The first element at the deepest level of the input structure

Return type:

Union[int, float, bool, ir.Value]

cutlass.cute.is_major(
mode,
stride: cutlass.cute.typing.Stride,
*,
loc=None,
ip=None,
) bool#

Check whether a mode in stride is the major mode.

cutlass.cute.find(
x: int,
t: tuple | cutlass._mlir.ir.Value | int,
*,
exclude_when: cutlass.cute.typing.IntTuple | None = None,
loc=None,
ip=None,
) int | Tuple[int, ...] | None#

Find the first position of x in t. If exclude_when is provided, positions where the comparison equals the comparison value are excluded from the search results.

Parameters:
  • x (int) – The static integer x to search for

  • t (Union[tuple, ir.Value, int]) – The search space

  • exclude_when (Optional[Tuple[int, Union[tuple, ir.Value, int]]]) – A tuple of (comparison_value, comparison) - positions where comparison equals comparison_value will be excluded from the search results

Returns:

Index if found at top level, tuple of indices showing nested position, or None if not found

Return type:

Union[int, Tuple[int, …], None]
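
Examples (a hedged sketch of expected results):

find(4, (2, (4, 8)))   # (1, 0) -- nested position of 4
find(3, (2, (4, 8)))   # None -- 3 does not appear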

cutlass.cute.coalesce(
input,
*,
target_profile: cutlass.cute.typing.Coord | None = None,
loc=None,
ip=None,
)#
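
Example (a hedged sketch using the classic coalesce result):

# Flatten the layout and combine contiguous modes
layout = make_layout((2, (1, 6)), stride=(1, (6, 2)))
result = coalesce(layout)   # expected: 12:1
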
cutlass.cute.group_modes(input, begin: int, end: int = -1, *, loc=None, ip=None)#

Group modes of a hierarchical tuple or layout into a single mode.

This function groups a range of modes from the input object into a single mode, creating a hierarchical structure. For tuples, it creates a nested tuple containing the specified range of elements. For layouts and other CuTe objects, it creates a hierarchical representation where the specified modes are grouped together.

Parameters:
  • input (Layout, ComposedLayout, tuple, Shape, Stride, etc.) – Input object to group modes from (layout, tuple, etc.)

  • begin (int) – Beginning index of the range to group (inclusive)

  • end (int) – Ending index of the range to group (exclusive)

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

A new object with the specified modes grouped

Return type:

Same type as input with modified structure

Examples:

# Group modes in a tuple
t = (2, 3, 4, 5)
grouped = group_modes(t, 1, 3)  # (2, (3, 4), 5)

# Group modes in a layout
layout = make_layout((2, 3, 4, 5))
grouped_layout = group_modes(layout, 1, 3)  # Layout with shape (2, (3, 4), 5)

# Group modes in a shape
shape = make_shape(2, 3, 4, 5)
grouped_shape = group_modes(shape, 0, 2)  # Shape ((2, 3), 4, 5)
cutlass.cute.cosize(
a: cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
mode: List[int] = [],
*,
loc=None,
ip=None,
)#

Return the size of the codomain of a layout or tensor. Returns a static value if the type is static.

Parameters:
  • a (Union[Layout, ComposedLayout, Tensor]) – Layout, ComposedLayout, or Tensor object

  • mode (List[int], optional) – List of mode(s) for cosize calculation

  • loc (optional) – Location information for diagnostics, defaults to None

  • ip (optional) – Instruction pointer for diagnostics, defaults to None

Returns:

Static cosize of the layout or tensor if static, otherwise a dynamic Value

Return type:

Union[int, Value]

cutlass.cute.size_in_bytes(
dtype: Type[cutlass.cute.typing.Numeric],
layout: cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
*,
loc=None,
ip=None,
)#

Calculate the size in bytes of a tensor based on its data type and layout.

Parameters:
  • dtype (Type[Numeric]) – The DSL numeric data type

  • layout (Layout, optional) – The layout of the elements. If None, the function returns 0

  • loc (optional) – Location information for diagnostics, defaults to None

  • ip (optional) – Instruction pointer for diagnostics, defaults to None

Returns:

The total size in bytes. Returns 0 if the layout is None

Return type:

int
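
Example:

# 64 * 32 Float16 elements -> 4096 bytes
nbytes = size_in_bytes(cutlass.Float16, make_layout((64, 32)))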

cutlass.cute.flatten_to_tuple(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Coord | cutlass.cute.typing.Shape | cutlass.cute.typing.Stride,
) tuple#

Flattens a potentially nested tuple structure into a flat tuple.

This function recursively traverses the input structure and flattens it into a single-level tuple, preserving the order of elements.

Parameters:

a (Union[IntTuple, Coord, Shape, Stride]) – The structure to flatten

Returns:

A flattened tuple containing all elements from the input

Return type:

tuple

Examples:

flatten_to_tuple((1, 2, 3))       # Returns (1, 2, 3)
flatten_to_tuple(((1, 2), 3))     # Returns (1, 2, 3)
flatten_to_tuple((1, (2, (3,))))  # Returns (1, 2, 3)
cutlass.cute.flatten(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Coord | cutlass.cute.typing.Shape | cutlass.cute.typing.Stride | cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
) tuple#

Flattens a CuTe data structure into a simpler form.

For tuples, this function flattens the structure into a single-level tuple. For non-tuple types, it returns the input unchanged.

Parameters:

a (Union[IntTuple, Coord, Shape, Stride, Layout, Tensor]) – The structure to flatten

Returns:

The flattened structure

Return type:

Union[tuple, Any]

Raises:

NotImplementedError – If input is a Layout or Tensor

Examples:

flatten((1, 2, 3))           # Returns (1, 2, 3)
flatten(((1, 2), (3, 4)))    # Returns (1, 2, 3, 4)
flatten(5)                   # Returns 5
cutlass.cute.product(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
)#

Return product of the given IntTuple or Shape.

Computes the product of all elements in the input tuple or shape. Returns static value if type is static.

Parameters:
  • a (IntTuple or Shape) – The input tuple or shape

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

Static product of IntTuple or Shape if static, otherwise a Value

Return type:

int or Value

Raises:

TypeError – If input is not an IntTuple or Shape

cutlass.cute.product_like(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Shape,
target_profile: cutlass.cute.typing.XTuple,
*,
loc=None,
ip=None,
)#

Return product of the given IntTuple or Shape at leaves of target_profile.

This function computes products according to the structure defined by target_profile.

Parameters:
  • a (IntTuple or Shape) – The input tuple or shape

  • target_profile (XTuple) – The profile that guides how products are computed

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The resulting tuple with products computed according to target_profile

Return type:

IntTuple or Shape

Raises:
  • TypeError – If inputs have incompatible types

  • ValueError – If inputs have incompatible shapes

cutlass.cute.product_each(a, *, loc=None, ip=None)#

Compute products for each component of the input.

Returns a rank(a) tuple result such that get(result, mode=[i]) == product(get(a, mode=[i]))

Parameters:
  • a (IntTuple or Shape) – The input tuple or shape

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

A tuple containing products for each component

Return type:

tuple

Raises:

TypeError – If input is not an IntTuple or Shape

cutlass.cute.prepend(
input,
elem,
up_to_rank: None | int = None,
*,
loc=None,
ip=None,
)#

Extend input to rank up_to_rank by prepending elem in front of input.

This function extends the input object by prepending elements to reach a desired rank. It supports various CuTe types including shapes, layouts, tensors etc.

Parameters:
  • input (Union[Shape, Stride, Coord, IntTuple, Tile, Layout, ComposedLayout, Tensor]) – Source to be prepended to

  • elem (Union[Shape, Stride, Coord, IntTuple, Tile, Layout]) – Element to prepend to input

  • up_to_rank (Union[None, int], optional) – The target rank after extension, defaults to None

  • loc (Optional[Location]) – Source location for MLIR, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point, defaults to None

Returns:

The extended result with prepended elements

Return type:

Union[Shape, Stride, Coord, IntTuple, Tile, Layout, ComposedLayout, Tensor]

Raises:
  • ValueError – If up_to_rank is less than input’s current rank

  • TypeError – If input or elem has unsupported type

Examples:

# Prepend to a Shape
shape = (4,4)
prepend(shape, 2)                   # Returns (2,4,4)

# Prepend to a Layout
layout = make_layout((8,8))
prepend(layout, make_layout((2,)))  # Returns (2,8,8):(1,1,8)

# Prepend with target rank
coord = (1,1)
prepend(coord, 0, up_to_rank=4)     # Returns (0,0,1,1)
cutlass.cute.append(
input,
elem,
up_to_rank: None | int = None,
*,
loc=None,
ip=None,
)#

Extend input to rank up_to_rank by appending elem to the end of input.

This function extends the input object by appending elements to reach a desired rank. It supports various CuTe types including shapes, layouts, tensors etc.

Parameters:
  • input (Union[Shape, Stride, Coord, IntTuple, Tile, Layout, ComposedLayout, Tensor]) – Source to be appended to

  • elem (Union[Shape, Stride, Coord, IntTuple, Tile, Layout]) – Element to append to input

  • up_to_rank (Union[None, int], optional) – The target rank after extension, defaults to None

  • loc (Optional[Location]) – Source location for MLIR, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point, defaults to None

Returns:

The extended result with appended elements

Return type:

Union[Shape, Stride, Coord, IntTuple, Tile, Layout, ComposedLayout, Tensor]

Raises:
  • ValueError – If up_to_rank is less than input’s current rank

  • TypeError – If input or elem has unsupported type

Examples:

# Append to a Shape
shape = (4,4)
append(shape, 2)                   # Returns (4,4,2)

# Append to a Layout
layout = make_layout((8,8))
append(layout, make_layout((2,)))  # Returns (8,8,2):(1,8,1)

# Append with target rank
coord = (1,1)
append(coord, 0, up_to_rank=4)     # Returns (1,1,0,0)

Note

  • The function preserves the structure of the input while extending it

  • Can be used to extend tensors, layouts, shapes and other CuTe types

  • When up_to_rank is specified, fills remaining positions with elem

  • Useful for tensor reshaping and layout transformations

cutlass.cute.prepend_ones(
t: cutlass.cute.typing.Tensor,
up_to_rank: None | int = None,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
cutlass.cute.append_ones(
t: cutlass.cute.typing.Tensor,
up_to_rank: None | int = None,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
cutlass.cute.ceil_div(
input: cutlass.cute.typing.Shape,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Shape#

Compute the ceiling division of a target shape by a tiling specification.

This function computes the number of tiles required to cover the target domain. It is equivalent to the second mode of zipped_divide(input, tiler).

Parameters:
  • input (Shape) – A tuple of integers representing the dimensions of the target domain.

  • tiler (Union[Layout, Shape, Tile]) – The tiling specification.

  • loc (optional) – Optional location information for IR diagnostics.

  • ip (optional) – Optional instruction pointer or context for underlying IR functions.

Returns:

A tuple of integers representing the number of tiles required along each dimension, i.e. the result of the ceiling division of the input dimensions by the tiler dimensions.

Return type:

Shape

Example:

import cutlass.cute as cute
@cute.jit
def foo():
    input = (10, 6)
    tiler = (3, 4)
    result = cute.ceil_div(input, tiler)
    print(result)  # Outputs: (4, 2)
cutlass.cute.round_up(
a: cutlass.cute.typing.IntTuple,
b: cutlass.cute.typing.IntTuple,
) cutlass.cute.typing.IntTuple#

Rounds up elements of a using elements of b.

cutlass.cute.slice_and_offset(coord, src, *, loc=None, ip=None)#
cutlass.cute.crd2idx(
coord: cutlass.cute.typing.Coord,
layout,
*,
loc=None,
ip=None,
)#

Convert a multi-dimensional coordinate into a value using the specified layout.

This function computes the inner product of the flattened coordinate and stride:

index = sum(c * s for c, s in zip(flatten(coord), flatten(stride)))

Parameters:
  • coord (Coord) – A tuple or list representing the multi-dimensional coordinate (e.g., (i, j) for a 2D layout).

  • layout (Layout or ComposedLayout) – A layout object that defines the memory storage layout, including shape and stride, used to compute the inner product.

  • loc (optional) – Optional location information for IR diagnostics.

  • ip (optional) – Optional instruction pointer or context for underlying IR functions.

Returns:

The result of applying the layout transformation to the provided coordinate.

Return type:

Any type that the layout maps to

Example:

import cutlass.cute as cute
@cute.jit
def foo():
    L = cute.make_layout((5, 4), stride=(4, 1))
    idx = cute.crd2idx((2, 3), L)
    # Computed as: 2 * 4 + 3 = 11
    print(idx)
foo()  # Expected output: 11
cutlass.cute.domain_offset(
coord: cutlass.cute.typing.Coord,
tensor: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
cutlass.cute.elem_less(
lhs: cutlass.cute.typing.Shape | cutlass.cute.typing.IntTuple | cutlass.cute.typing.Coord,
rhs: cutlass.cute.typing.Shape | cutlass.cute.typing.IntTuple | cutlass.cute.typing.Coord,
*,
loc=None,
ip=None,
)#
cutlass.cute.filter_zeros(input, *, target_profile=None, loc=None, ip=None)#

Filter out zeros from a layout or tensor.

This function replaces zero-stride modes of a layout or tensor with size-1 modes, removing their contribution. See Section 3.3 in the CuTe Whitepaper for more details on layout operations.

Parameters:
  • input (Layout or Tensor) – The input layout or tensor to filter

  • target_profile (optional) – Target profile for the filtered result, defaults to None

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The filtered layout or tensor with zeros removed

Return type:

Layout or Tensor

Raises:

TypeError – If input is not a Layout or Tensor

cutlass.cute.filter(
input: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
)#

Filter a layout or tensor.

This function filters a layout or tensor according to CuTe’s filtering rules.

Parameters:
  • input (Layout or Tensor) – The input layout or tensor to filter

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The filtered layout or tensor

Return type:

Layout or Tensor

Raises:

TypeError – If input is not a Layout or Tensor

cutlass.cute.tile_to_shape(
atom: cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
trg_shape: cutlass.cute.typing.Shape,
order: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster#
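
Example (a hedged sketch; the resulting strides depend on the atom):

# Tile an (8, 4) atom layout to cover a (32, 16) shape, ordering mode 0 first
atom = make_layout((8, 4))
tiled = tile_to_shape(atom, (32, 16), (0, 1))
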
cutlass.cute.shape_div(
lhs: cutlass.cute.typing.Shape,
rhs: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Shape#

Perform element-wise division of shapes.

This function performs element-wise division between two shapes.

Parameters:
  • lhs (Shape) – Left-hand side shape

  • rhs (Shape) – Right-hand side shape

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The result of element-wise division

Return type:

Shape
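
Example:

shape_div((12, 8), (4, 2))   # (3, 4)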

cutlass.cute.composition(
lhs,
rhs: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape | cutlass.cute.typing.Tile,
*,
loc=None,
ip=None,
)#

Compose two layout representations using the CuTe layout algebra.

Compose a left-hand layout (or tensor) with a right-hand operand into a new layout R, such that for every coordinate c in the domain of the right-hand operand, the composed layout satisfies:

R(c) = A(B(c))

where A is the left-hand operand provided as lhs and B is the right-hand operand provided as rhs. In this formulation, B defines the coordinate domain while A applies its transformation to B’s output, and the resulting layout R inherits the stride and shape adjustments from A.

Satisfies:

cute.shape(cute.composition(lhs, rhs)) is compatible with cute.shape(rhs)

Parameters:
  • lhs (Layout or Tensor) – The left-hand operand representing the transformation to be applied.

  • rhs (Layout, Shape, or Tile, or int or tuple) – The right-hand operand defining the coordinate domain. If provided as an int or tuple, it will be converted to a tile layout.

  • loc (optional) – Optional location information for IR diagnostics.

  • ip (optional) – Optional instruction pointer or context for underlying IR functions.

Returns:

A new composed layout R, such that for all coordinates c in the domain of rhs, R(c) = lhs(rhs(c)).

Return type:

Layout or Tensor

Example:

import cutlass.cute as cute
@cute.jit
def foo():
    # Create a layout that maps (i,j) to i*4 + j
    L1 = cute.make_layout((2, 3), stride=(4, 1))
    # Create a layout that maps (i,j) to i*3 + j
    L2 = cute.make_layout((3, 4), stride=(3, 1))
    # Compose L1 and L2
    L3 = cute.composition(L1, L2)
    # L3 now maps coordinates through L2 then L1
cutlass.cute.complement(
input: cutlass.cute.typing.Layout,
cotarget: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#

Compute the complement layout of the input layout with respect to the cotarget.

The complement of a layout A with respect to cotarget n is a layout A* such that for every k in Z_n and c in the domain of A, there exists a unique c* in the domain of A* where k = A(c) + A*(c*).

This operation is useful for creating layouts that partition a space in complementary ways, such as row and column layouts that together cover a matrix.

Parameters:
  • input (Layout) – The layout to compute the complement of

  • cotarget (Union[Layout, Shape]) – The target layout or shape that defines the codomain

  • loc (optional) – Optional location information for IR diagnostics

  • ip (optional) – Optional instruction pointer or context for underlying IR functions

Returns:

The complement layout

Return type:

Layout

Example:

import cutlass.cute as cute
@cute.jit
def foo():
    # Create a layout selecting every 4th element of a 16-element vector
    sub_layout = cute.make_layout(4, stride=4)     # 4:4
    # Compute its complement with respect to 16
    comp_layout = cute.complement(sub_layout, 16)  # 4:1
    # Together, sub_layout and comp_layout tile the domain [0, 16)
cutlass.cute.right_inverse(
input: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#
cutlass.cute.left_inverse(
input: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#
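
Example (a hedged sketch of the defining property):

# For a bijective layout L, right_inverse(L) satisfies L(right_inverse(L)(i)) == i
layout = make_layout((4, 2), stride=(2, 1))   # maps onto [0, 8) bijectively
rinv = right_inverse(layout)
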
cutlass.cute.max_common_layout(
a: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#
cutlass.cute.max_common_vector(
a: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) int#
cutlass.cute.is_congruent(
a: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
) bool#

Returns whether a is congruent to b.

cutlass.cute.is_weakly_congruent(
a: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
) bool#

Returns whether a is weakly congruent to b.

cutlass.cute.logical_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.zipped_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.tiled_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.flat_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.raked_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.blocked_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
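
Example (a hedged sketch of blocked_product following the CuTe layout-product semantics):

# Tile a (2, 2) block across a (3, 4) arrangement of blocks
block = make_layout((2, 2))
tiler = make_layout((3, 4))
prod = blocked_product(block, tiler)
# expected shape: ((2, 3), (2, 4)) -- each mode pairs the block extent with the tiler extent
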
cutlass.cute.flat_divide(
target,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
)#
cutlass.cute.logical_divide(
target,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
)#
cutlass.cute.zipped_divide(
target,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
)#
cutlass.cute.tiled_divide(
target,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
)#
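
Example (a hedged sketch):

# Divide a (12, 8) layout by a (4, 2) tiler
layout = make_layout((12, 8))
tiled = zipped_divide(layout, (4, 2))
# expected shape: ((4, 2), (3, 4)) -- the tile mode first, then the rest mode
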
cutlass.cute.local_partition(
target: cutlass.cute.typing.Tensor,
tiler: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape,
index,
proj: cutlass.cute.typing.XTuple = 1,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
cutlass.cute.local_tile(
input: cutlass.cute.typing.Tensor,
tiler: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape,
coord: cutlass.cute.typing.Coord,
proj: cutlass.cute.typing.XTuple | None = None,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
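
Example (a hedged sketch; mA, bidx, and bidy are assumed to be defined elsewhere):

# Extract this CTA's (128, 64) tile of a global-memory tensor
gA = cute.local_tile(mA, (128, 64), (bidx, bidy))
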
cutlass.cute.make_mma_atom(
op: MmaOp,
*,
loc=None,
ip=None,
**kwargs,
) MmaAtom#

Makes an MMA Atom from an MMA Operation.

This function creates an MMA Atom from a given MMA Operation. Arbitrary keyword arguments can be provided for Op-specific additional parameters; none are currently used.

Parameters:

op (MmaOp) – The MMA Operation to construct an Atom for

Returns:

The MMA Atom

Return type:

MmaAtom

cutlass.cute.make_tiled_mma(
op_or_atom: Op | MmaAtom,
atom_layout_mnk=(1, 1, 1),
permutation_mnk=None,
*,
loc=None,
ip=None,
**kwargs,
) TiledMma#

Makes a tiled MMA from an MMA Operation or an MMA Atom.

Parameters:
  • op_or_atom (Union[Op, MmaAtom]) – The MMA Operation or Atom

  • atom_layout_mnk (Layout) – A Layout describing the tiling of Atom across threads

  • permutation_mnk (Tiler) – A permutation Tiler describing the tiling of Atom across values including any permutation of such tiling

Returns:

The resulting tiled MMA

Return type:

TiledMma
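
Example (a hedged sketch; some_mma_op is a placeholder Operation):

# Tile one MMA Atom across a 2x2x1 arrangement of Atoms
tiled_mma = cute.make_tiled_mma(some_mma_op, atom_layout_mnk=(2, 2, 1))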

cutlass.cute.make_copy_atom(
op: CopyOp,
copy_internal_type: Type[cutlass.cute.typing.Numeric],
*,
loc=None,
ip=None,
**kwargs,
) CopyAtom#

Makes a Copy Atom from a Copy Operation.

This function creates a Copy Atom from a given Copy Operation. Arbitrary keyword arguments can be provided for Op-specific additional parameters.

Example:

op = cute.nvgpu.CopyUniversalOp()
atom = cute.make_copy_atom(op, tensor_dtype, num_bits_per_copy=64)
Parameters:
  • op (CopyOp) – The Copy Operation to construct an Atom for

  • copy_internal_type (Type[Numeric]) – An internal data type used to construct the source/destination layouts in unit of tensor elements

Returns:

The Copy Atom

Return type:

CopyAtom

cutlass.cute.make_tiled_copy_tv(
atom,
thr_layout,
val_layout,
*,
loc=None,
ip=None,
) TiledCopy#

Create a tiled copy given separate thread and value layouts. A TV partitioner is inferred from the inputs. The input thread layout must be compact.

Parameters:
  • atom (CopyAtom) – the copy atom, e.g. simt_copy, simt_async_copy, tma_load, etc.

  • thr_layout (Layout) – the mn -> tid mapping; must be compact

  • val_layout (Layout) – the mn -> vid mapping

  • loc (source location for MLIR, optional)

  • ip (insertion point, optional)

Returns:

A tiled copy used for partitioning

Return type:

tiled_copy

cutlass.cute.make_tiled_copy(atom, layout_tv, tiler_mn, *, loc=None, ip=None)#

Create a tiled copy given a TV partitioner and a tiler.

Parameters:
  • atom (CopyAtom) – the copy atom, e.g. simt_copy, simt_async_copy, tma_load, etc.

  • layout_tv (Layout) – the thread-value layout

  • tiler_mn (Tiler) – the logical tile size

  • loc (source location for MLIR, optional)

  • ip (insertion point, optional)

Returns:

A tiled copy

Return type:

tiled_copy

cutlass.cute.make_tiled_copy_C_atom(
atom: CopyAtom,
mma: TiledMma,
*,
loc=None,
ip=None,
)#

Create the smallest tiled copy that can retile LayoutC_TV, for use in pipelined epilogues with subtiled stores.

Parameters:
  • atom (CopyAtom)

  • mma (TiledMma)

  • loc (source location for mlir (optional))

  • ip (insertion point (optional))

Returns:

A tiled copy used for partitioning

Return type:

tiled_copy

cutlass.cute.basic_copy(
src: cutlass.cute.typing.Tensor,
dst: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) None#

Performs a basic element-wise copy.

This function assumes the following pre-condition:

  1. size(src) == size(dst)

When the src and dst shapes are static, the pre-condition is verified and the element-wise loop is fully unrolled.

cutlass.cute.basic_copy_if(
pred: cutlass.cute.typing.Tensor,
src: cutlass.cute.typing.Tensor,
dst: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) None#

Performs a basic predicated element-wise copy.

This function assumes the following pre-conditions:

  1. size(src) == size(dst)

  2. size(src) == size(pred)

When all shapes are static, the pre-conditions are verified and the element-wise loop is fully unrolled.

cutlass.cute.autovec_copy(
src: cutlass.cute.typing.Tensor,
dst: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) None#

Auto-vectorizing SIMT copy policy.

Given source and destination tensors that are statically shaped, this policy determines the largest safe vector width for the copy instruction and performs the copy.
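
Example (a hedged sketch; tXsX is assumed to be a statically shaped tensor partition):

# Copy a shared-memory tile into a register fragment of matching shape
tXrX = cute.make_fragment_like(tXsX)
cute.autovec_copy(tXsX, tXrX)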

cutlass.cute.copy(
atom: CopyAtom,
src: cutlass.cute.typing.Tensor,
dst: cutlass.cute.typing.Tensor,
*,
pred: cutlass.cute.typing.Tensor | None = None,
loc=None,
ip=None,
**kwargs,
) None#

The Copy algorithm.

The “copy with Atom” expects source and destination tensors to be partitioned according to the provided Copy Atom. Some Atoms require additional Op-specific keyword arguments, for example TMA copies:

cute.copy(tma_atom, src, dst, tma_bar_ptr=mbar_ptr, mcast_mask=mask)

An additional predication tensor can be provided. If the partitioned tensors have the following logical profile ((ATOM_V,ATOM_REST),REST_M,...), the predication tensor must have a profile consistent with (ATOM_REST,REST_M,...).
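
Example (an illustrative sketch; tiled_copy, atom, tidx, and the gA/sA tensors are assumptions):

# Partition source and destination with a TiledCopy, then issue the copy
thr_copy = tiled_copy.get_slice(tidx)
tAgA = thr_copy.partition_S(gA)  # partitioned source, e.g. in gmem
tAsA = thr_copy.partition_D(sA)  # partitioned destination, e.g. in smem
cute.copy(atom, tAgA, tAsA)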

cutlass.cute.gemm(
atom: MmaAtom,
d: cutlass.cute.typing.Tensor,
a: cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.Tensor,
c: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
**kwargs,
) None#

The GEMM algorithm.

Computes D <- AB + C where C and D can alias. Note that some MMA Atoms (e.g. warpgroup-wide or tcgen05 MMAs) require manually setting an “accumulate” boolean field.

All tensors must be partitioned according to the provided MMA Atom.
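
Example (an illustrative sketch; tiled_mma, tidx, and the sA/sB/gC tensors are assumptions):

thr_mma = tiled_mma.get_slice(tidx)
tCsA = thr_mma.partition_A(sA)
tCsB = thr_mma.partition_B(sB)
tCgC = thr_mma.partition_C(gC)
# D <- A * B + C, accumulating in place into the partitioned C
cute.gemm(tiled_mma, tCgC, tCsA, tCsB, tCgC)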

cutlass.cute.full(
shape,
fill_value,
dtype: Type[cutlass.cute.typing.Numeric],
*,
loc=None,
ip=None,
) TensorSSA#

Return a new TensorSSA of given shape and type, filled with fill_value.

Parameters:
  • shape (tuple) – Shape of the new tensor.

  • fill_value (scalar) – Value to fill the tensor with.

  • dtype (Type[Numeric]) – Data type of the tensor.

Returns:

Tensor of fill_value with the specified shape and dtype.

Return type:

TensorSSA
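
Example:

# A 2x3 Float32 tensor filled with 0.5
t = cute.full((2, 3), 0.5, cutlass.Float32)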

cutlass.cute.full_like(
a: TensorSSA,
fill_value,
dtype: Type[cutlass.cute.typing.Numeric] | None = None,
*,
loc=None,
ip=None,
) TensorSSA#

Return a full TensorSSA with the same shape and type as a given array.

Parameters:
  • a (TensorSSA) – The shape and data-type of a define these same attributes of the returned tensor.

  • fill_value (scalar) – Fill value.

  • dtype (Union[None, Type[Numeric]], optional) – Overrides the data type of the result, defaults to None

Returns:

Tensor of fill_value with the same shape and type as a.

Return type:

TensorSSA

See also

empty_like(): Return an empty array with shape and type of input.

ones_like(): Return an array of ones with shape and type of input.

zeros_like(): Return an array of zeros with shape and type of input.

full(): Return a new array of given shape filled with value.

Examples

frg = cute.make_fragment((2, 3), cutlass.Float32)
a = frg.load()
b = cute.full_like(a, 1.0)
cutlass.cute.empty_like(a, dtype=None)#

Return a new TensorSSA with the same shape and type as a given array, without initializing entries.

Parameters:
  • a (TensorSSA) – The shape and data-type of a define these same attributes of the returned array.

  • dtype (Type[Numeric], optional) – Overrides the data type of the result, defaults to None

Returns:

Uninitialized tensor with the same shape and type (unless overridden) as a.

Return type:

TensorSSA

cutlass.cute.ones_like(a, dtype=None)#

Return a TensorSSA of ones with the same shape and type as a given array.

Parameters:
  • a (TensorSSA) – The shape and data-type of a define these same attributes of the returned array.

  • dtype (Type[Numeric], optional) – Overrides the data type of the result, defaults to None

Returns:

Tensor of ones with the same shape and type (unless overridden) as a.

Return type:

TensorSSA

cutlass.cute.zeros_like(a, dtype=None, *, loc=None, ip=None)#

Return a TensorSSA of zeros with the same shape and type as a given array.

Parameters:
  • a (TensorSSA) – The shape and data-type of a define these same attributes of the returned array.

  • dtype (Type[Numeric], optional) – Overrides the data type of the result, defaults to None

Returns:

Tensor of zeros with the same shape and type (unless overridden) as a.

Return type:

TensorSSA
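
Example (a minimal sketch covering the *_like family):

a = cute.full((2, 3), 7.0, cutlass.Float32)
z = cute.zeros_like(a)                # zeros, same shape and dtype
o = cute.ones_like(a, cutlass.Int32)  # ones, dtype overridden
u = cute.empty_like(a)                # uninitialized, same shape and dtype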

cutlass.cute.where(
cond: TensorSSA,
x: TensorSSA,
y: TensorSSA,
*,
loc=None,
ip=None,
) TensorSSA#

Return elements chosen from x or y depending on condition.

Parameters:
  • cond (TensorSSA) – Where True, yield x, where False, yield y.

  • x (TensorSSA) – Values from which to choose when condition is True.

  • y (TensorSSA) – Values from which to choose when condition is False.

Returns:

A tensor with elements from x where condition is True, and elements from y where condition is False.

Return type:

TensorSSA
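
Example (a minimal sketch built from the documented full signature):

cond = cute.full((2, 2), True, cutlass.Boolean)
x = cute.full((2, 2), 1.0, cutlass.Float32)
y = cute.full((2, 2), 0.0, cutlass.Float32)
z = cute.where(cond, x, y)  # every element comes from x here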

cutlass.cute.any_(
x: TensorSSA,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Boolean#

Test whether any tensor element evaluates to True.

Parameters:

x (TensorSSA) – Input tensor.

Returns:

Returns a Boolean scalar containing True if any element of x is True, False otherwise.

Return type:

Boolean

cutlass.cute.all_(
x: TensorSSA,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Boolean#

Test whether all tensor elements evaluate to True.

Parameters:

x (TensorSSA) – Input tensor.

Returns:

Returns a Boolean scalar containing True if all elements of x are True, False otherwise.

Return type:

Boolean
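
Example (a minimal sketch for both reductions):

mask = cute.full((4,), True, cutlass.Boolean)
cute.all_(mask)  # True: every element is True
cute.any_(mask)  # True: at least one element is True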

cutlass.cute.repeat_like(x, target)#

Creates an object congruent to target and filled with x.

This function recursively creates a nested tuple structure that matches the structure of the target, with each leaf node filled with the value x.

Parameters:
  • x (Any) – The value to fill the resulting structure with

  • target (Union[tuple, Any]) – The structure to mimic

Returns:

A structure matching target but filled with x

Return type:

Union[tuple, Any]

Examples:

repeat_like(0, (1, 2, 3))      # Returns (0, 0, 0)
repeat_like(1, ((1, 2), 3))    # Returns ((1, 1), 1)
repeat_like(2, 5)              # Returns 2
class cutlass.cute.ScaledBasis(value, mode)#

Bases: object

A class representing a scaled basis element in CuTe’s layout algebra.

ScaledBasis is used to represent elements in the layout algebra, particularly in the context of composition operations. It consists of a value (scale) and a mode that identifies the basis element.

Parameters:
  • value (Union[int, Integer, Ratio, ir.Value]) – The scale value

  • mode (Union[int, List[int]]) – The mode identifying the basis element

Raises:

TypeError – If mode is not an integer or list of integers

Examples:

# Create a scaled basis with integer scale and mode
sb1 = ScaledBasis(2, 0)  # 2 * E(0)

# Create a scaled basis with a Ratio scale
sb2 = ScaledBasis(Ratio(1, 2), 1)  # (1/2) * E(1)

# Create a scaled basis with a list of modes
sb3 = ScaledBasis(4, [0, 1])  # 4 * E([0, 1])

# Scaled basis elements are commonly used in layout strides
layout = make_layout((4, 8), stride=(ScaledBasis(1, 0), ScaledBasis(1, 1)))

# This creates a layout with strides (1@0, 1@1) representing
# a coordinate system where each dimension has its own basis
__init__(value, mode) None#
is_static() bool#

Check if the value is statically known.

Returns:

True if the value is not a dynamic expression

Return type:

bool

to(dtype)#

Convert to another type.

Parameters:

dtype (type) – The target type for conversion

Returns:

The ScaledBasis converted to the specified type

Raises:

TypeError – If conversion to the specified type is not supported

property value#

Get the scale value.

Returns:

The scale value

property mode: List[int]#

Get the mode identifying the basis element.

Returns:

The mode as a list of integers

Return type:

List[int]

class cutlass.cute.struct(cls)#

Bases: object

Decorator to abstract a C-style structure in the Python DSL.

Usage:

# Supports base_dsl scalar int/float elements, array and nested struct:
@cute.struct
class complex:
    real : cutlass.Float32
    imag : cutlass.Float32
@cute.struct
class StorageA:
    mbarA : cute.struct.MemRange[cutlass.Int64, stage]
    compA : complex
    intA : cutlass.Int16

# Supports alignment for its elements:
@cute.struct
class StorageB:
    a: cute.struct.Align[
        cute.struct.MemRange[cutlass.Float32, size_a], 1024
    ]
    b: cute.struct.Align[
        cute.struct.MemRange[cutlass.Float32, size_b], 1024
    ]
    x: cute.struct.Align[cutlass.Int32, 16]
    compA: cute.struct.Align[complex, 16]

# Statically get size and alignment:
size = StorageB.__sizeof__()
align = StorageB.__alignof__()

# Allocate and referencing elements:
storage = allocator.allocate(StorageB)

storage.a[0] ...
storage.x ...
storage.compA.real ...
Parameters:

cls – The struct class with annotations.

Returns:

The decorated struct class.

class _MemRangeMeta(name, bases, dct)#

Bases: type

A metaclass for creating MemRange classes.

This metaclass is used to dynamically create MemRange classes with specific data types and sizes.

Variables:
  • _dtype – The data type of the MemRange.

  • _size – The size of the MemRange.

_dtype = None#
_size = None#
property size#
property elem_width#
property size_in_bytes#
class MemRange#

Bases: object

Defines a range of memory by MemRange[T, size].

class _MemRangeData(dtype, size, base)#

Bases: object

Represents a range of memory.

Parameters:
  • dtype – The data type.

  • size – The size of the memory range in bytes.

  • base – The base address of the memory range.

__init__(dtype, size, base)#

Initializes a new memory range.

Parameters:
  • dtype – The data type.

  • size – The size of the memory range in bytes.

  • base – The base address of the memory range.

data_ptr()#

Returns a pointer to the start of the data in this memory range.

Returns:

A pointer to the start of the memory range.

Raises:

AssertionError – If the size of the memory range is not greater than zero.

get_tensor(layout, swizzle=None, dtype=None)#

Creates a tensor from the memory range.

Parameters:
  • layout – The layout of the tensor.

  • swizzle – Optional swizzle pattern.

  • dtype – Optional data type; defaults to the memory range’s data type if not specified.

Returns:

A tensor representing the memory range.

Raises:
  • TypeError – If the layout is incompatible with the swizzle.

  • AssertionError – If the size of the memory range is not greater than zero.
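
Example (an illustrative sketch; storage and size_a refer to the StorageB usage shown above, and the layout size is assumed to match the MemRange's element count):

layout = cute.make_layout(size_a)
tensor_a = storage.a.get_tensor(layout)
ptr = storage.a.data_ptr()  # pointer to the start of the range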

class _AlignMeta(name, bases, dct)#

Bases: type

Aligns the given object by setting its alignment attribute.

Parameters:
  • v – The object to align. Must be a struct, MemRange, or a scalar type.

  • align – The alignment value to set.

Returns:

A copy of the object with the specified alignment.

Raises:

TypeError – If the object is not a struct, MemRange, or a scalar type.

class Align#

Bases: object

Aligns the given type by Align[T, alignment].

static _is_scalar_type(dtype)#

Checks if the given type is a scalar numeric type.

Parameters:

dtype – The type to check.

Returns:

True if the type is a subclass of Numeric, False otherwise.

__init__(cls)#

Initializes a new struct decorator instance.

Parameters:

cls – The class representing the structured data type.

Raises:

TypeError – If the struct is empty.

size_in_bytes() int#

Returns the size of the struct in bytes.

Returns:

The size of the struct.