cutlass.cute#

class cutlass.cute.ComposedLayout(*args: Any, **kwargs: Any)#

Bases: Value

ComposedLayout represents the functional composition of layouts in CuTe.

A ComposedLayout is formed by the composition of three components: inner o offset o outer, where:

  • inner: The inner layout or swizzle that is applied last

  • offset: An integer tuple representing a coordinate offset

  • outer: The outer layout that is applied first

ComposedLayout implements the functional composition operation where: R(c) := (inner o offset o outer)(c) := inner(offset + outer(c))

This composition allows for complex transformations of coordinates and indices, enabling operations like tiling, partitioning, and reshaping of data.

Variables:
  • inner – The inner layout or swizzle component

  • offset – The coordinate offset applied between inner and outer layouts

  • outer – The outer layout component

  • max_alignment – The maximum alignment of the composed layout

Examples:

# Create a composed layout with inner layout, offset, and outer layout

# inner layout: (4, 8):(1, 4)
inner_layout = make_layout((4, 8))

offset = (0, 0)

# outer layout: (2, 2):(1@0, 1@1)
outer_layout = make_layout((2, 2), stride=(1 * E(0), 1 * E(1)))

# composed layout: (inner o offset o outer)
composed = make_composed_layout(inner_layout, offset, outer_layout)

# Accessing components of the composed layout
inner = composed.inner
offset = composed.offset
outer = composed.outer

# map coordinate (1, 2) to linear index
#  - outer(1, 2) = (1, 2)
#  - offset + outer(1, 2) = (1, 2)
#  - inner(1, 2) = 1 * 1 + 2 * 4 = 9
idx = crd2idx((1, 2), composed)

# Composition is used in many tiling operations
# For example, in logical_product, raked_product, and blocked_product
__init__(*args: Any, **kwargs: Any) None#
class cutlass.cute.Swizzle(*args: Any, **kwargs: Any)#

Bases: Value

Swizzle is a transformation that permutes the elements of a layout.

Swizzles are used to rearrange data elements to improve memory access patterns and computational efficiency.

Swizzle is defined by three parameters:

  • MBase: The number of least-significant bits to keep constant

  • BBits: The number of bits in the mask

  • SShift: The distance to shift the mask

The mask is applied to the least-significant bits of the layout.

0bxxxxxxxxxxxxxxxYYYxxxxxxxZZZxxxx
                              ^--^ MBase is the number of least-sig bits to keep constant
                 ^-^       ^-^     BBits is the number of bits in the mask
                   ^---------^     SShift is the distance to shift the YYY mask
                                      (pos shifts YYY to the right, neg shifts YYY to the left)

e.g. Given
0bxxxxxxxxxxxxxxxxYYxxxxxxxxxZZxxx
the result is
0bxxxxxxxxxxxxxxxxYYxxxxxxxxxAAxxx where AA = ZZ xor YY
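
As an illustrative sketch, the bit manipulation above can be expressed in plain Python (the BBits/MBase/SShift values below are arbitrary example parameters, not library defaults):

# Sketch of the swizzle bit arithmetic for a positive shift
def apply_swizzle(offset, bbits=2, mbase=3, sshift=3):
    # The YY mask covers bbits bits located sshift bits above the ZZ mask,
    # which starts at bit mbase
    yy_mask = ((1 << bbits) - 1) << (mbase + sshift)
    # XOR the YY bits into the ZZ position: AA = ZZ xor YY
    return offset ^ ((offset & yy_mask) >> sshift)
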
cutlass.cute.E(mode: int | List[int]) ScaledBasis#

Create a unit ScaledBasis element with the specified mode.

This function creates a ScaledBasis with value 1 and the given mode. The mode represents the coordinate axis or dimension in the layout.

Parameters:

mode (Union[int, List[int]]) – The mode (dimension) for the basis element, either a single integer or a list of integers

Returns:

A ScaledBasis with value 1 and the specified mode

Return type:

ScaledBasis

Raises:

TypeError – If mode is not an integer or a list

Examples:

# Create a basis element for the first dimension (mode 0)
e0 = E(0)

# Create a basis element for the second dimension (mode 1)
e1 = E(1)

# Create a basis element for a hierarchical dimension
e_hier = E([0, 1])
class cutlass.cute.Atom(op: Op, trait: Trait)#

Bases: ABC

Atom base class.

An Atom is the composition of

  • a MMA or Copy Operation;

  • an internal MMA or Copy Trait.

An Operation is a pure Python class that is used to model a specific MMA or Copy instruction. The Trait wraps the underlying IR Value and provides access to the metadata of the instruction encoded using CuTe Layouts. When the Trait can be constructed straightforwardly from an Operation, the make_mma_atom or make_copy_atom API should be used. There are cases where constructing the metadata is not trivial and requires more information, for example to determine the number of bytes copied per TMA instruction (“the TMA vector length”). In such cases, dedicated helper functions are provided with an appropriate API such that the Atom is constructed internally in an optimal fashion for the user.

__init__(
op: Op,
trait: Trait,
) None#
property op: Op#
property type#
set(modifier, value, *, loc=None, ip=None) None#

Sets runtime fields of the Atom.

Some Atoms have runtime state, for example a tcgen05 MMA Atom

tiled_mma = cute.make_tiled_mma(some_tcgen05_mma_op)
tiled_mma.set(cute.nvgpu.tcgen05.Field.ACCUMULATE, True)

The set method provides a way for the user to modify such runtime state. Modifiable fields are provided by arch-specific enumerations, for example tcgen05.Field. The Atom instance internally validates both the field and the value provided by the user.

_unpack(*, loc=None, ip=None, **kwargs) cutlass._mlir.ir.Value#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.MmaAtom(op: Op, trait: Trait)#

Bases: Atom

The MMA Atom class.

property thr_id: cutlass.cute.typing.Layout#
property shape_mnk: cutlass.cute.typing.Shape#
property tv_layout_A: cutlass.cute.typing.Layout#
property tv_layout_B: cutlass.cute.typing.Layout#
property tv_layout_C: cutlass.cute.typing.Layout#
make_fragment_A(input, *, loc=None, ip=None)#
make_fragment_B(input, *, loc=None, ip=None)#
make_fragment_C(input, *, loc=None, ip=None)#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.CopyAtom(op: Op, trait: Trait)#

Bases: Atom

The Copy Atom class.

property value_type: Type[cutlass.cute.typing.Numeric]#
property thr_id: cutlass.cute.typing.Layout#
property layout_src_tv: cutlass.cute.typing.Layout#
property layout_dst_tv: cutlass.cute.typing.Layout#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.TiledCopy(op: Op, trait: Trait)#

Bases: CopyAtom

The tiled Copy class.

property layout_tv_tiled: cutlass.cute.typing.Layout#
property tiler_mn: cutlass.cute.typing.Tile#
property layout_src_tv_tiled: cutlass.cute.typing.Layout#
property layout_dst_tv_tiled: cutlass.cute.typing.Layout#
property size: int#
get_slice(
thr_idx: int | cutlass.cute.typing.Int32,
) ThrCopy#
retile(src, *, loc=None, ip=None)#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.TiledMma(op: Op, trait: Trait)#

Bases: MmaAtom

The tiled MMA class.

property tv_layout_A_tiled: cutlass.cute.typing.Layout#
property tv_layout_B_tiled: cutlass.cute.typing.Layout#
property tv_layout_C_tiled: cutlass.cute.typing.Layout#
property permutation_mnk: cutlass.cute.typing.Tile#
property thr_layout_vmnk: cutlass.cute.typing.Layout#
property size: int#
get_tile_size(mode_idx: int) cutlass.cute.typing.Shape#
get_slice(
thr_idx: int | cutlass.cute.typing.Int32,
) ThrMma#
_partition_shape(operand_id, shape, *, loc=None, ip=None)#
partition_shape_A(shape_mk, *, loc=None, ip=None)#
partition_shape_B(shape_nk, *, loc=None, ip=None)#
partition_shape_C(shape_mn, *, loc=None, ip=None)#
_thrfrg(
operand_id,
input: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#
_thrfrg(
operand_id,
input: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor
_thrfrg_A(
input: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor#
_thrfrg_B(
input: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor#
_thrfrg_C(
input: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor#
_abc_impl = <_abc._abc_data object>#
class cutlass.cute.TensorSSA(*args: Any, **kwargs: Any)#

Bases: ArithValue

A class representing thread-local data from a CuTe Tensor, with value semantics and immutability.

Parameters:
  • value (ir.Value) – Flattened vector as an ir.Value holding the logical data of the SSA tensor

  • shape (Shape) – The nested shape in CuTe of the vector

  • dtype (Type[Numeric]) – Data type of the tensor elements

Variables:
  • _shape – The nested shape in CuTe of the vector

  • _dtype – Data type of the tensor elements

Raises:

ValueError – If shape is not static

__init__(
value,
shape: cutlass.cute.typing.Shape,
dtype: Type[cutlass.cute.typing.Numeric],
)#

Initialize a new TensorSSA object.

Parameters:
  • value (ir.Value) – Flattened vector as an ir.Value holding the logical data of the SSA tensor

  • shape (Shape) – The nested shape in CuTe of the vector

  • dtype (Type[Numeric]) – Data type of the tensor elements

Raises:

ValueError – If shape is not static

property dtype: Type[cutlass.cute.typing.Numeric]#
property element_type: Type[cutlass.cute.typing.Numeric]#
property shape#
_apply_op(
op,
other: TensorSSA,
flip,
*,
loc,
ip,
) TensorSSA#
_apply_op(
op,
other: cutlass.cutlass_dsl.cutlass_arith.ArithValue,
flip,
*,
loc,
ip,
) TensorSSA
_apply_op(
op,
other: int | float | bool,
flip,
*,
loc,
ip,
) TensorSSA
_flatten_shape_and_coord(crd, *, loc=None, ip=None)#
_build_result(res_vect, res_shp, *, loc=None, ip=None)#
to(
dtype: Type[cutlass.cute.typing.Numeric],
*,
loc=None,
ip=None,
)#

Convert the tensor to a different numeric type.

Parameters:

dtype (Type[Numeric]) – The target numeric type to cast to.

Returns:

A new tensor with the same shape but with elements cast to the target type.

Return type:

TensorSSA

Raises:
  • TypeError – If dtype is not a subclass of Numeric.

  • NotImplementedError – If dtype is an unsigned integer type.

ir_value(*, loc=None, ip=None)#
reduce(
op,
init_val,
reduction_profile: cutlass.cute.typing.Coord,
*,
loc=None,
ip=None,
)#

Perform reduce on selected modes with given predefined reduction op.

Parameters:
  • op (operator) – The reduction operator to use (operator.add or operator.mul)

  • init_val (numeric) – The initial value for the reduction

  • reduction_profile (Coord) – Specifies which dimensions to reduce. Dimensions marked with ‘_’ are kept.

Returns:

The reduced tensor

Return type:

TensorSSA

Examples

reduce(f32 o (4,))

=> f32

reduce(f32 o (4, 5))

=> f32

reduce(f32 o (4, (5, 4)), reduction_profile=(_, 1))

=> f32 o (4,)

reduce(f32 o (4, (5, 4)), reduction_profile=(_, (_, 1)))

=> f32 o (4, (5,))

cutlass.cute.assume(src, divby=None, *, loc=None, ip=None)#
cutlass.cute.is_integer(a) bool#

Check if an object is a static or dynamic integer

cutlass.cute.is_int_tuple(a) bool#
cutlass.cute.is_static(
x: cutlass._mlir.ir.Type | cutlass._mlir.ir.Value | cutlass.cute.typing.XTuple,
) bool#

Check if a value is statically known at compile time.

In CuTe, static values are those whose values are known at compile time, as opposed to dynamic values which are only known at runtime.

Parameters:

x (Union[ir.Type, ir.Value, XTuple]) – The value to check

Returns:

True if the value is static, False otherwise

Return type:

bool

Raises:

TypeError – If an unsupported type is provided

cutlass.cute.size(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Shape | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
mode: List[int] = [],
*,
loc=None,
ip=None,
) cutlass.cute.typing.Int#

Return size of domain of layout or tensor.

Computes the size (number of elements) in the domain of a layout or tensor. For layouts, this corresponds to the shape of the coordinate space. See Section 3.2 in the CuTe Whitepaper for more details on layout domains.

Parameters:
  • a (IntTuple, Shape, Layout, ComposedLayout or Tensor) – The input object whose size to compute

  • mode (list of int, optional) – List of mode(s) for size calculation. If empty, computes total size, defaults to []

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

Static size of layout or tensor if static, otherwise a Value

Return type:

int or Value

Raises:

ValueError – If mode contains non-integer elements
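
Examples:

# Illustrative sketch: total size of a hierarchical layout
layout = make_layout((4, (3, 2)))
n = size(layout)              # 24 = 4 * 3 * 2

# Size of the second mode only
n1 = size(layout, mode=[1])   # 6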

cutlass.cute.has_underscore(a: cutlass.cute.typing.XTuple) bool#
cutlass.cute.slice_(src, coord: cutlass.cute.typing.Coord, *, loc=None, ip=None)#

Perform a slice operation on a source object using the given coordinate.

This function implements CuTe’s slicing operation which extracts a subset of elements from a source object (tensor, layout, etc.) based on a coordinate pattern. The slice operation preserves the structure of the source while selecting specific elements.

Parameters:
  • src (Union[Tensor, Layout, IntTuple, Value]) – Source object to be sliced (tensor, layout, tuple, etc.)

  • coord (Coord) – Coordinate pattern specifying which elements to select

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new object containing the sliced elements

Return type:

Union[Tensor, Layout, IntTuple, tuple]

Raises:

ValueError – If the coordinate pattern is incompatible with source

Examples:

# Layout slicing
layout = make_layout((4,4))

# Select 1st index of first mode and keep all elements in second mode
sub_layout = slice_(layout, (1, None))
# Basic tensor slicing
tensor = make_tensor(...)           # Create a 2D tensor

# Select 1st index of first mode and keep all elements in second mode
sliced = slice_(tensor, (1, None))
# Select 2nd index of second mode and keep all elements in first mode
sliced = slice_(tensor, (None, 2))

Note

  • None represents keeping all elements in that mode

  • Slicing preserves the layout/structure of the original object

  • Can be used for:
      - Extracting sub-tensors/sub-layouts
      - Creating views into data
      - Selecting specific patterns of elements

cutlass.cute.depth(
a: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
) int#

Returns the depth (nesting level) of a tuple, layout, or tensor.

The depth of a tuple is the maximum depth of its elements plus 1. For an empty tuple, the depth is 1. For layouts and tensors, the depth is determined by the depth of their shape. For non-tuple values (e.g., integers), the depth is considered 0.

Parameters:

a (Union[XTuple, Layout, ComposedLayout, Tensor, Any]) – The object whose depth is to be determined

Returns:

The depth of the input object

Return type:

int

Example:

>>> depth(1)
0
>>> depth((1, 2))
1
>>> depth(((1, 2), (3, 4)))
2
cutlass.cute.rank(
a: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
) int#

Returns the rank (dimensionality) of a tuple, layout, or tensor.

The rank of a tuple is its length. For layouts and tensors, the rank is determined by the rank of their shape. For non-tuple values (e.g., integers), the rank is considered 1 for convenience.

Parameters:

a (Union[XTuple, Layout, ComposedLayout, Tensor, Any]) – The object whose rank is to be determined

Returns:

The rank of the input object

Return type:

int

This function is used in layout algebra to determine the dimensionality of tensors and layouts for operations like slicing and evaluation.
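
Example:

>>> rank((2, 3))
2
>>> rank(((1, 2), 3))
2
>>> rank(4)
1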

cutlass.cute.printf(*args, loc=None, ip=None) None#

Print a value or a list of values.

Parameters:
  • args (list) – List of values to print

  • loc (source location, optional) – Source location where it’s called, defaults to None

  • ip (insertion pointer, optional) – Insertion pointer, defaults to None

Raises:

ValueError – If no arguments are provided or if an unsupported argument type is passed

cutlass.cute.print_tensor(
tensor: cutlass.cute.typing.Tensor,
*,
verbose: bool = False,
loc=None,
ip=None,
)#

Print content of the tensor in human readable format.

tensor(raw_ptr<@…, Float32, generic, align(4)> o (8,5):(5,1), data=
       [[-0.4326, -0.5434,  0.1238,  0.7132,  0.8042],
        [-0.8462,  0.9871,  0.4389,  0.7298,  0.6948],
        [ 0.3426,  0.5856,  0.1541,  0.2923,  0.6976],
        [-0.1649,  0.8811,  0.1788,  0.1404,  0.2568],
        [-0.2944,  0.8593,  0.4171,  0.8998,  0.1766],
        [ 0.8814,  0.7919,  0.7390,  0.4566,  0.1576],
        [ 0.9159,  0.7577,  0.6918,  0.0754,  0.0591],
        [ 0.6551,  0.1626,  0.1189,  0.0292,  0.8655]])

cutlass.cute.pretty_str(arg) str#

Constructs a concise readable pretty string.

cutlass.cute.make_layout(
shape: cutlass.cute.typing.Shape,
*,
stride: cutlass.cute.typing.Stride | None = None,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#

Create a CuTe Layout object from shape and optional stride information.

A Layout in CuTe represents the mapping between logical and physical coordinates of a tensor. This function creates a Layout object that defines how tensor elements are arranged in memory.

Parameters:
  • shape (Shape) – Shape of the layout defining the size of each mode

  • stride (Union[Stride, None]) – Optional stride values for each mode, defaults to None

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new Layout object with the specified shape and stride

Return type:

Layout

Examples:

# Create a 2D compact left-most layout with shape (4,4)
layout = make_layout((4,4))                     # compact left-most layout

# Create a left-most layout with custom strides
layout = make_layout((4,4), stride=(1,4))       # left-most layout with strides (1,4)

# Create a layout for a 3D tensor
layout = make_layout((32,16,8))                 # left-most layout

# Create a layout with custom strides
layout = make_layout((2,2,2), stride=(4,1,2))   # layout with strides (4,1,2)

Note

  • If stride is not provided, a default compact left-most stride is computed based on the shape

  • The resulting layout maps logical coordinates to physical memory locations

  • The layout object can be used for tensor creation and memory access patterns

  • Strides can be used to implement:
      - Row-major vs column-major layouts
      - Padding and alignment
      - Blocked/tiled memory arrangements
      - Interleaved data formats

  • Stride is a keyword-only argument to improve readability, e.g.:
      - make_layout((3,4), (1,4)) can be confused with make_layout(((3,4), (1,4)))
      - make_layout((3,4), stride=(1,4)) is more readable

cutlass.cute.recast_layout(
new_type_bits,
old_type_bits,
src_layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.make_identity_layout(
shape: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#

Create an identity layout with the given shape.

An identity layout maps logical coordinates directly to themselves without any transformation. This is equivalent to a layout with stride (1@0,1@1,…,1@(N-1)).

Parameters:
  • shape (Shape) – The shape of the layout

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new identity Layout object with the specified shape

Return type:

Layout

Examples:

# Create a 2D identity layout with shape (4,4)
layout = make_identity_layout((4,4))     # stride=(1@0,1@1)

# Create a 3D identity layout
layout = make_identity_layout((32,16,8)) # stride=(1@0,1@1,1@2)

Note

  • An identity layout is a special case where each coordinate maps to itself

  • Useful for direct coordinate mapping without any transformation

cutlass.cute.make_ordered_layout(
shape: cutlass.cute.typing.Shape,
order: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#

Create a layout with a specific ordering of dimensions.

This function creates a layout where the dimensions are ordered according to the specified order parameter, allowing for custom dimension ordering in the layout.

Parameters:
  • shape (Shape) – The shape of the layout

  • order (Shape) – The ordering of dimensions

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new Layout object with the specified shape and dimension ordering

Return type:

Layout

Examples:

# Create a row-major layout
layout = make_ordered_layout((4,4), order=(1,0))

# Create a column-major layout
layout = make_ordered_layout((4,4), order=(0,1))         # stride=(1,4)

# Create a layout with custom dimension ordering for a 3D tensor
layout = make_ordered_layout((32,16,8), order=(2,0,1))   # stride=(128,1,16)

Note

  • The order parameter specifies the ordering of dimensions from fastest-varying to slowest-varying

  • For a 2D tensor, (0,1) creates a column-major layout, while (1,0) creates a row-major layout

  • The length of order must match the rank of the shape

cutlass.cute.make_composed_layout(
inner,
offset: cutlass.cute.typing.IntTuple,
outer: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) cutlass._mlir.ir.register_value_caster#

Create a composed layout by composing an inner transformation with an outer layout.

As described in the CuTe whitepaper, a composed layout applies a sequence of transformations to coordinates. The composition is defined as (inner ∘ offset ∘ outer), where the operations are applied from right to left.

Parameters:
  • inner (Union[Layout, Swizzle]) – The inner transformation (can be a Layout or Swizzle)

  • offset (IntTuple) – An integral offset applied between transformations

  • outer (Layout) – The outer (right-most) layout that is applied first

  • loc (Optional[Location]) – Source location information, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for IR generation, defaults to None

Returns:

A new ComposedLayout representing the composition

Return type:

ComposedLayout

Examples:

# Create a basic layout
inner = make_layout(...)
outer = make_layout((4,4), stride=(E(0), E(1)))

# Create a composed layout with an offset
composed = make_composed_layout(inner, (2,0), outer)

Note

  • The composition applies transformations in the order: outer → offset → inner

  • The stride divisibility condition must be satisfied for valid composition

  • Certain compositions (like Swizzle with scaled basis) are invalid and will raise errors

  • Composed layouts inherit many properties from the outer layout

cutlass.cute.make_layout_tv(
thr_layout: cutlass.cute.typing.Layout,
val_layout: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) Tuple[cutlass.cute.typing.Shape, cutlass.cute.typing.Layout]#

Create a TV layout and logical tile size given separate thread and value layouts. A TV partitioner is inferred from the inputs. The input thread layout must be compact.

Parameters:
  • thr_layout (Layout) – the mn -> tid mapping; must be compact

  • val_layout (Layout) – the mn -> vid mapping

  • loc (source location for MLIR, optional)

  • ip (insertion point, optional)

Returns:

  • layout_mn – logical tile size

  • layout_tv – thread-value layout (tid, vid) -> mn
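
Example (a minimal sketch; the concrete layouts are illustrative):

# 32 threads arranged over a 4x8 tile, each owning a 2x1 block of values
thr_layout = make_layout((4, 8))
val_layout = make_layout((2, 1))
tiler_mn, layout_tv = make_layout_tv(thr_layout, val_layout)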

cutlass.cute.make_layout_image_mask(
lay: cutlass.cute.typing.Layout,
coord: cutlass.cute.typing.Coord,
mode: int,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Int16#

Makes a 16-bit integer mask of the image of a layout sliced at a given mode, accounting for the offset induced by the input coordinate on the other modes.

cutlass.cute.make_ptr(
dtype: Type[cutlass.cute.typing.Numeric] | None,
value,
mem_space: cutlass.cute.typing.AddressSpace = cutlass.cute.typing.AddressSpace.generic,
*,
assumed_align=None,
loc=None,
ip=None,
) cutlass.cute.typing.Pointer#
cutlass.cute.make_tensor(
iterator,
layout: cutlass.cute.typing.Shape | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#

Creates a tensor by composing an engine (iterator/pointer) with a layout.

A tensor is defined as T = E ∘ L, where E is an engine (array, pointer, or counting iterator) and L is a layout that maps logical coordinates to physical offsets. The tensor evaluates coordinates by applying the layout mapping and dereferencing the engine at the resulting offset.

Parameters:
  • iterator (Union[Pointer, IntTuple]) – Engine component (pointer, iterator, or counting iterator) that provides data access capabilities

  • layout (Union[Shape, Layout, ComposedLayout]) – Layout component that defines the mapping from logical coordinates to physical offsets

  • loc (Optional[Location]) – Source location for MLIR operation tracking, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for MLIR operation, defaults to None

Returns:

A tensor object representing the composition E ∘ L

Return type:

Tensor

Raises:

ValueError – If iterator type is not supported

Examples:

# Create a tensor with row-major layout
layout = make_layout((64, 128), stride=(128, 1))
tensor = make_tensor(ptr, layout)

# Create a tensor with hierarchical layout
layout = make_layout(((128, 8), (1, 4, 1)), stride=((32, 1), (0, 8, 4096)))
tensor = make_tensor(smem_ptr, layout)

# Create a counting tensor
layout = make_layout(2, stride=16 * E(0))
tensor = make_tensor(5, layout)

Notes

  • The engine (iterator) must support random access operations

  • Common engine types include raw pointers, arrays, and random-access iterators

  • The layout defines both the shape (logical dimensions) and stride (physical mapping)

  • Supports both direct coordinate evaluation T(c) and partial evaluation (slicing)

cutlass.cute.make_identity_tensor(
shape: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#

Creates an identity tensor with the given shape.

An identity tensor maps each coordinate to itself, effectively creating a counting sequence within the shape’s bounds. This is useful for generating coordinate indices or creating reference tensors for layout transformations.

Parameters:
  • shape (Shape) – The shape defining the tensor’s dimensions. Can be a simple integer sequence or a hierarchical structure ((m,n),(p,q))

  • loc (Optional[Location]) – Source location for MLIR operation tracking, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point for MLIR operation, defaults to None

Returns:

A tensor that maps each coordinate to itself

Return type:

Tensor

Examples:

# Create a simple 1D counting tensor
tensor = make_identity_tensor(6)  # [0,1,2,3,4,5]

# Create a 2D counting tensor
tensor = make_identity_tensor((3,2))  # [(0,0),(1,0),(2,0),(0,1),(1,1),(2,1)]

# Create hierarchical counting tensor
tensor = make_identity_tensor(((2,1),3))
# [((0,0),0),((1,0),0),((0,0),1),((1,0),1),((0,0),2),((1,0),2)]

Notes

  • The shape parameter follows CuTe’s IntTuple concept

  • Coordinates are ordered colexicographically

  • Useful for generating reference coordinates in layout transformations

cutlass.cute.make_fragment(
layout_or_shape: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape,
dtype: Type[cutlass.cute.typing.Numeric],
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
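
Example (a minimal sketch):

# Allocate a register-backed tensor for a (4, 2) fragment of Float32
frg = make_fragment((4, 2), cutlass.Float32)
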
cutlass.cute.make_fragment_like(src, dtype=None, *, loc=None, ip=None)#

Create a tensor on the stack with a compact layout matching the shape of the source.

Given a tensor, this function creates a fragment tensor with a compact layout of the same shape; given a layout, it creates a new layout with the same shape. The strides of the new layout follow the order induced by the source’s strides, with special handling of the 0th mode: it is always stride-1 and generated in column-major order (LayoutLeft).

Parameters:
  • src (Union[Layout, ComposedLayout, Tensor]) – The source layout or tensor whose shape will be matched

  • dtype (Type[Numeric], optional) – The element type for the fragment tensor, defaults to None

  • loc (Location, optional) – Source location for MLIR operations, defaults to None

  • ip (InsertionPoint, optional) – Insertion point for MLIR operations, defaults to None

Returns:

A new layout or fragment tensor with matching shape

Return type:

Union[Layout, Tensor]

Examples

Creating a rmem tensor from a tensor:

smem_tensor = cute.make_tensor(smem_ptr, layout)
frag_tensor = cute.make_fragment_like(smem_tensor, cutlass.Float32)
# frag_tensor will be a register-backed tensor with the same shape

Creating a fragment with a different element type:

tensor = cute.make_tensor(gmem_ptr, layout)
bool_frag = cute.make_fragment_like(tensor, cutlass.Boolean)
# bool_frag will be a register-backed tensor with Boolean elements

Notes

  • When used with a Tensor, if a type is provided, it will create a new fragment tensor with that element type.

  • For layouts with ScaledBasis strides, the function creates a fragment from the shape only.

  • This function is commonly used in GEMM and other tensor operations to create register storage for intermediate results.

cutlass.cute.recast_ptr(
ptr: cutlass.cute.typing.Pointer,
swizzle_=None,
dtype: Type[cutlass.cute.typing.Numeric] | None = None,
loc=None,
ip=None,
) cutlass.cute.typing.Pointer#
cutlass.cute.recast_tensor(
src: cutlass.cute.typing.Tensor,
dtype: Type[cutlass.cute.typing.Numeric],
swizzle_=None,
*,
loc=None,
ip=None,
)#
cutlass.cute.get(input, mode: List[int], *, loc=None, ip=None)#

Extract a specific element or sub-layout from a layout or tuple.

This function recursively traverses the input according to the mode indices, extracting the element at the specified path. For layouts, this operation corresponds to extracting a specific sub-layout.

Parameters:
  • input (Layout, ComposedLayout, tuple) – The input layout or tuple to extract from

  • mode (List[int]) – Indices specifying the path to traverse for extraction

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The extracted element or sub-layout

Return type:

Layout, ComposedLayout, or element type

Raises:
  • ValueError – If any index in mode is out of range

  • TypeError – If mode contains non-integer elements or if input has unsupported type

Examples:

For a layout like ((4,8),2):((16,1),8), get with mode=[0,1] would extract the sub-layout 8:1, i.e. the second component of the first mode.

cutlass.cute.select(input, mode: List[int], *, loc=None, ip=None)#

Select modes from input.

Parameters:
  • input (Layout, ComposedLayout, tuple) – Input to select from

  • mode (List[int]) – Indices specifying which dimensions or elements to select

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

A new instance with selected dimensions/elements

Return type:

Layout, ComposedLayout, tuple

Raises:
  • ValueError – If any index in mode is out of range

  • TypeError – If the input type is invalid

cutlass.cute.front(input, *, loc=None, ip=None)#

Recursively get the first element of input.

This function traverses a hierarchical structure (like a layout or tensor) and returns the first element at the deepest level. It’s particularly useful for accessing the first stride value in a layout to determine properties like majorness.

Parameters:
  • input (Union[Tensor, Layout, Stride]) – The hierarchical structure to traverse

  • loc (source location, optional) – Source location where it’s called, defaults to None

  • ip (insertion pointer, optional) – Insertion pointer for IR generation, defaults to None

Returns:

The first element at the deepest level of the input structure

Return type:

Union[int, float, bool, ir.Value]

cutlass.cute.is_major(
mode,
stride: cutlass.cute.typing.Stride,
*,
loc=None,
ip=None,
) bool#

Check whether a mode in stride is the major mode.

cutlass.cute.find(
x: int,
t: tuple | cutlass._mlir.ir.Value | int,
*,
exclude_when: cutlass.cute.typing.IntTuple | None = None,
loc=None,
ip=None,
) int | Tuple[int, ...] | None#

Find the first position of x in t. If exclude_when is provided, positions where the comparison equals the comparison value are excluded from the search results.

Parameters:
  • x (int) – The static integer x to search for

  • t (Union[tuple, ir.Value, int]) – The search space

  • exclude_when (Optional[Tuple[int, Union[tuple, ir.Value, int]]]) – A tuple of (comparison_value, comparison) - positions where comparison equals comparison_value will be excluded from the search results

Returns:

Index if found at top level, tuple of indices showing nested position, or None if not found

Return type:

Union[int, Tuple[int, …], None]
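
Examples (a hedged sketch of expected results):

find(4, (2, (4, 8)))   # (1, 0) -- nested position of 4
find(3, (2, (4, 8)))   # None -- 3 does not appear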

cutlass.cute.coalesce(
input,
*,
target_profile: cutlass.cute.typing.Coord | None = None,
loc=None,
ip=None,
)#
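
Example (a hedged sketch using the classic coalesce result):

# Flatten the layout and combine contiguous modes
layout = make_layout((2, (1, 6)), stride=(1, (6, 2)))
result = coalesce(layout)   # expected: 12:1
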
cutlass.cute.group_modes(input, begin: int, end: int = -1, *, loc=None, ip=None)#

Group modes of a hierarchical tuple or layout into a single mode.

This function groups a range of modes from the input object into a single mode, creating a hierarchical structure. For tuples, it creates a nested tuple containing the specified range of elements. For layouts and other CuTe objects, it creates a hierarchical representation where the specified modes are grouped together.

Parameters:
  • input (Layout, ComposedLayout, tuple, Shape, Stride, etc.) – Input object to group modes from (layout, tuple, etc.)

  • begin (int) – Beginning index of the range to group (inclusive)

  • end (int) – Ending index of the range to group (exclusive)

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

A new object with the specified modes grouped

Return type:

Same type as input with modified structure

Examples:

# Group modes in a tuple
t = (2, 3, 4, 5)
grouped = group_modes(t, 1, 3)  # (2, (3, 4), 5)

# Group modes in a layout
layout = make_layout((2, 3, 4, 5))
grouped_layout = group_modes(layout, 1, 3)  # Layout with shape (2, (3, 4), 5)

# Group modes in a shape
shape = make_shape(2, 3, 4, 5)
grouped_shape = group_modes(shape, 0, 2)  # Shape ((2, 3), 4, 5)
cutlass.cute.cosize(
a: cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
mode: List[int] = [],
*,
loc=None,
ip=None,
)#

Return the size of the codomain of a layout or tensor. Returns a static value if the type is static.

Parameters:
  • a (Union[Layout, ComposedLayout, Tensor]) – Layout, ComposedLayout, or Tensor object

  • mode (List[int], optional) – List of mode(s) for cosize calculation

  • loc (optional) – Location information for diagnostics, defaults to None

  • ip (optional) – Instruction pointer for diagnostics, defaults to None

Returns:

Static cosize of the layout or tensor if static, otherwise a dynamic Value

Return type:

Union[int, Value]

cutlass.cute.size_in_bytes(
dtype: Type[cutlass.cute.typing.Numeric],
layout: cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
*,
loc=None,
ip=None,
)#

Calculate the size in bytes of a tensor based on its data type and layout.

Parameters:
  • dtype (Type[Numeric]) – The DSL numeric data type

  • layout (Layout, optional) – The layout of the elements. If None, the function returns 0

  • loc (optional) – Location information for diagnostics, defaults to None

  • ip (optional) – Instruction pointer for diagnostics, defaults to None

Returns:

The total size in bytes. Returns 0 if the layout is None

Return type:

int
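
Example:

# 64 * 32 Float16 elements -> 4096 bytes
nbytes = size_in_bytes(cutlass.Float16, make_layout((64, 32)))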

cutlass.cute.flatten_to_tuple(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Coord | cutlass.cute.typing.Shape | cutlass.cute.typing.Stride,
) tuple#

Flattens a potentially nested tuple structure into a flat tuple.

This function recursively traverses the input structure and flattens it into a single-level tuple, preserving the order of elements.

Parameters:

a (Union[IntTuple, Coord, Shape, Stride]) – The structure to flatten

Returns:

A flattened tuple containing all elements from the input

Return type:

tuple

Examples:

flatten_to_tuple((1, 2, 3))       # Returns (1, 2, 3)
flatten_to_tuple(((1, 2), 3))     # Returns (1, 2, 3)
flatten_to_tuple((1, (2, (3,))))  # Returns (1, 2, 3)
cutlass.cute.flatten(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Coord | cutlass.cute.typing.Shape | cutlass.cute.typing.Stride | cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
) tuple#

Flattens a CuTe data structure into a simpler form.

For tuples, this function flattens the structure into a single-level tuple. For non-tuple types, it returns the input unchanged.

Parameters:

a (Union[IntTuple, Coord, Shape, Stride, Layout, Tensor]) – The structure to flatten

Returns:

The flattened structure

Return type:

Union[tuple, Any]

Raises:

NotImplementedError – If input is a Layout or Tensor

Examples:

flatten((1, 2, 3))           # Returns (1, 2, 3)
flatten(((1, 2), (3, 4)))    # Returns (1, 2, 3, 4)
flatten(5)                   # Returns 5
cutlass.cute.product(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
)#

Return product of the given IntTuple or Shape.

Computes the product of all elements in the input tuple or shape. Returns static value if type is static.

Parameters:
  • a (IntTuple or Shape) – The input tuple or shape

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

Static product of IntTuple or Shape if static, otherwise a Value

Return type:

int or Value

Raises:

TypeError – If input is not an IntTuple or Shape

cutlass.cute.product_like(
a: cutlass.cute.typing.IntTuple | cutlass.cute.typing.Shape,
target_profile: cutlass.cute.typing.XTuple,
*,
loc=None,
ip=None,
)#

Return product of the given IntTuple or Shape at leaves of target_profile.

This function computes products according to the structure defined by target_profile.

Parameters:
  • a (IntTuple or Shape) – The input tuple or shape

  • target_profile (XTuple) – The profile that guides how products are computed

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The resulting tuple with products computed according to target_profile

Return type:

IntTuple or Shape

Raises:
  • TypeError – If inputs have incompatible types

  • ValueError – If inputs have incompatible shapes

cutlass.cute.product_each(a, *, loc=None, ip=None)#

Compute products for each component of the input.

Returns a rank(a) tuple result such that get(result, mode=[i]) == product(get(a, mode=[i]))

Parameters:
  • a (IntTuple or Shape) – The input tuple or shape

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

A tuple containing products for each component

Return type:

tuple

Raises:

TypeError – If input is not an IntTuple or Shape

cutlass.cute.prepend(
input,
elem,
up_to_rank: None | int = None,
*,
loc=None,
ip=None,
)#

Extend input to rank up_to_rank by prepending elem in front of input.

This function extends the input object by prepending elements to reach a desired rank. It supports various CuTe types including shapes, layouts, tensors etc.

Parameters:
  • input (Union[Shape, Stride, Coord, IntTuple, Tile, Layout, ComposedLayout, Tensor]) – Source to be prepended to

  • elem (Union[Shape, Stride, Coord, IntTuple, Tile, Layout]) – Element to prepend to input

  • up_to_rank (Union[None, int], optional) – The target rank after extension, defaults to None

  • loc (Optional[Location]) – Source location for MLIR, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point, defaults to None

Returns:

The extended result with prepended elements

Return type:

Union[Shape, Stride, Coord, IntTuple, Tile, Layout, ComposedLayout, Tensor]

Raises:
  • ValueError – If up_to_rank is less than input’s current rank

  • TypeError – If input or elem has unsupported type

Examples:

# Prepend to a Shape
shape = (4,4)
prepend(shape, 2)                   # Returns (2,4,4)

# Prepend to a Layout
layout = make_layout((8,8))
prepend(layout, make_layout((2,)))  # Returns (2,8,8):(1,1,8)

# Prepend with target rank
coord = (1,1)
prepend(coord, 0, up_to_rank=4)     # Returns (0,0,1,1)
cutlass.cute.append(
input,
elem,
up_to_rank: None | int = None,
*,
loc=None,
ip=None,
)#

Extend input to rank up_to_rank by appending elem to the end of input.

This function extends the input object by appending elements to reach a desired rank. It supports various CuTe types including shapes, layouts, tensors etc.

Parameters:
  • input (Union[Shape, Stride, Coord, IntTuple, Tile, Layout, ComposedLayout, Tensor]) – Source to be appended to

  • elem (Union[Shape, Stride, Coord, IntTuple, Tile, Layout]) – Element to append to input

  • up_to_rank (Union[None, int], optional) – The target rank after extension, defaults to None

  • loc (Optional[Location]) – Source location for MLIR, defaults to None

  • ip (Optional[InsertionPoint]) – Insertion point, defaults to None

Returns:

The extended result with appended elements

Return type:

Union[Shape, Stride, Coord, IntTuple, Tile, Layout, ComposedLayout, Tensor]

Raises:
  • ValueError – If up_to_rank is less than input’s current rank

  • TypeError – If input or elem has unsupported type

Examples:

# Append to a Shape
shape = (4,4)
append(shape, 2)                   # Returns (4,4,2)

# Append to a Layout
layout = make_layout((8,8))
append(layout, make_layout((2,)))  # Returns (8,8,2):(1,8,1)

# Append with target rank
coord = (1,1)
append(coord, 0, up_to_rank=4)     # Returns (1,1,0,0)

Note

  • The function preserves the structure of the input while extending it

  • Can be used to extend tensors, layouts, shapes and other CuTe types

  • When up_to_rank is specified, fills remaining positions with elem

  • Useful for tensor reshaping and layout transformations

cutlass.cute.prepend_ones(
t: cutlass.cute.typing.Tensor,
up_to_rank: None | int = None,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
cutlass.cute.append_ones(
t: cutlass.cute.typing.Tensor,
up_to_rank: None | int = None,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
cutlass.cute.ceil_div(
input: cutlass.cute.typing.Shape,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Shape#

Compute the ceiling division of a target shape by a tiling specification.

This function computes the number of tiles required to cover the target domain. It is equivalent to the second mode of zipped_divide(input, tiler).

Parameters:
  • input (Shape) – A tuple of integers representing the dimensions of the target domain.

  • tiler (Union[Layout, Shape, Tile]) – The tiling specification.

  • loc (optional) – Optional location information for IR diagnostics.

  • ip (optional) – Optional instruction pointer or context for underlying IR functions.

Returns:

A tuple of integers representing the number of tiles required along each dimension, i.e. the result of the ceiling division of the input dimensions by the tiler dimensions.

Return type:

Shape

Example:

import cutlass.cute as cute
@cute.jit
def foo():
    input = (10, 6)
    tiler = (3, 4)
    result = cute.ceil_div(input, tiler)
    print(result)  # Outputs: (4, 2)
cutlass.cute.round_up(
a: cutlass.cute.typing.IntTuple,
b: cutlass.cute.typing.IntTuple,
) cutlass.cute.typing.IntTuple#

Rounds up elements of a using elements of b.

cutlass.cute.slice_and_offset(coord, src, *, loc=None, ip=None)#
cutlass.cute.crd2idx(
coord: cutlass.cute.typing.Coord,
layout,
*,
loc=None,
ip=None,
)#

Convert a multi-dimensional coordinate into a value using the specified layout.

This function computes the inner product of the flattened coordinate and stride:

index = sum(c * s for c, s in zip(flatten(coord), flatten(stride)))

Parameters:
  • coord (Coord) – A tuple or list representing the multi-dimensional coordinate (e.g., (i, j) for a 2D layout).

  • layout (Layout or ComposedLayout) – A layout object that defines the memory storage layout, including shape and stride, used to compute the inner product.

  • loc (optional) – Optional location information for IR diagnostics.

  • ip (optional) – Optional instruction pointer or context for underlying IR functions.

Returns:

The result of applying the layout transformation to the provided coordinate.

Return type:

Any type that the layout maps to

Example:

import cutlass.cute as cute
@cute.jit
def foo():
    L = cute.make_layout((5, 4), stride=(4, 1))
    idx = cute.crd2idx((2, 3), L)
    # Computed as: 2 * 4 + 3 = 11
    print(idx)
foo()  # Expected output: 11
cutlass.cute.domain_offset(
coord: cutlass.cute.typing.Coord,
tensor: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
cutlass.cute.elem_less(
lhs: cutlass.cute.typing.Shape | cutlass.cute.typing.IntTuple | cutlass.cute.typing.Coord,
rhs: cutlass.cute.typing.Shape | cutlass.cute.typing.IntTuple | cutlass.cute.typing.Coord,
*,
loc=None,
ip=None,
)#
cutlass.cute.filter_zeros(input, *, target_profile=None, loc=None, ip=None)#

Filter out zeros from a layout or tensor.

This function replaces zero-stride modes of a layout or tensor with size-1 modes, removing their contribution. See Section 3.3 in the CuTe Whitepaper for more details on layout operations.

Parameters:
  • input (Layout or Tensor) – The input layout or tensor to filter

  • target_profile (optional) – Target profile for the filtered result, defaults to None

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The filtered layout or tensor with zeros removed

Return type:

Layout or Tensor

Raises:

TypeError – If input is not a Layout or Tensor

cutlass.cute.filter(
input: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
)#

Filter a layout or tensor.

This function filters a layout or tensor according to CuTe’s filtering rules.

Parameters:
  • input (Layout or Tensor) – The input layout or tensor to filter

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The filtered layout or tensor

Return type:

Layout or Tensor

Raises:

TypeError – If input is not a Layout or Tensor

cutlass.cute.tile_to_shape(
atom: cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster,
trg_shape: cutlass.cute.typing.Shape,
order: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster#
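
Example (a hedged sketch; the resulting strides depend on the atom):

# Tile an (8, 4) atom layout to cover a (32, 16) shape, ordering mode 0 first
atom = make_layout((8, 4))
tiled = tile_to_shape(atom, (32, 16), (0, 1))
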
cutlass.cute.shape_div(
lhs: cutlass.cute.typing.Shape,
rhs: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Shape#

Perform element-wise division of shapes.

This function performs element-wise division between two shapes.

Parameters:
  • lhs (Shape) – Left-hand side shape

  • rhs (Shape) – Right-hand side shape

  • loc (optional) – Source location for MLIR, defaults to None

  • ip (optional) – Insertion point, defaults to None

Returns:

The result of element-wise division

Return type:

Shape
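
Example:

shape_div((12, 8), (4, 2))   # (3, 4)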

cutlass.cute.composition(
lhs,
rhs: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape | cutlass.cute.typing.Tile,
*,
loc=None,
ip=None,
)#

Compose two layout representations using the CuTe layout algebra.

Compose a left-hand layout (or tensor) with a right-hand operand into a new layout R, such that for every coordinate c in the domain of the right-hand operand, the composed layout satisfies:

R(c) = A(B(c))

where A is the left-hand operand provided as lhs and B is the right-hand operand provided as rhs. In this formulation, B defines the coordinate domain while A applies its transformation to B’s output, and the resulting layout R inherits the stride and shape adjustments from A.

Satisfies:

cute.shape(cute.composition(lhs, rhs)) is compatible with cute.shape(rhs)

Parameters:
  • lhs (Layout or Tensor) – The left-hand operand representing the transformation to be applied.

  • rhs (Layout, Shape, or Tile, or int or tuple) – The right-hand operand defining the coordinate domain. If provided as an int or tuple, it will be converted to a tile layout.

  • loc (optional) – Optional location information for IR diagnostics.

  • ip (optional) – Optional instruction pointer or context for underlying IR functions.

Returns:

A new composed layout R, such that for all coordinates c in the domain of rhs, R(c) = lhs(rhs(c)).

Return type:

Layout or Tensor

Example:

import cutlass.cute as cute
@cute.jit
def foo():
    # Create a layout that maps (i,j) to i*4 + j
    L1 = cute.make_layout((2, 3), stride=(4, 1))
    # Create a layout that maps (i,j) to i*3 + j
    L2 = cute.make_layout((3, 4), stride=(3, 1))
    # Compose L1 and L2
    L3 = cute.composition(L1, L2)
    # L3 now maps coordinates through L2 then L1
cutlass.cute.complement(
input: cutlass.cute.typing.Layout,
cotarget: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#

Compute the complement layout of the input layout with respect to the cotarget.

The complement of a layout A with respect to cotarget n is a layout A* such that for every k in Z_n and c in the domain of A, there exists a unique c* in the domain of A* where k = A(c) + A*(c*).

This operation is useful for creating layouts that partition a space in complementary ways, such as row and column layouts that together cover a matrix.

Parameters:
  • input (Layout) – The layout to compute the complement of

  • cotarget (Union[Layout, Shape]) – The target layout or shape that defines the codomain

  • loc (optional) – Optional location information for IR diagnostics

  • ip (optional) – Optional instruction pointer or context for underlying IR functions

Returns:

The complement layout

Return type:

Layout

Example:

import cutlass.cute as cute
@cute.jit
def foo():
    # Create a layout selecting every 4th element of a 16-element vector
    sub_layout = cute.make_layout(4, stride=4)     # 4:4
    # Compute its complement with respect to 16
    comp_layout = cute.complement(sub_layout, 16)  # 4:1
    # Together, sub_layout and comp_layout tile the domain [0, 16)
cutlass.cute.right_inverse(
input: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#
cutlass.cute.left_inverse(
input: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#
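
Example (a hedged sketch of the defining property):

# For a bijective layout L, right_inverse(L) satisfies L(right_inverse(L)(i)) == i
layout = make_layout((4, 2), stride=(2, 1))   # maps onto [0, 8) bijectively
rinv = right_inverse(layout)
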
cutlass.cute.max_common_layout(
a: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Layout#
cutlass.cute.max_common_vector(
a: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.Layout | cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) int#
cutlass.cute.is_congruent(
a: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
) bool#

Returns whether a is congruent to b.

cutlass.cute.is_weakly_congruent(
a: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.XTuple | cutlass.cute.typing.Layout | cutlass._mlir.ir.register_value_caster | cutlass.cute.typing.Tensor,
) bool#

Returns whether a is weakly congruent to b.

cutlass.cute.logical_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.zipped_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.tiled_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.flat_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.raked_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
cutlass.cute.blocked_product(
block,
tiler: cutlass.cute.typing.Layout,
*,
loc=None,
ip=None,
)#
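
Example (a hedged sketch of blocked_product following the CuTe layout-product semantics):

# Tile a (2, 2) block across a (3, 4) arrangement of blocks
block = make_layout((2, 2))
tiler = make_layout((3, 4))
prod = blocked_product(block, tiler)
# expected shape: ((2, 3), (2, 4)) -- each mode pairs the block extent with the tiler extent
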
cutlass.cute.flat_divide(
target,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
)#
cutlass.cute.logical_divide(
target,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
)#
cutlass.cute.zipped_divide(
target,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
)#
cutlass.cute.tiled_divide(
target,
tiler: cutlass.cute.typing.Tiler,
*,
loc=None,
ip=None,
)#
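
Example (a hedged sketch):

# Divide a (12, 8) layout by a (4, 2) tiler
layout = make_layout((12, 8))
tiled = zipped_divide(layout, (4, 2))
# expected shape: ((4, 2), (3, 4)) -- the tile mode first, then the rest mode
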
cutlass.cute.local_partition(
target: cutlass.cute.typing.Tensor,
tiler: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape,
index,
proj: cutlass.cute.typing.XTuple = 1,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
cutlass.cute.local_tile(
input: cutlass.cute.typing.Tensor,
tiler: cutlass.cute.typing.Layout | cutlass.cute.typing.Shape,
coord: cutlass.cute.typing.Coord,
proj: cutlass.cute.typing.XTuple | None = None,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Tensor#
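
Example (a hedged sketch; mA, bidx, and bidy are assumed to be defined elsewhere):

# Extract this CTA's (128, 64) tile of a global-memory tensor
gA = cute.local_tile(mA, (128, 64), (bidx, bidy))
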
cutlass.cute.make_mma_atom(
op: MmaOp,
*,
loc=None,
ip=None,
**kwargs,
) MmaAtom#

Makes an MMA Atom from an MMA Operation.

This function creates an MMA Atom from a given MMA Operation. Arbitrary keyword arguments can be provided for Op-specific additional parameters; none are currently used.

Parameters:

op (MmaOp) – The MMA Operation to construct an Atom for

Returns:

The MMA Atom

Return type:

MmaAtom

cutlass.cute.make_tiled_mma(
op_or_atom: Op | MmaAtom,
atom_layout_mnk=(1, 1, 1),
permutation_mnk=None,
*,
loc=None,
ip=None,
**kwargs,
) TiledMma#

Makes a tiled MMA from an MMA Operation or an MMA Atom.

Parameters:
  • op_or_atom (Union[Op, MmaAtom]) – The MMA Operation or Atom

  • atom_layout_mnk (Layout) – A Layout describing the tiling of Atom across threads

  • permutation_mnk (Tiler) – A permutation Tiler describing the tiling of Atom across values including any permutation of such tiling

Returns:

The resulting tiled MMA

Return type:

TiledMma
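
Example (a hedged sketch; some_mma_op is a placeholder Operation):

# Tile one MMA Atom across a 2x2x1 arrangement of Atoms
tiled_mma = cute.make_tiled_mma(some_mma_op, atom_layout_mnk=(2, 2, 1))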

cutlass.cute.make_copy_atom(
op: CopyOp,
copy_internal_type: Type[cutlass.cute.typing.Numeric],
*,
loc=None,
ip=None,
**kwargs,
) CopyAtom#

Makes a Copy Atom from a Copy Operation.

This function creates a Copy Atom from a given Copy Operation. Arbitrary keyword arguments can be provided for Op-specific additional parameters.

Example:

op = cute.nvgpu.CopyUniversalOp()
atom = cute.make_copy_atom(op, tensor_dtype, num_bits_per_copy=64)
Parameters:
  • op (CopyOp) – The Copy Operation to construct an Atom for

  • copy_internal_type (Type[Numeric]) – An internal data type used to construct the source/destination layouts in unit of tensor elements

Returns:

The Copy Atom

Return type:

CopyAtom

cutlass.cute.make_tiled_copy_tv(
atom,
thr_layout,
val_layout,
*,
loc=None,
ip=None,
) TiledCopy#

Create a tiled copy given separate thread and value layouts. A TV partitioner is inferred from the inputs. The input thread layout must be compact.

Parameters:
  • atom (CopyAtom) – the copy atom, e.g. simt_copy, simt_async_copy, tma_load, etc.

  • thr_layout (Layout) – the mn -> tid mapping; must be compact

  • val_layout (Layout) – the mn -> vid mapping

  • loc (source location for MLIR, optional)

  • ip (insertion point, optional)

Returns:

A tiled copy used for partitioning

Return type:

tiled_copy

cutlass.cute.make_tiled_copy(atom, layout_tv, tiler_mn, *, loc=None, ip=None)#

Create a tiled copy given a TV partitioner and a tiler.

Parameters:
  • atom (CopyAtom) – the copy atom, e.g. simt_copy, simt_async_copy, tma_load, etc.

  • layout_tv (Layout) – the thread-value layout

  • tiler_mn (Tiler) – the logical tile size

  • loc (source location for MLIR, optional)

  • ip (insertion point, optional)

Returns:

A tiled copy

Return type:

tiled_copy

cutlass.cute.make_tiled_copy_C_atom(
atom: CopyAtom,
mma: TiledMma,
*,
loc=None,
ip=None,
)#

Create the smallest tiled copy that can retile LayoutC_TV, for use in pipelined epilogues with subtiled stores.

Parameters:
  • atom (CopyAtom)

  • mma (TiledMma)

  • loc (source location for mlir (optional))

  • ip (insertion point (optional))

Returns:

A tiled copy used for partitioning

Return type:

tiled_copy

cutlass.cute.basic_copy(
src: cutlass.cute.typing.Tensor,
dst: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) None#

Performs a basic element-wise copy.

This function assumes the following pre-condition:

  1. size(src) == size(dst)

When the src and dst shapes are static, the pre-condition is verified and the element-wise loop is fully unrolled.

cutlass.cute.basic_copy_if(
pred: cutlass.cute.typing.Tensor,
src: cutlass.cute.typing.Tensor,
dst: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) None#

Performs a basic predicated element-wise copy.

This function assumes the following pre-conditions:

  1. size(src) == size(dst)

  2. size(src) == size(pred)

When all shapes are static, the pre-conditions are verified and the element-wise loop is fully unrolled.

cutlass.cute.autovec_copy(
src: cutlass.cute.typing.Tensor,
dst: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) None#

Auto-vectorizing SIMT copy policy.

Given source and destination tensors that are statically shaped, this policy determines the largest safe vector width for the copy instruction and performs the copy.
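
Example (a hedged sketch; tXsX is assumed to be a statically shaped tensor partition):

# Copy a shared-memory tile into a register fragment of matching shape
tXrX = cute.make_fragment_like(tXsX)
cute.autovec_copy(tXsX, tXrX)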

cutlass.cute.copy(
atom: CopyAtom,
src: cutlass.cute.typing.Tensor,
dst: cutlass.cute.typing.Tensor,
*,
pred: cutlass.cute.typing.Tensor | None = None,
loc=None,
ip=None,
**kwargs,
) None#

The Copy algorithm.

The “copy with Atom” expects source and destination tensors to be partitioned according to the provided Copy Atom. Some Atoms require additional Op-specific keyword arguments, for example TMA copies:

cute.copy(tma_atom, src, dst, tma_bar_ptr=mbar_ptr, mcast_mask=mask)

An additional predication tensor can be provided. If the partitioned tensors have the following logical profile ((ATOM_V,ATOM_REST),REST_M,...), the predication tensor must have a profile consistent with (ATOM_REST,REST_M,...).
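
Example (an illustrative sketch; tiled_copy, atom, tidx, and the gA/sA tensors are assumptions):

# Partition source and destination with a TiledCopy, then issue the copy
thr_copy = tiled_copy.get_slice(tidx)
tAgA = thr_copy.partition_S(gA)  # partitioned source, e.g. in gmem
tAsA = thr_copy.partition_D(sA)  # partitioned destination, e.g. in smem
cute.copy(atom, tAgA, tAsA)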

cutlass.cute.gemm(
atom: MmaAtom,
d: cutlass.cute.typing.Tensor,
a: cutlass.cute.typing.Tensor,
b: cutlass.cute.typing.Tensor,
c: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
**kwargs,
) None#

The GEMM algorithm.

Computes D <- AB + C where C and D can alias. Note that some MMA Atoms (e.g. warpgroup-wide or tcgen05 MMAs) require manually setting an “accumulate” boolean field.

All tensors must be partitioned according to the provided MMA Atom.
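
Example (an illustrative sketch; tiled_mma, tidx, and the sA/sB/gC tensors are assumptions):

thr_mma = tiled_mma.get_slice(tidx)
tCsA = thr_mma.partition_A(sA)
tCsB = thr_mma.partition_B(sB)
tCgC = thr_mma.partition_C(gC)
# D <- A * B + C, accumulating in place into the partitioned C
cute.gemm(tiled_mma, tCgC, tCsA, tCsB, tCgC)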

cutlass.cute.full(
shape,
fill_value,
dtype: Type[cutlass.cute.typing.Numeric],
*,
loc=None,
ip=None,
) TensorSSA#

Return a new TensorSSA of given shape and type, filled with fill_value.

Parameters:
  • shape (tuple) – Shape of the new tensor.

  • fill_value (scalar) – Value to fill the tensor with.

  • dtype (Type[Numeric]) – Data type of the tensor.

Returns:

Tensor of fill_value with the specified shape and dtype.

Return type:

TensorSSA
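
Example:

# A 2x3 Float32 tensor filled with 0.5
t = cute.full((2, 3), 0.5, cutlass.Float32)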

cutlass.cute.full_like(
a: TensorSSA,
fill_value,
dtype: Type[cutlass.cute.typing.Numeric] | None = None,
*,
loc=None,
ip=None,
) TensorSSA#

Return a full TensorSSA with the same shape and type as a given array.

Parameters:
  • a (TensorSSA) – The shape and data-type of a define these same attributes of the returned tensor.

  • fill_value (scalar) – Fill value.

  • dtype (Union[None, Type[Numeric]], optional) – Overrides the data type of the result, defaults to None

Returns:

Tensor of fill_value with the same shape and type as a.

Return type:

TensorSSA

See also

empty_like(): Return an empty array with shape and type of input.

ones_like(): Return an array of ones with shape and type of input.

zeros_like(): Return an array of zeros with shape and type of input.

full(): Return a new array of given shape filled with value.

Examples

frg = cute.make_fragment((2, 3), cutlass.Float32)
a = frg.load()
b = cute.full_like(a, 1.0)
cutlass.cute.empty_like(a, dtype=None)#

Return a new TensorSSA with the same shape and type as a given array, without initializing entries.

Parameters:
  • a (TensorSSA) – The shape and data-type of a define these same attributes of the returned array.

  • dtype (Type[Numeric], optional) – Overrides the data type of the result, defaults to None

Returns:

Uninitialized tensor with the same shape and type (unless overridden) as a.

Return type:

TensorSSA

cutlass.cute.ones_like(a, dtype=None)#

Return a TensorSSA of ones with the same shape and type as a given array.

Parameters:
  • a (TensorSSA) – The shape and data-type of a define these same attributes of the returned array.

  • dtype (Type[Numeric], optional) – Overrides the data type of the result, defaults to None

Returns:

Tensor of ones with the same shape and type (unless overridden) as a.

Return type:

TensorSSA

cutlass.cute.zeros_like(a, dtype=None, *, loc=None, ip=None)#

Return a TensorSSA of zeros with the same shape and type as a given array.

Parameters:
  • a (TensorSSA) – The shape and data-type of a define these same attributes of the returned array.

  • dtype (Type[Numeric], optional) – Overrides the data type of the result, defaults to None

Returns:

Tensor of zeros with the same shape and type (unless overridden) as a.

Return type:

TensorSSA
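
Example (a minimal sketch covering the *_like family):

a = cute.full((2, 3), 7.0, cutlass.Float32)
z = cute.zeros_like(a)                # zeros, same shape and dtype
o = cute.ones_like(a, cutlass.Int32)  # ones, dtype overridden
u = cute.empty_like(a)                # uninitialized, same shape and dtype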

cutlass.cute.where(
cond: TensorSSA,
x: TensorSSA,
y: TensorSSA,
*,
loc=None,
ip=None,
) TensorSSA#

Return elements chosen from x or y depending on condition.

Parameters:
  • cond (TensorSSA) – Where True, yield x, where False, yield y.

  • x (TensorSSA) – Values from which to choose when condition is True.

  • y (TensorSSA) – Values from which to choose when condition is False.

Returns:

A tensor with elements from x where condition is True, and elements from y where condition is False.

Return type:

TensorSSA
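
Example (a minimal sketch built from the documented full signature):

cond = cute.full((2, 2), True, cutlass.Boolean)
x = cute.full((2, 2), 1.0, cutlass.Float32)
y = cute.full((2, 2), 0.0, cutlass.Float32)
z = cute.where(cond, x, y)  # every element comes from x here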

cutlass.cute.any_(
x: TensorSSA,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Boolean#

Test whether any tensor element evaluates to True.

Parameters:

x (TensorSSA) – Input tensor.

Returns:

Returns a Boolean scalar containing True if any element of x is True, False otherwise.

Return type:

Boolean

cutlass.cute.all_(
x: TensorSSA,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Boolean#

Test whether all tensor elements evaluate to True.

Parameters:

x (TensorSSA) – Input tensor.

Returns:

Returns a Boolean scalar containing True if all elements of x are True, False otherwise.

Return type:

Boolean
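
Example (a minimal sketch for both reductions):

mask = cute.full((4,), True, cutlass.Boolean)
cute.all_(mask)  # True: every element is True
cute.any_(mask)  # True: at least one element is True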

cutlass.cute.repeat_like(x, target)#

Creates an object congruent to target and filled with x.

This function recursively creates a nested tuple structure that matches the structure of the target, with each leaf node filled with the value x.

Parameters:
  • x (Any) – The value to fill the resulting structure with

  • target (Union[tuple, Any]) – The structure to mimic

Returns:

A structure matching target but filled with x

Return type:

Union[tuple, Any]

Examples:

repeat_like(0, (1, 2, 3))      # Returns (0, 0, 0)
repeat_like(1, ((1, 2), 3))    # Returns ((1, 1), 1)
repeat_like(2, 5)              # Returns 2
class cutlass.cute.ScaledBasis(value, mode)#

Bases: object

A class representing a scaled basis element in CuTe’s layout algebra.

ScaledBasis is used to represent elements in the layout algebra, particularly in the context of composition operations. It consists of a value (scale) and a mode that identifies the basis element.

Parameters:
  • value (Union[int, Integer, Ratio, ir.Value]) – The scale value

  • mode (Union[int, List[int]]) – The mode identifying the basis element

Raises:

TypeError – If mode is not an integer or list of integers

Examples:

# Create a scaled basis with integer scale and mode
sb1 = ScaledBasis(2, 0)  # 2 * E(0)

# Create a scaled basis with a Ratio scale
sb2 = ScaledBasis(Ratio(1, 2), 1)  # (1/2) * E(1)

# Create a scaled basis with a list of modes
sb3 = ScaledBasis(4, [0, 1])  # 4 * E([0, 1])

# Scaled basis elements are commonly used in layout strides
layout = make_layout((4, 8), stride=(ScaledBasis(1, 0), ScaledBasis(1, 1)))

# This creates a layout with strides (1@0, 1@1) representing
# a coordinate system where each dimension has its own basis
__init__(value, mode) None#
is_static() bool#

Check if the value is statically known.

Returns:

True if the value is not a dynamic expression

Return type:

bool

to(dtype)#

Convert to another type.

Parameters:

dtype (type) – The target type for conversion

Returns:

The ScaledBasis converted to the specified type

Raises:

TypeError – If conversion to the specified type is not supported

property value#

Get the scale value.

Returns:

The scale value

property mode: List[int]#

Get the mode identifying the basis element.

Returns:

The mode as a list of integers

Return type:

List[int]

class cutlass.cute.struct(cls)#

Bases: object

Decorator to abstract a C-style structure in the Python DSL.

Usage:

# Supports base_dsl scalar int/float elements, array and nested struct:
@cute.struct
class complex:
    real : cutlass.Float32
    imag : cutlass.Float32
@cute.struct
class StorageA:
    mbarA : cute.struct.MemRange[cutlass.Int64, stage]
    compA : complex
    intA : cutlass.Int16

# Supports alignment for its elements:
@cute.struct
class StorageB:
    a: cute.struct.Align[
        cute.struct.MemRange[cutlass.Float32, size_a], 1024
    ]
    b: cute.struct.Align[
        cute.struct.MemRange[cutlass.Float32, size_b], 1024
    ]
    x: cute.struct.Align[cutlass.Int32, 16]
    compA: cute.struct.Align[complex, 16]

# Statically get size and alignment:
size = StorageB.__sizeof__()
align = StorageB.__alignof__()

# Allocate and referencing elements:
storage = allocator.allocate(StorageB)

storage.a[0] ...
storage.x ...
storage.compA.real ...
Parameters:

cls – The struct class with annotations.

Returns:

The decorated struct class.

class _MemRangeMeta(name, bases, dct)#

Bases: type

A metaclass for creating MemRange classes.

This metaclass is used to dynamically create MemRange classes with specific data types and sizes.

Variables:
  • _dtype – The data type of the MemRange.

  • _size – The size of the MemRange.

_dtype = None#
_size = None#
property size#
property elem_width#
property size_in_bytes#
class MemRange#

Bases: object

Defines a range of memory by MemRange[T, size].

class _MemRangeData(dtype, size, base)#

Bases: object

Represents a range of memory.

Parameters:
  • dtype – The data type.

  • size – The size of the memory range in bytes.

  • base – The base address of the memory range.

__init__(dtype, size, base)#

Initializes a new memory range.

Parameters:
  • dtype – The data type.

  • size – The size of the memory range in bytes.

  • base – The base address of the memory range.

data_ptr()#

Returns a pointer to the start of the data in this memory range.

Returns:

A pointer to the start of the memory range.

Raises:

AssertionError – If the size of the memory range is not greater than zero.

get_tensor(layout, swizzle=None, dtype=None)#

Creates a tensor from the memory range.

Parameters:
  • layout – The layout of the tensor.

  • swizzle – Optional swizzle pattern.

  • dtype – Optional data type; defaults to the memory range’s data type if not specified.

Returns:

A tensor representing the memory range.

Raises:
  • TypeError – If the layout is incompatible with the swizzle.

  • AssertionError – If the size of the memory range is not greater than zero.
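
Example (an illustrative sketch; storage and size_a refer to the StorageB usage shown above, and the layout size is assumed to match the MemRange's element count):

layout = cute.make_layout(size_a)
tensor_a = storage.a.get_tensor(layout)
ptr = storage.a.data_ptr()  # pointer to the start of the range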

class _AlignMeta(name, bases, dct)#

Bases: type

Aligns the given object by setting its alignment attribute.

Parameters:
  • v – The object to align. Must be a struct, MemRange, or a scalar type.

  • align – The alignment value to set.

Returns:

A copy of the object with the specified alignment.

Raises:

TypeError – If the object is not a struct, MemRange, or a scalar type.

class Align#

Bases: object

Aligns the given type by Align[T, alignment].

static _is_scalar_type(dtype)#

Checks if the given type is a scalar numeric type.

Parameters:

dtype – The type to check.

Returns:

True if the type is a subclass of Numeric, False otherwise.

__init__(cls)#

Initializes a new struct decorator instance.

Parameters:

cls – The class representing the structured data type.

Raises:

TypeError – If the struct is empty.

size_in_bytes() int#

Returns the size of the struct in bytes.

Returns:

The size of the struct.