arch#

The cute.arch module provides lightweight wrappers for NVVM Operation builders which implement CUDA built-in device functions such as thread_idx. It integrates seamlessly with CuTe DSL types.

These wrappers enable source location tracking through the @dsl_user_op decorator. The module includes the following functionality:

Core CUDA built-in functions such as thread_idx, warp_idx, block_dim, grid_dim, cluster_dim, and related functions
Memory barrier management functions including mbarrier_init, mbarrier_arrive, mbarrier_wait, and associated operations
Low-level shared memory (SMEM) management capabilities, with SmemAllocator as the recommended interface
Low-level tensor memory (TMEM) management capabilities, with TmemAllocator as the recommended interface

API documentation#

cutlass.cute.arch.make_warp_uniform( value: cutlass.cute.typing.Int, *, loc=None, ip=None, ) → cutlass.cute.typing.Int32#

Provides a compiler hint indicating that the specified value is invariant across all threads in the warp, which may enable performance optimizations.

Parameters:: value (Int) – The integer value to be marked as warp-uniform.
Returns:: The input value, marked as warp-uniform.
Return type:: Int32

cutlass.cute.arch.elect_one(*, loc=None, ip=None) → IfOpRegion#

Elects one thread within a warp.

with elect_one():
    # Only one thread in the warp executes the code in this context
    pass

cutlass.cute.arch.mbarrier_init( mbar_ptr: cutlass.cute.typing.Pointer, cnt: cutlass.cute.typing.Int, *, loc=None, ip=None, ) → None#

Initializes a mbarrier with the specified thread arrival count.

Parameters:

mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
cnt (Int) – The arrival count of the mbarrier

cutlass.cute.arch.mbarrier_init_fence(*, loc=None, ip=None) → None#: A fence operation that applies to the mbarrier initializations.

cutlass.cute.arch.mbarrier_arrive_and_expect_tx( mbar_ptr: cutlass.cute.typing.Pointer, bytes: cutlass.cute.typing.Int, peer_cta_rank_in_cluster=None, *, loc=None, ip=None, ) → None#

Arrives on a mbarrier and expects a specified number of transaction bytes.

Parameters:

mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
bytes (Int) – The number of transaction bytes
peer_cta_rank_in_cluster – An optional CTA rank in cluster. If provided, the pointer to the mbarrier is converted to a remote address in the peer CTA’s SMEM.

cutlass.cute.arch.mbarrier_expect_tx( mbar_ptr: cutlass.cute.typing.Pointer, bytes: cutlass.cute.typing.Int, peer_cta_rank_in_cluster=None, *, loc=None, ip=None, ) → None#

Expects a specified number of transaction bytes without an arrive.

Parameters:

mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
bytes (Int) – The number of transaction bytes
peer_cta_rank_in_cluster – An optional CTA rank in cluster. If provided, the pointer to the mbarrier is converted to a remote address in the peer CTA’s SMEM.

cutlass.cute.arch.mbarrier_wait( mbar_ptr: cutlass.cute.typing.Pointer, phase: cutlass.cute.typing.Int, *, loc=None, ip=None, ) → None#

Waits on a mbarrier with a specified phase.

Parameters:

mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
phase (Int) – The phase to wait for (either 0 or 1)

cutlass.cute.arch.mbarrier_try_wait( mbar_ptr: cutlass.cute.typing.Pointer, phase: cutlass.cute.typing.Int, *, loc=None, ip=None, ) → cutlass.cute.typing.Boolean#

Attempts to wait on a mbarrier with a specified phase in a non-blocking fashion.

Parameters:

mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
phase (Int) – The phase to wait for (either 0 or 1)

Returns:

A boolean value indicating whether the wait operation was successful

Return type:

Boolean

cutlass.cute.arch.mbarrier_conditional_try_wait( cond, mbar_ptr: cutlass.cute.typing.Pointer, phase: cutlass.cute.typing.Int, *, loc=None, ip=None, ) → cutlass.cute.typing.Boolean#

Conditionally attempts to wait on a mbarrier with a specified phase in a non-blocking fashion.

Parameters:

cond – A boolean predicate
mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
phase (Int) – The phase to wait for (either 0 or 1)

Returns:

A boolean value indicating whether the wait operation was successful

Return type:

Boolean

cutlass.cute.arch.mbarrier_arrive( mbar_ptr: cutlass.cute.typing.Pointer, peer_cta_rank_in_cluster: cutlass.cute.typing.Int | None = None, *, loc=None, ip=None, ) → None#

Arrives on an mbarrier.

Parameters:

mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
peer_cta_rank_in_cluster – An optional CTA rank in cluster. If provided, the pointer to the mbarrier is converted to a remote address in the peer CTA’s SMEM.

cutlass.cute.arch.lane_idx(*, loc=None, ip=None) → cutlass.cute.typing.Int32#: Returns the lane index of the current thread within the warp.

cutlass.cute.arch.warp_idx(*, loc=None, ip=None) → cutlass.cute.typing.Int32#: Returns the warp index within a CTA.

cutlass.cute.arch.thread_idx( *, loc=None, ip=None, ) → Tuple[cutlass.cute.typing.Int32, cutlass.cute.typing.Int32, cutlass.cute.typing.Int32]#: Returns the thread index within a CTA.

cutlass.cute.arch.block_dim( *, loc=None, ip=None, ) → Tuple[cutlass.cute.typing.Int32, cutlass.cute.typing.Int32, cutlass.cute.typing.Int32]#: Returns the number of threads in each dimension of the CTA.

cutlass.cute.arch.block_idx( *, loc=None, ip=None, ) → Tuple[cutlass.cute.typing.Int32, cutlass.cute.typing.Int32, cutlass.cute.typing.Int32]#: Returns the CTA identifier within a grid.

cutlass.cute.arch.grid_dim( *, loc=None, ip=None, ) → Tuple[cutlass.cute.typing.Int32, cutlass.cute.typing.Int32, cutlass.cute.typing.Int32]#: Returns the number of CTAs in each dimension of the grid.

cutlass.cute.arch.cluster_idx( *, loc=None, ip=None, ) → Tuple[cutlass.cute.typing.Int32, cutlass.cute.typing.Int32, cutlass.cute.typing.Int32]#: Returns the cluster identifier within a grid.

cutlass.cute.arch.cluster_dim( *, loc=None, ip=None, ) → Tuple[cutlass.cute.typing.Int32, cutlass.cute.typing.Int32, cutlass.cute.typing.Int32]#: Returns the number of clusters in each dimension of the grid.

cutlass.cute.arch.cluster_size(*, loc=None, ip=None) → cutlass.cute.typing.Int32#: Returns the number of CTA within the cluster.

cutlass.cute.arch.block_in_cluster_idx( *, loc=None, ip=None, ) → Tuple[cutlass.cute.typing.Int32, cutlass.cute.typing.Int32, cutlass.cute.typing.Int32]#: Returns the CTA index within a cluster across all dimensions.

cutlass.cute.arch.block_in_cluster_dim( *, loc=None, ip=None, ) → Tuple[cutlass.cute.typing.Int32, cutlass.cute.typing.Int32, cutlass.cute.typing.Int32]#: Returns the dimensions of the cluster.

cutlass.cute.arch.block_idx_in_cluster( *, loc=None, ip=None, ) → cutlass.cute.typing.Int32#: Returns the linearized identifier of the CTA within the cluster.

cutlass.cute.arch.barrier( *, barrier_id=None, number_of_threads=None, loc=None, ip=None, ) → None#: Creates a barrier, optionally named.

cutlass.cute.arch.barrier_arrive( *, barrier_id=None, number_of_threads=None, loc=None, ip=None, ) → None#

cutlass.cute.arch.sync_threads(*, loc=None, ip=None) → None#: Synchronizes all threads within a CTA.

cutlass.cute.arch.sync_warp( mask: cutlass.cute.typing.Int = 4294967295, *, loc=None, ip=None, ) → None#: Performs a warp-wide sync with an optional mask.

cutlass.cute.arch.fence_acq_rel_cta(*, loc=None, ip=None) → None#

Fence operation with acquire-release semantics.

See the PTX documentation.

cutlass.cute.arch.fence_acq_rel_cluster(*, loc=None, ip=None) → None#

Fence operation with acquire-release semantics.

See the PTX documentation.

cutlass.cute.arch.fence_acq_rel_gpu(*, loc=None, ip=None) → None#

Fence operation with acquire-release semantics.

See the PTX documentation.

cutlass.cute.arch.fence_acq_rel_sys(*, loc=None, ip=None) → None#

Fence operation with acquire-release semantics.

See the PTX documentation.

cutlass.cute.arch.cp_async_commit_group(*, loc=None, ip=None) → None#

Commits all prior initiated but uncommitted cp.async instructions.

See the PTX documentation.

cutlass.cute.arch.cp_async_wait_group(n, *, loc=None, ip=None) → None#

Waits till only a specified numbers of cp.async groups are pending.

See the PTX documentation.

cutlass.cute.arch.cp_async_bulk_commit_group(*, loc=None, ip=None) → None#

Commits all prior initiated but uncommitted cp.async.bulk instructions.

See the PTX documentation.

cutlass.cute.arch.cp_async_bulk_wait_group( group, *, read=None, loc=None, ip=None, ) → None#

Waits till only a specified numbers of cp.async.bulk groups are pending.

See the PTX documentation.

cutlass.cute.arch.cluster_wait(*, loc=None, ip=None) → None#: A cluster-wide wait operation.

cutlass.cute.arch.cluster_arrive(*, aligned=None, loc=None, ip=None) → None#: A cluster-wide arrive operation.

cutlass.cute.arch.cluster_arrive_relaxed(*, aligned=None, loc=None, ip=None) → None#: A cluster-wide arrive operation with relaxed semantics.

cutlass.cute.arch.vote_ballot_sync( pred: cutlass.cute.typing.Boolean, mask: cutlass.cute.typing.Int = 4294967295, *, loc=None, ip=None, ) → cutlass.cute.typing.Int32#

Performs a ballot operation across the warp.

It copies the predicate from each thread in mask into the corresponding bit position of destination register d, where the bit position corresponds to the thread’s lane id.

Parameters:

pred (Boolean) – The predicate value for the current thread
mask (Int, optional) – A 32-bit integer mask specifying which threads participate, defaults to all threads (0xFFFFFFFF)

Returns:

A 32-bit integer where each bit represents a thread’s predicate value

Return type:

Int32

See the PTX documentation.

cutlass.cute.arch.vote_any_sync( pred: cutlass.cute.typing.Boolean, mask: cutlass.cute.typing.Int = 4294967295, *, loc=None, ip=None, ) → cutlass.cute.typing.Boolean#

True if source predicate is True for any non-exited threads in mask. Negate the source predicate to compute .not_all.

Parameters:

pred (Boolean) – The predicate value for the current thread
mask (Int, optional) – A 32-bit integer mask specifying which threads participate, defaults to all threads (0xFFFFFFFF)

Returns:

A boolean value indicating if the source predicate is True for all non-exited threads in mask

Return type:

Boolean

See the PTX documentation.

cutlass.cute.arch.vote_all_sync( pred: cutlass.cute.typing.Boolean, mask: cutlass.cute.typing.Int = 4294967295, *, loc=None, ip=None, ) → cutlass.cute.typing.Boolean#

True if source predicate is True for all non-exited threads in mask. Negate the source predicate to compute .none.

Parameters:

pred (Boolean) – The predicate value for the current thread
mask (Int, optional) – A 32-bit integer mask specifying which threads participate, defaults to all threads (0xFFFFFFFF)

Returns:

A boolean value indicating if the source predicate is True for all non-exited threads in mask

Return type:

Boolean

See the PTX documentation.

cutlass.cute.arch.vote_uni_sync( pred: cutlass.cute.typing.Boolean, mask: cutlass.cute.typing.Int = 4294967295, *, loc=None, ip=None, ) → cutlass.cute.typing.Boolean#

True f source predicate has the same value in all non-exited threads in mask. Negating the source predicate also computes .uni

Parameters:

pred (Boolean) – The predicate value for the current thread
mask (Int, optional) – A 32-bit integer mask specifying which threads participate, defaults to all threads (0xFFFFFFFF)

Returns:

A boolean value indicating if the source predicate is True for all non-exited threads in mask

Return type:

Boolean

cutlass.cute.arch.popc( value: cutlass.cute.typing.Numeric, *, loc=None, ip=None, ) → cutlass.cute.typing.Numeric#: Performs a population count operation.

cutlass.cute.arch.fence_proxy( kind: cutlass._mlir.dialects.nvvm.ProxyKind, *, space: cutlass._mlir.dialects.nvvm.SharedSpace | None = None, use_intrinsic=None, loc=None, ip=None, ) → None#

cutlass.cute.arch.fmax( a: float | cutlass.cute.typing.Float32, b: float | cutlass.cute.typing.Float32, *, loc=None, ip=None, ) → cutlass.cute.typing.Float32#

cutlass.cute.arch.rcp_approx( a: float | cutlass.cute.typing.Float32, *, loc=None, ip=None, )#

cutlass.cute.arch.exp2( a: float | cutlass.cute.typing.Float32, *, loc=None, ip=None, ) → cutlass.cute.typing.Float32#

cutlass.cute.arch.alloc_smem( element_type: Type[cutlass.cute.typing.Numeric], size_in_elems: int, alignment: int | None = None, *, loc=None, ip=None, ) → cutlass.cute.typing.Pointer#

Statically allocates SMEM.

Parameters:

element_type (Type[Numeric]) – The pointee type of the pointer.
size_in_elems (int) – The size of the allocation in terms of number of elements of the pointee type
alignment (int) – An optional pointer alignment for the allocation

Returns:

A pointer to the start of the allocation

Return type:

Pointer

cutlass.cute.arch.get_dyn_smem( element_type: Type[cutlass.cute.typing.Numeric], alignment: int | None = None, *, loc=None, ip=None, ) → cutlass.cute.typing.Pointer#

Retrieves a pointer to a dynamic SMEM allocation.

Parameters:

element_type (Type[Numeric]) – The pointee type of the pointer.
alignment (int) – An optional pointer alignment, the result pointer is offset appropriately

Returns:

A pointer to the start of the dynamic SMEM allocation with a correct alignement

Return type:

Pointer

cutlass.cute.arch.get_dyn_smem_size(*, loc=None, ip=None) → int#

Gets the size in bytes of the dynamic shared memory that was specified at kernel launch time. This can be used for bounds checking during shared memory allocation.

Returns:: The size of dynamic shared memory in bytes
Return type:: int

cutlass.cute.arch.retrieve_tmem_ptr( element_type: Type[cutlass.cute.typing.Numeric], alignment: int, ptr_to_buffer_holding_addr: cutlass.cute.typing.Pointer, *, loc=None, ip=None, ) → cutlass.cute.typing.Pointer#

Retrieves a pointer to TMEM with the provided element type and alignment.

Parameters:

element_type (Type[Numeric]) – The pointee type of the pointer.
alignment (int) – The alignment of the result pointer
ptr_to_buffer_holding_addr (Pointer) – A pointer to a SMEM buffer holding the TMEM address of the start of the allocation allocation

Returns:

A pointer to TMEM

Return type:

Pointer

cutlass.cute.arch.alloc_tmem( num_columns: cutlass.cute.typing.Int, smem_ptr_to_write_address: cutlass.cute.typing.Pointer, is_two_cta=None, *, loc=None, ip=None, ) → None#

Allocates TMEM.

Parameters:

num_columns (Int) – The number of TMEM columns to allocate
smem_ptr_to_write_address (Pointer) – A pointer to a SMEM buffer where the TMEM address is written to
is_two_cta – Optional boolean parameter for 2-CTA MMAs

cutlass.cute.arch.relinquish_tmem_alloc_permit( is_two_cta=None, *, loc=None, ip=None, ) → None#: Relinquishes the right to allocate TMEM so that other CTAs potentially in a different grid can allocate.

cutlass.cute.arch.dealloc_tmem( tmem_ptr: cutlass.cute.typing.Pointer, num_columns: cutlass.cute.typing.Int, is_two_cta=None, *, loc=None, ip=None, ) → None#

Deallocates TMEM using the provided pointer and number of columns.

Parameters:

tmem_ptr (Pointer) – A pointer to the TMEM allocation to de-allocate
num_columns (Int) – The number of columns in the TMEM allocation
is_two_cta – Optional boolean parameter for 2-CTA MMAs

cutlass.cute.arch.prmt(src, src_reg_shifted, prmt_indices, *, loc=None, ip=None)#

cutlass.cute.arch.cvt_i8_bf16_intrinsic(vec_i8, length, *, loc=None, ip=None)#

Convert a vector of int8 to a vector of bfloat16.

Parameters:

vec_i8 (1D vector of int8) – The input vector of int8.
length (int) – The length of the input vector.

Returns:

The output 1D vector of bfloat16 with the same length as the input vector.

Return type:

1D vector of bfloat16

cutlass.cute.arch.cvt_i4_bf16_intrinsic(vec_i4, length, *, loc=None, ip=None)#

Convert a vector of int4 to a vector of bfloat16.

Parameters:

vec_i4 (1D vector of int4) – The input vector of int4.
length (int) – The length of the input vector.

Returns:

The output 1D vector of bfloat16 with the same length as the input vector.

Return type:

1D vector of bfloat16

cutlass.cute.arch.cvt_f4e2m1_f16_intrinsic(vec_f4e2m1, length, *, loc=None, ip=None)#

Convert a vector of float4e2m1 to a vector of float16.

Parameters:

vec_f4e2m1 (1D vector of float4e2m1) – The input vector of float4e2m1.
length (int) – The length of the input vector.

Returns:

The output 1D vector of float16 with the same length as the input vector.

Return type:

1D vector of float16

cutlass.cute.arch.cvt_i8x4_to_f32x4(src_vec4, *, loc=None, ip=None)#

cutlass.cute.arch.cvt_i8x2_to_f32x2(src_vec2, *, loc=None, ip=None)#

cutlass.cute.arch.cvt_i8_bf16(src_i8, *, loc=None, ip=None)#

cutlass.cute.arch.cvt_f32x2_bf16x2(src_vec2, *, loc=None, ip=None)#