cutlass.cute.arch#
The cute.arch
module contains wrappers around NVVM-level MLIR Op builders that seamlessly
inter-operate with the Python types used in CUTLASS Python. Another benefit of wrapping these Op
builders is that the source location can be tracked with the @dsl_user_op
decorator. Available
functions include
basic API like
thr_idx
;functions related to the direct management of mbarriers;
low-level SMEM management (prefer using the
SmemAllocator
class);TMEM management.
API documentation#
- cutlass.cute.arch.make_warp_uniform(
- value: cutlass.cute.typing.Int,
- *,
- loc=None,
- ip=None,
Creates a warp-uniform value from the given integer input.
- Parameters:
value (Int) – The integer to make warp uniform.
- Returns:
The warp-uniform value equal to the input.
- Return type:
Int32
- cutlass.cute.arch.elect_one(*, loc=None, ip=None) IfOpRegion #
Elects one thread within a warp.
with elect_one(): # Only one thread in the warp executes the code in this context pass
- cutlass.cute.arch.mbarrier_init_arrive_cnt(
- mbar_ptr: cutlass.cute.typing.Pointer,
- cnt: cutlass.cute.typing.Int,
- *,
- loc=None,
- ip=None,
Initializes a mbarrier with the specified thread arrival count.
- Parameters:
mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
cnt (Int) – The arrival count of the mbarrier
- cutlass.cute.arch.mbarrier_init_fence(*, loc=None, ip=None) None #
A fence operation that applies to the mbarrier initializations.
- cutlass.cute.arch.mbarrier_init_tx_bytes(
- mbar_ptr: cutlass.cute.typing.Pointer,
- bytes: cutlass.cute.typing.Int,
- peer_cta_rank_in_cluster=None,
- *,
- loc=None,
- ip=None,
Initializes a mbarrier with the specified number of transaction bytes.
- Parameters:
mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
bytes (Int) – The number of transaction bytes
peer_cta_rank_in_cluster – An optional CTA rank in cluster. If provided, the pointer to the mbarrier is converted to a remote address in the peer CTA’s SMEM.
- cutlass.cute.arch.mbarrier_wait(
- mbar_ptr: cutlass.cute.typing.Pointer,
- phase: cutlass.cute.typing.Int,
- *,
- loc=None,
- ip=None,
Waits on a mbarrier with a specified phase.
- Parameters:
mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
phase (Int) – The phase to wait for (either 0 or 1)
- cutlass.cute.arch.mbarrier_try_wait(
- mbar_ptr: cutlass.cute.typing.Pointer,
- phase: cutlass.cute.typing.Int,
- *,
- loc=None,
- ip=None,
Attempts to wait on a mbarrier with a specified phase in a non-blocking fashion.
- Parameters:
mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
phase (Int) – The phase to wait for (either 0 or 1)
- Returns:
A boolean value indicating whether the wait operation was successful
- Return type:
Boolean
- cutlass.cute.arch.conditional_mbarrier_try_wait(
- cond,
- mbar_ptr: cutlass.cute.typing.Pointer,
- phase: cutlass.cute.typing.Int,
- *,
- loc=None,
- ip=None,
Conditionally attempts to wait on a mbarrier with a specified phase in a non-blocking fashion.
- Parameters:
cond – A boolean predicate
mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
phase (Int) – The phase to wait for (either 0 or 1)
- Returns:
A boolean value indicating whether the wait operation was successful
- Return type:
Boolean
- cutlass.cute.arch.mbarrier_arrive(
- mbar_ptr: cutlass.cute.typing.Pointer,
- peer_cta_rank_in_cluster: cutlass.cute.typing.Int | None = None,
- *,
- loc=None,
- ip=None,
Arrives on an mbarrier.
- Parameters:
mbar_ptr (Pointer) – A pointer to the mbarrier in SMEM
peer_cta_rank_in_cluster – An optional CTA rank in cluster. If provided, the pointer to the mbarrier is converted to a remote address in the peer CTA’s SMEM.
- cutlass.cute.arch.lane_idx(*, loc=None, ip=None) cutlass.cute.typing.Int32 #
Returns the lane index of the current thread within the warp.
- cutlass.cute.arch.warp_idx(*, loc=None, ip=None) cutlass.cute.typing.Int32 #
Returns the warp index within a CTA.
- cutlass.cute.arch.thread_idx(
- *,
- loc=None,
- ip=None,
Returns the thread index within a CTA.
- cutlass.cute.arch.block_dim(
- *,
- loc=None,
- ip=None,
Returns the number of threads in each dimension of the CTA.
- cutlass.cute.arch.block_idx(
- *,
- loc=None,
- ip=None,
Returns the CTA identifier within a grid.
- cutlass.cute.arch.grid_dim(
- *,
- loc=None,
- ip=None,
Returns the number of CTAs in each dimension of the grid.
- cutlass.cute.arch.cluster_idx(
- *,
- loc=None,
- ip=None,
Returns the cluster identifier within a grid.
- cutlass.cute.arch.cluster_dim(
- *,
- loc=None,
- ip=None,
Returns the number of clusters in each dimension of the grid.
- cutlass.cute.arch.block_in_cluster_idx(
- *,
- loc=None,
- ip=None,
Returns the CTA index within a cluster across all dimensions.
- cutlass.cute.arch.block_in_cluster_dim(
- *,
- loc=None,
- ip=None,
Returns the dimensions of the cluster.
- cutlass.cute.arch.block_idx_in_cluster(
- *,
- loc=None,
- ip=None,
Returns the linearized identifier of the CTA within the cluster.
- cutlass.cute.arch.barrier(
- *,
- barrier_id=None,
- number_of_threads=None,
- loc=None,
- ip=None,
Creates a barrier, optionally named.
- cutlass.cute.arch.sync_threads(*, loc=None, ip=None) None #
Synchronizes all threads within a CTA.
- cutlass.cute.arch.sync_warp(
- mask: cutlass.cute.typing.Int = 4294967295,
- *,
- loc=None,
- ip=None,
Performs a warp-wide sync with an optional mask.
- cutlass.cute.arch.fence_acq_rel_cta(*, loc=None, ip=None) None #
Fence operation with acquire-release semantics.
See the PTX documentation.
- cutlass.cute.arch.fence_acq_rel_cluster(*, loc=None, ip=None) None #
Fence operation with acquire-release semantics.
See the PTX documentation.
- cutlass.cute.arch.fence_acq_rel_gpu(*, loc=None, ip=None) None #
Fence operation with acquire-release semantics.
See the PTX documentation.
- cutlass.cute.arch.fence_acq_rel_sys(*, loc=None, ip=None) None #
Fence operation with acquire-release semantics.
See the PTX documentation.
- cutlass.cute.arch.cp_async_commit_group(*, loc=None, ip=None) None #
Commits all prior initiated but uncommitted cp.async instructions.
See the PTX documentation.
- cutlass.cute.arch.cp_async_wait_group(n, *, loc=None, ip=None) None #
Waits till only a specified numbers of cp.async groups are pending.
See the PTX documentation.
- cutlass.cute.arch.cp_async_bulk_commit_group(*, loc=None, ip=None) None #
Commits all prior initiated but uncommitted cp.async.bulk instructions.
See the PTX documentation.
- cutlass.cute.arch.cp_async_bulk_wait_group(
- group,
- *,
- read=None,
- loc=None,
- ip=None,
Waits till only a specified numbers of cp.async.bulk groups are pending.
See the PTX documentation.
- cutlass.cute.arch.cluster_wait(*, loc=None, ip=None) None #
A cluster-wide wait operation.
- cutlass.cute.arch.cluster_arrive(*, aligned=None, loc=None, ip=None) None #
A cluster-wide arrive operation.
- cutlass.cute.arch.cluster_arrive_relaxed(*, aligned=None, loc=None, ip=None) None #
A cluster-wide arrive operation with relaxed semantics.
- cutlass.cute.arch.fence_proxy(
- kind: cutlass._mlir.dialects.nvvm.ProxyKind,
- *,
- space: cutlass._mlir.dialects.nvvm.SharedSpace | None = None,
- use_intrinsic=None,
- loc=None,
- ip=None,
- cutlass.cute.arch.vote_ballot_sync(
- pred: cutlass.cute.typing.Boolean,
- mask: cutlass.cute.typing.Int = 4294967295,
- *,
- loc=None,
- ip=None,
Performs a ballot operation across the warp.
- cutlass.cute.arch.popc(
- value: cutlass.cute.typing.Numeric,
- *,
- loc=None,
- ip=None,
Performs a population count operation.
- cutlass.cute.arch.fmax(
- a: float | cutlass.cute.typing.Float32,
- b: float | cutlass.cute.typing.Float32,
- *,
- loc=None,
- ip=None,
- cutlass.cute.arch.rcp_approx(
- a: float | cutlass.cute.typing.Float32,
- *,
- loc=None,
- ip=None,
- cutlass.cute.arch.exp2(
- a: float | cutlass.cute.typing.Float32,
- *,
- loc=None,
- ip=None,
- cutlass.cute.arch.alloc_smem(
- element_type: Type[cutlass.cute.typing.Numeric],
- size_in_elems: int,
- alignment: None | int = None,
- *,
- loc=None,
- ip=None,
Statically allocates SMEM.
- Parameters:
element_type (Type[Numeric]) – The pointee type of the pointer.
size_in_elems (int) – The size of the allocation in terms of number of elements of the pointee type
alignment (int) – An optional pointer alignment for the allocation
- Returns:
A pointer to the start of the allocation
- Return type:
Pointer
- cutlass.cute.arch.get_dyn_smem(
- element_type: Type[cutlass.cute.typing.Numeric],
- alignment: None | int = None,
- *,
- loc=None,
- ip=None,
Retrieves a pointer to a dynamic SMEM allocation.
- Parameters:
element_type (Type[Numeric]) – The pointee type of the pointer.
alignment (int) – An optional pointer alignment, the result pointer is offset appropriately
- Returns:
A pointer to the start of the dynamic SMEM allocation with a correct alignement
- Return type:
Pointer
- cutlass.cute.arch.retrieve_tmem_ptr(
- element_type: Type[cutlass.cute.typing.Numeric],
- alignment: int,
- ptr_to_buffer_holding_addr: cutlass.cute.typing.Pointer,
- *,
- loc=None,
- ip=None,
Retrieves a pointer to TMEM with the provided element type and alignment.
- Parameters:
element_type (Type[Numeric]) – The pointee type of the pointer.
alignment (int) – The alignment of the result pointer
ptr_to_buffer_holding_addr (Pointer) – A pointer to a SMEM buffer holding the TMEM address of the start of the allocation allocation
- Returns:
A pointer to TMEM
- Return type:
Pointer
- cutlass.cute.arch.alloc_tmem(
- num_columns: cutlass.cute.typing.Int,
- smem_ptr_to_write_address: cutlass.cute.typing.Pointer,
- is_two_cta=None,
- *,
- loc=None,
- ip=None,
Allocates TMEM.
- Parameters:
num_columns (Int) – The number of TMEM columns to allocate
smem_ptr_to_write_address (Pointer) – A pointer to a SMEM buffer where the TMEM address is written to
is_two_cta – Optional boolean parameter for 2-CTA MMAs
- cutlass.cute.arch.relinquish_tmem_alloc_permit(
- is_two_cta=None,
- *,
- loc=None,
- ip=None,
Relinquishes the right to allocate TMEM so that other CTAs potentially in a different grid can allocate.
- cutlass.cute.arch.dealloc_tmem(
- tmem_ptr: cutlass.cute.typing.Pointer,
- num_columns: cutlass.cute.typing.Int,
- is_two_cta=None,
- *,
- loc=None,
- ip=None,
Deallocates TMEM using the provided pointer and number of columns.
- Parameters:
tmem_ptr (Pointer) – A pointer to the TMEM allocation to de-allocate
num_columns (Int) – The number of columns in the TMEM allocation
is_two_cta – Optional boolean parameter for 2-CTA MMAs