cutlass.utils#

class cutlass.utils.WorkTileInfo(
tile_idx: cutlass.cute.typing.Coord,
is_valid_tile: cutlass.cutlass_dsl.Boolean,
)#

Bases: object

A class to represent information about a work tile.

Variables:
  • tile_idx – The index of the tile.

  • is_valid_tile – Whether the tile is valid.

__init__(
tile_idx: cutlass.cute.typing.Coord,
is_valid_tile: cutlass.cutlass_dsl.Boolean,
)#
property is_valid_tile: cutlass.cutlass_dsl.Boolean#

Check whether the latest tile returned by the scheduler is valid. Any scheduling request made after all tasks have completed returns an invalid tile.

Returns:

The validity of the tile.

Return type:

Boolean

property tile_idx: cutlass.cute.typing.Coord#

Get the index of the tile.

Returns:

The index of the tile.

Return type:

cute.Coord

class cutlass.utils.PersistentTileSchedulerParams(
problem_shape_ntile_mnl: cutlass.cute.typing.Shape,
cluster_shape_mnk: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
)#

Bases: object

A class to represent parameters for a persistent tile scheduler.

This class is designed to manage and compute the layout of clusters and tiles in a batched gemm problem.

Variables:
  • cluster_shape_mn – Shape of the cluster in (m, n) dimensions (K dimension cta count must be 1).

  • problem_layout_ncluster_mnl – Layout of the problem in terms of number of clusters in (m, n, l) dimensions.

__init__(
problem_shape_ntile_mnl: cutlass.cute.typing.Shape,
cluster_shape_mnk: cutlass.cute.typing.Shape,
*,
loc=None,
ip=None,
)#

Initializes the PersistentTileSchedulerParams with the given parameters.

Parameters:
  • problem_shape_ntile_mnl (cute.Shape) – The shape of the problem in terms of the number of CTAs (Cooperative Thread Arrays) in (m, n, l) dimensions.

  • cluster_shape_mnk (cute.Shape) – The shape of the cluster in (m, n, k) dimensions. The k extent must be 1.

Raises:

ValueError – If cluster_shape_k is not 1.

get_grid_shape(
max_active_clusters: cutlass.cutlass_dsl.Int32,
*,
loc=None,
ip=None,
) Tuple[cutlass.cutlass_dsl.Integer, cutlass.cutlass_dsl.Integer, cutlass.cutlass_dsl.Integer]#

Computes the grid shape based on the maximum active clusters allowed.

Parameters:

max_active_clusters (Int32) – The maximum number of active clusters that can run in one wave.

Returns:

A tuple containing the grid shape in (m, n, persistent_clusters):

  • m – self.cluster_shape_m.

  • n – self.cluster_shape_n.

  • persistent_clusters – Number of persistent clusters that can run.
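The grid-shape computation can be sketched in plain Python. This is a host-side model under stated assumptions, not the library implementation: the function name `persistent_grid_shape` is hypothetical, the cluster count is assumed to come from ceil-dividing the CTA tile counts by the cluster shape, and the number of persistent clusters is assumed to be the smaller of `max_active_clusters` and the total cluster count.

```python
def persistent_grid_shape(problem_shape_ntile_mnl, cluster_shape_mnk, max_active_clusters):
    """Hypothetical model of a persistent-kernel grid shape (not the CUTLASS code)."""
    cluster_m, cluster_n, cluster_k = cluster_shape_mnk
    if cluster_k != 1:
        # Mirrors the documented ValueError for a non-unit K cluster extent.
        raise ValueError("cluster_shape_k must be 1")

    ntile_m, ntile_n, ntile_l = problem_shape_ntile_mnl
    # Assumption: clusters tile the CTA grid with ceil-division along M and N.
    ncluster_m = -(-ntile_m // cluster_m)
    ncluster_n = -(-ntile_n // cluster_n)
    total_clusters = ncluster_m * ncluster_n * ntile_l

    # Assumption: launch no more clusters than fit in one wave or than exist.
    persistent_clusters = min(max_active_clusters, total_clusters)
    return (cluster_m, cluster_n, persistent_clusters)


# 8 x 6 CTA tiles in 2 batches, clusters of 2 x 1 CTAs, at most 16 clusters per wave.
grid = persistent_grid_shape((8, 6, 2), (2, 1, 1), max_active_clusters=16)
small_grid = persistent_grid_shape((2, 2, 1), (2, 1, 1), max_active_clusters=16)
print(grid, small_grid)
```

In this model a large problem saturates the wave (the third entry equals `max_active_clusters`), while a small problem launches only as many clusters as it has work for.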

class cutlass.utils.StaticPersistentTileScheduler(
params: PersistentTileSchedulerParams,
num_persistent_clusters: cutlass.cutlass_dsl.Int32,
current_work_linear_idx: cutlass.cutlass_dsl.Int32,
cta_id_in_cluster: cutlass.cute.typing.Coord,
num_tiles_executed: cutlass.cutlass_dsl.Int32,
)#

Bases: object

A scheduler for static persistent tile execution in CUTLASS/CuTe kernels.

Variables:
  • params – Tile schedule related params, including cluster shape and problem_layout_ncluster_mnl

  • num_persistent_clusters – Number of persistent clusters that can be launched

  • cta_id_in_cluster – ID of the CTA within its cluster

  • _num_tiles_executed – Counter for executed tiles

  • _current_work_linear_idx – Current cluster index

__init__(
params: PersistentTileSchedulerParams,
num_persistent_clusters: cutlass.cutlass_dsl.Int32,
current_work_linear_idx: cutlass.cutlass_dsl.Int32,
cta_id_in_cluster: cutlass.cute.typing.Coord,
num_tiles_executed: cutlass.cutlass_dsl.Int32,
)#

Initializes the StaticPersistentTileScheduler with the given parameters.

Parameters:
  • params (PersistentTileSchedulerParams) – Tile schedule related params, including cluster shape and problem_layout_ncluster_mnl.

  • num_persistent_clusters (Int32) – Number of persistent clusters that can be launched.

  • current_work_linear_idx (Int32) – Current cluster index.

  • cta_id_in_cluster (cute.Coord) – ID of the CTA within its cluster.

  • num_tiles_executed (Int32) – Counter for executed tiles.

create(
params: PersistentTileSchedulerParams,
block_idx: Tuple[cutlass.cutlass_dsl.Integer, cutlass.cutlass_dsl.Integer, cutlass.cutlass_dsl.Integer],
grid_dim: Tuple[cutlass.cutlass_dsl.Integer, cutlass.cutlass_dsl.Integer, cutlass.cutlass_dsl.Integer],
*,
loc=None,
ip=None,
)#

Initialize the static persistent tile scheduler.

Parameters:
  • params (PersistentTileSchedulerParams) – Parameters for the persistent tile scheduler.

  • block_idx (Tuple[Integer, Integer, Integer]) – The 3d block index in the format (bidx, bidy, bidz).

  • grid_dim (Tuple[Integer, Integer, Integer]) – The 3d grid dimensions for kernel launch.

Returns:

A StaticPersistentTileScheduler object.

Return type:

StaticPersistentTileScheduler

static get_grid_shape(
params: PersistentTileSchedulerParams,
max_active_clusters: cutlass.cutlass_dsl.Int32,
*,
loc=None,
ip=None,
) Tuple[cutlass.cutlass_dsl.Integer, cutlass.cutlass_dsl.Integer, cutlass.cutlass_dsl.Integer]#

Calculates the grid shape to be launched on GPU using problem shape, threadblock shape, and active cluster size.

Parameters:
  • params (PersistentTileSchedulerParams) – Parameters for grid shape calculation.

  • max_active_clusters (Int32) – Maximum active clusters allowed.

Returns:

The calculated 3d grid shape.

Return type:

Tuple[Integer, Integer, Integer]

_get_current_work_for_linear_idx(
current_work_linear_idx: cutlass.cutlass_dsl.Int32,
*,
loc=None,
ip=None,
) WorkTileInfo#

Compute current tile coord given current_work_linear_idx and cta_id_in_cluster.

Parameters:

current_work_linear_idx (Int32) – The linear index of the current work.

Returns:

An object containing information about the current tile coordinates and validity status.

Return type:

WorkTileInfo

get_current_work(
*,
loc=None,
ip=None,
) WorkTileInfo#
initial_work_tile_info(
*,
loc=None,
ip=None,
) WorkTileInfo#
advance_to_next_work(
*,
advance_count: int = 1,
loc=None,
ip=None,
)#
property num_tiles_executed: cutlass.cutlass_dsl.Int32#
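The way a static persistent scheduler hands out work can be illustrated with a small host-side model. This is only a sketch: the function name `tiles_for_cluster` is hypothetical, and it assumes that each persistent cluster starts at its own linear cluster index, strides by `num_persistent_clusters` on every advance, treats a tile as valid while the linear index is inside the problem, and maps the linear index to (m, n, l) with m varying fastest.

```python
def tiles_for_cluster(cluster_id, num_persistent_clusters, ncluster_mnl):
    """Yield the (m, n, l) cluster coordinates one persistent cluster would visit."""
    nm, nn, nl = ncluster_mnl
    total = nm * nn * nl
    linear_idx = cluster_id  # assumed initial work index for this cluster
    while linear_idx < total:  # plays the role of is_valid_tile
        # Assumed delinearization order: m fastest, then n, then l.
        m = linear_idx % nm
        n = (linear_idx // nm) % nn
        l = linear_idx // (nm * nn)
        yield (m, n, l)
        # Models advance_to_next_work() with advance_count=1.
        linear_idx += num_persistent_clusters


# Three persistent clusters share a problem of 2 x 2 x 2 = 8 cluster tiles.
schedule = {cid: list(tiles_for_cluster(cid, 3, (2, 2, 2))) for cid in range(3)}
for cid, tiles in schedule.items():
    print(cid, tiles)
```

Under these assumptions every tile is visited exactly once, and a cluster stops as soon as its next index falls outside the problem, which corresponds to the scheduler returning an invalid tile.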
class cutlass.utils.TensorMapUpdateMode(value)#

Bases: Enum

Enum class defining tensor map update modes.

Modes:

  • GMEM – Update the tensormap in global memory.

  • SMEM – Load the tensormap from global memory to shared memory, update it in shared memory, then store it back to global memory.

GMEM = 1#
SMEM = 2#
class cutlass.utils.TensorMapManager(
tensormap_update_mode: TensorMapUpdateMode,
bytes_per_tensormap: int,
)#

Bases: object

Manages TensorMap operations including initialization and updates. Provides utilities to convert tensormap pointers across different memory spaces.

tensormap_update_mode: TensorMapUpdateMode#
bytes_per_tensormap: int#
get_tensormap_ptr(
ptr: cutlass.cute.typing.Pointer,
address_space=cutlass._mlir.dialects.cute.AddressSpace.gmem,
) cutlass.cute.typing.Pointer#
init_tensormap_from_atom(
copy_atom: CopyAtom,
dst_ptr: cutlass.cute.typing.Pointer,
warp_id: int,
) None#
fence_tensormap_initialization() None#
fence_tensormap_update(
tensormap_ptr: cutlass.cute.typing.Pointer,
) None#
update_tensormap(
tensor_gmem: Tuple[cutlass.cute.typing.Tensor, ...],
tma_copy_atom: Tuple[CopyAtom, ...],
tensormap_gmem_ptr: Tuple[cutlass.cute.typing.Pointer, ...],
warp_id: int,
tensormap_smem_ptr: Tuple[cutlass.cute.typing.Pointer, ...],
) None#
__init__(
tensormap_update_mode: TensorMapUpdateMode,
bytes_per_tensormap: int,
) None#
class cutlass.utils.GroupSearchResult(
group_idx: cutlass.cutlass_dsl.Int32,
cta_tile_idx_m: cutlass.cutlass_dsl.Int32,
cta_tile_idx_n: cutlass.cutlass_dsl.Int32,
problem_shape_m: cutlass.cutlass_dsl.Int32,
problem_shape_n: cutlass.cutlass_dsl.Int32,
problem_shape_k: cutlass.cutlass_dsl.Int32,
cta_tile_count_k: cutlass.cutlass_dsl.Int32,
)#

Bases: object

The result of the group search for grouped gemm.

Parameters:
  • group_idx (Int32) – The result group index

  • cta_tile_idx_m (Int32) – CTA tile index along M dimension after rasterization

  • cta_tile_idx_n (Int32) – CTA tile index along N dimension after rasterization

  • problem_shape_m (Int32) – The M dimension of the gemm problem

  • problem_shape_n (Int32) – The N dimension of the gemm problem

  • problem_shape_k (Int32) – The K dimension of the gemm problem

  • cta_tile_count_k (Int32) – Number of tiles along K dimension

__init__(
group_idx: cutlass.cutlass_dsl.Int32,
cta_tile_idx_m: cutlass.cutlass_dsl.Int32,
cta_tile_idx_n: cutlass.cutlass_dsl.Int32,
problem_shape_m: cutlass.cutlass_dsl.Int32,
problem_shape_n: cutlass.cutlass_dsl.Int32,
problem_shape_k: cutlass.cutlass_dsl.Int32,
cta_tile_count_k: cutlass.cutlass_dsl.Int32,
) None#
class cutlass.utils.GroupedGemmGroupSearchState(
start_group_idx: cutlass.cutlass_dsl.Int32,
tile_count_prev_group: cutlass.cutlass_dsl.Int32,
tile_count_searched: cutlass.cutlass_dsl.Int32,
)#

Bases: object

The state of group index search for grouped gemm.

The state will be initialized once and updated in every round of group index search.

Parameters:
  • start_group_idx (Int32) – The group idx to start the search with

  • tile_count_prev_group (Int32) – Number of tiles before the matched group

  • tile_count_searched (Int32) – Number of tiles we have searched. When the matched group is found, it records the number of tiles including the matched group

__init__(
start_group_idx: cutlass.cutlass_dsl.Int32,
tile_count_prev_group: cutlass.cutlass_dsl.Int32,
tile_count_searched: cutlass.cutlass_dsl.Int32,
) None#
cutlass.utils.create_initial_search_state() GroupedGemmGroupSearchState#

Create an initial search state for grouped gemm.

Returns:

A new search state with initial values

Return type:

GroupedGemmGroupSearchState

class cutlass.utils.GroupedGemmTileSchedulerHelper(
group_count: int,
tile_sched_params: PersistentTileSchedulerParams,
cluster_tile_shape_mnk: tuple[int, int, int],
search_state: GroupedGemmGroupSearchState,
)#

Bases: object

A helper to translate the raw block index (x, y, z) from tile scheduler to real CTA tile index for grouped gemm.

Parameters:
  • group_count (int) – Number of groups in current grouped gemm problem

  • tile_sched_params (PersistentTileSchedulerParams) – Parameter used to create the tile scheduler this helper works with

  • cluster_tile_shape_mnk (tuple[int, int, int]) – The shape of cluster tile as (m, n, k)

  • search_state (GroupedGemmGroupSearchState) – The initial search state

__init__(
group_count: int,
tile_sched_params: PersistentTileSchedulerParams,
cluster_tile_shape_mnk: tuple[int, int, int],
search_state: GroupedGemmGroupSearchState,
) None#
delinearize_z(
cta_tile_coord: tuple,
problem_shape_mnkl: cutlass.cute.typing.Tensor,
) GroupSearchResult#

Delinearize the linear z index and return GroupSearchResult.

This function should be used by warps that need to know the CTA tile index on M and N dimensions.

Parameters:
  • cta_tile_coord (tuple of Int32) – The raw CTA coordinate from tile scheduler

  • problem_shape_mnkl (cute.Tensor) – Tensor containing gemm problem size (M, N, K, L) for each group

Returns:

The search result containing group index and tile coordinates

Return type:

GroupSearchResult

search_cluster_tile_count_k(
cta_tile_coord: tuple,
problem_shape_mnkl: cutlass.cute.typing.Tensor,
) Tuple[cutlass.cutlass_dsl.Int32, cutlass.cutlass_dsl.Int32]#

Search the matched group for given linear index and compute the number of tiles along K dimension for the matched group.

This function should be used by warps that are only interested in the number of tiles along K dimension.

Parameters:
  • cta_tile_coord (tuple of Int32) – The raw CTA coordinate from tile scheduler

  • problem_shape_mnkl (cute.Tensor) – Tensor containing gemm problem size (M, N, K, L) for all groups

Returns:

A tuple containing cluster count along K dimension and the group index

Return type:

Tuple[Int32, Int32]

_prefix_sum(
value_per_thread: cutlass.cutlass_dsl.Int32,
) cutlass.cutlass_dsl.Int32#

Perform prefix sum within a full warp.

Parameters:

value_per_thread (Int32) – The value for this thread to contribute to the prefix sum

Returns:

The prefix sum result for this thread

Return type:

Int32
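A warp-wide prefix sum is commonly built from shuffle-up steps with doubling offsets. The following plain-Python simulation shows that pattern for 32 lanes. It is a sketch that assumes an inclusive scan and models each shuffle as a list lookup. The function name is hypothetical and no CUTLASS calls are used.

```python
WARP_SIZE = 32


def warp_inclusive_prefix_sum(values):
    """Simulate a shuffle-up based inclusive scan across the lanes of one warp."""
    assert len(values) == WARP_SIZE
    sums = list(values)
    offset = 1
    while offset < WARP_SIZE:
        # Each lane reads the partial sum held by the lane `offset` positions below it.
        # Lanes without such a neighbour contribute nothing in this step.
        shuffled = [sums[lane - offset] if lane >= offset else 0
                    for lane in range(WARP_SIZE)]
        sums = [own + other for own, other in zip(sums, shuffled)]
        offset <<= 1
    return sums


# Lane i contributes the value i + 1.
result = warp_inclusive_prefix_sum(list(range(1, WARP_SIZE + 1)))
print(result[0], result[-1])
```

After log2(32) = 5 steps, lane i holds the sum of the values contributed by lanes 0 through i.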

_get_problem_for_group(
problem_shape_mnkl: cutlass.cute.typing.Tensor,
group_idx: cutlass.cutlass_dsl.Int32,
) cutlass.cute.typing.Tensor#

Load gemm problem (m,n,k,l) for the specified group from global memory to register.

Parameters:
  • problem_shape_mnkl (cute.Tensor) – Tensor in global memory with layout (group_count, 4):(4, 1)

  • group_idx (Int32) – The index of the group to load

Returns:

The problem shape tensor for the specified group

Return type:

cute.Tensor

_get_cluster_tile_count_mn(
problem_shape: cutlass.cute.typing.Tensor,
) cutlass.cutlass_dsl.Int32#

Compute total cluster count.

Parameters:

problem_shape (cute.Tensor) – Tensor containing problem shape (m, n, k, l)

Returns:

The total cluster tile count for M and N dimensions

Return type:

Int32
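As a rough illustration, the cluster tile count over M and N can be modelled with ceil-division. This sketch assumes that partial tiles at the edges count as full tiles. The function name and the example sizes are hypothetical.

```python
def cluster_tile_count_mn(problem_shape_mnkl, cluster_tile_shape_mnk):
    """Count cluster tiles needed to cover the M x N extent of one problem."""
    m, n, _k, _l = problem_shape_mnkl
    tile_m, tile_n, _tile_k = cluster_tile_shape_mnk
    # Assumption: ceil-division, so a partially filled edge tile still counts.
    count_m = -(-m // tile_m)
    count_n = -(-n // tile_n)
    return count_m * count_n


# M = 300 needs 2 tiles of 256, N = 512 needs 4 tiles of 128.
count = cluster_tile_count_mn((300, 512, 64, 1), (256, 128, 64))
print(count)
```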

_compute_cta_tile_coord(
cluster_tile_idx: cutlass.cutlass_dsl.Int32,
cta_tile_coord_in_cluster: tuple,
cluster_tile_count_m: cutlass.cutlass_dsl.Int32,
cluster_tile_count_n: cutlass.cutlass_dsl.Int32,
) tuple#

Compute CTA tile indices along M and N dimensions based on the linear index within a group.

It uses the AlongM mode to decompose the linear index onto M and N dimensions.

Parameters:
  • cluster_tile_idx (Int32) – The linear index within a group

  • cta_tile_coord_in_cluster (tuple of Int32) – CTA indices along M and N dimensions within a cluster

  • cluster_tile_count_m (Int32) – The number of clusters along M dimension of the matched group

  • cluster_tile_count_n (Int32) – The number of clusters along N dimension of the matched group

Returns:

A tuple containing CTA tile indices along M and N dimensions

Return type:

tuple of (Int32, Int32)
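The decomposition described for `_compute_cta_tile_coord` can be sketched in plain Python. This assumes that "AlongM" means the M index varies fastest as the linear index increases. The explicit `cluster_shape_mn` argument is a hypothetical stand-in for the cluster shape that the helper would take from its scheduler parameters.

```python
def compute_cta_tile_coord(cluster_tile_idx, cta_tile_coord_in_cluster,
                           cluster_tile_count_m, cluster_tile_count_n,
                           cluster_shape_mn):
    """Map a linear cluster index within a group to CTA tile indices (m, n)."""
    # Assumed AlongM order: consecutive linear indices walk along M first.
    cluster_idx_m = cluster_tile_idx % cluster_tile_count_m
    cluster_idx_n = cluster_tile_idx // cluster_tile_count_m
    assert cluster_idx_n < cluster_tile_count_n, "linear index outside the group"

    # Scale the cluster index to CTA units and add the CTA offset inside the cluster.
    cta_m = cluster_idx_m * cluster_shape_mn[0] + cta_tile_coord_in_cluster[0]
    cta_n = cluster_idx_n * cluster_shape_mn[1] + cta_tile_coord_in_cluster[1]
    return (cta_m, cta_n)


# Group with 3 x 4 clusters of 2 x 1 CTAs; second CTA of cluster number 5.
coord = compute_cta_tile_coord(5, (1, 0), 3, 4, (2, 1))
print(coord)
```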

Search which group the linear index belongs to.

Parameters:
  • linear_idx (Int32) – The linear index to be decomposed

  • problem_shape_mnkl (cute.Tensor) – Tensor containing gemm problem size (M, N, K, L) for all groups

  • init_group_idx (Int32) – The group idx to start the search with

  • init_tile_count_searched (Int32) – The number of tiles we have searched

Returns:

The updated search state

Return type:

GroupedGemmGroupSearchState
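The group search can be pictured as a running sum over per-group tile counts. The sketch below is a sequential host-side model, not the warp-level implementation: `SearchState` and `group_search` are hypothetical names, the per-group tile counts are passed in directly instead of being derived from `problem_shape_mnkl`, and each search is assumed to resume from the previously matched group with linear indices that never decrease between calls.

```python
from dataclasses import dataclass


@dataclass
class SearchState:
    start_group_idx: int        # group to start the next search with
    tile_count_prev_group: int  # tiles before the matched group
    tile_count_searched: int    # tiles up to and including the matched group


def group_search(linear_idx, tile_counts, state):
    """Find the group that owns linear_idx, resuming from a previous state."""
    group_idx = state.start_group_idx
    prev = state.tile_count_prev_group
    while group_idx < len(tile_counts):
        searched = prev + tile_counts[group_idx]
        if linear_idx < searched:
            return SearchState(group_idx, prev, searched)
        prev = searched
        group_idx += 1
    raise ValueError("linear_idx is beyond the last group")


tile_counts = [4, 2, 6]        # hypothetical tiles per group
state = SearchState(0, 0, 0)   # assumed initial state
matches = []
for linear_idx in (0, 5, 6, 11):
    state = group_search(linear_idx, tile_counts, state)
    # The index inside the matched group is the offset from the tiles before it.
    matches.append((state.start_group_idx, linear_idx - state.tile_count_prev_group))
print(matches)
```

Carrying the state between calls means a later search does not rescan groups that earlier indices have already passed.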

_group_search_and_load_problem_shape(
linear_idx: cutlass.cutlass_dsl.Int32,
problem_shape_mnkl: cutlass.cute.typing.Tensor,
start_group_idx: cutlass.cutlass_dsl.Int32,
tile_count_searched: cutlass.cutlass_dsl.Int32,
) Tuple[cutlass.cutlass_dsl.Int32, cutlass.cute.typing.Tensor]#

Perform group search and load problem shape for the matched group.

Parameters:
  • linear_idx (Int32) – The linear index to be decomposed

  • problem_shape_mnkl (cute.Tensor) – Tensor containing gemm problem size (M, N, K, L) for all groups

  • start_group_idx (Int32) – The group idx to start the search with

  • tile_count_searched (Int32) – The number of tiles we have searched

Returns:

A tuple containing the final group index and the problem shape tensor

Return type:

Tuple[Int32, cute.Tensor]

class cutlass.utils.HardwareInfo(device_id: int = 0)#

Bases: object

device_id: CUDA device ID to get the hardware info.

__init__(device_id: int = 0)#
get_max_active_clusters(cluster_size: int) int#
get_l2_cache_size_in_bytes() int#
get_device_multiprocessor_count() int#
_checkCudaErrors(result) None#
_cudaGetErrorEnum(error) str#
_cuda_driver_version_ge(major: int, minor: int) bool#
_cuda_driver_version_lt(major: int, minor: int) bool#
_empty_kernel()#
_host_function()#
_get_device_function() None#
class cutlass.utils.Agent(value)#

Bases: Enum

Agent indicates what is participating in the pipeline synchronization.

Thread = 1#
ThreadBlock = 2#
ThreadBlockCluster = 3#
class cutlass.utils.CooperativeGroup(
agent: Agent,
size: int = 1,
alignment: int = 1,
)#

Bases: object

CooperativeGroup contains size and alignment restrictions for an Agent.

__init__(
agent: Agent,
size: int = 1,
alignment: int = 1,
)#
class cutlass.utils.PipelineUserType(value)#

Bases: Enum

An enumeration.

Producer = 1#
Consumer = 2#
class cutlass.utils.PipelineState(stages: int, count, index, phase)#

Bases: object

Pipeline state contains an index and phase bit corresponding to the current position in the circular buffer.

__init__(stages: int, count, index, phase)#
clone() PipelineState#
property index: cutlass.cutlass_dsl.Int32#
property count: cutlass.cutlass_dsl.Int32#
property stages: int#
property phase: cutlass.cutlass_dsl.Int32#
reset_count()#
advance()#
reverse()#
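The relationship between index, phase, and count can be illustrated with a small stand-alone model. This is a sketch, not the class above: `RingState` is a hypothetical name, and it assumes that `advance()` increments the index and the count, wraps the index at `stages`, and flips the phase bit on every wrap.

```python
class RingState:
    """Toy model of a position in a circular buffer of pipeline stages."""

    def __init__(self, stages):
        self.stages = stages
        self.count = 0   # total number of advances
        self.index = 0   # current stage
        self.phase = 0   # flips each time the index wraps around

    def advance(self):
        self.index += 1
        self.count += 1
        if self.index == self.stages:
            self.index = 0
            self.phase ^= 1


state = RingState(stages=3)
trace = []
for _ in range(7):
    trace.append((state.index, state.phase))
    state.advance()
print(trace)
```

The phase bit lets a waiter tell apart two visits to the same stage: the same index paired with a different phase belongs to a different trip around the buffer.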
class cutlass.utils.PipelineAsync(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
)#

Bases: object

PipelineAsync is a generic pipeline class where both the producer and consumer are AsyncThreads. It also serves as a base class for specialized pipeline classes.

This class implements a producer-consumer pipeline pattern where both sides operate asynchronously. The pipeline maintains synchronization state using barrier objects to coordinate between producer and consumer threads.

The state transitions of one pipeline entry (mbarrier) can be represented as:

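Table 1 Pipeline State Transitions#

+-----------+-------+-----------+----------+----------+-----------+
| Barrier   | State | p.acquire | p.commit | c.wait   | c.release |
+-----------+-------+-----------+----------+----------+-----------+
| empty_bar | empty | <Return>  | n/a      | n/a      |           |
| empty_bar | wait  | <Block>   | n/a      | n/a      | -> empty  |
| full_bar  | wait  | n/a       | -> full  | <Block>  | n/a       |
| full_bar  | full  | n/a       |          | <Return> | n/a       |
+-----------+-------+-----------+----------+----------+-----------+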

Where:

  • p: producer

  • c: consumer

  • <Block>: The action blocks until the other side moves the barrier to a state that allows it to proceed, e.g. p.acquire() blocks until c.release() moves empty_bar to the empty state.

Array of mbarriers as circular buffer:

     Advance Direction
   <-------------------

    Producer   Consumer
        |         ^
        V         |
   +-----------------+
 --|X|X|W|D|D|D|D|R|X|<-.
/  +-----------------+   \
|                        |
`------------------------'

Where:

  • X: Empty buffer (initial state)

  • W: Producer writing (producer is waiting for buffer to be empty)

  • D: Data ready (producer has written data to buffer)

  • R: Consumer reading (consumer is consuming data from buffer)
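The acquire/commit/wait/release handshake over a circular buffer can be imitated on the host with ordinary threads. This sketch is an analogy only and uses no CUTLASS calls: one pair of `threading.Semaphore` objects per stage stands in for the empty and full mbarriers, and all names are hypothetical.

```python
import threading

NUM_STAGES = 3
NUM_ITEMS = 10

# Per-stage stand-ins for the barrier pair. Every stage starts out empty.
empty_bar = [threading.Semaphore(1) for _ in range(NUM_STAGES)]
full_bar = [threading.Semaphore(0) for _ in range(NUM_STAGES)]
buffers = [None] * NUM_STAGES
consumed = []


def producer():
    for i in range(NUM_ITEMS):
        stage = i % NUM_STAGES          # advance through the circular buffer
        empty_bar[stage].acquire()      # p.acquire: block until the stage is empty
        buffers[stage] = i * i          # write data into the stage
        full_bar[stage].release()       # p.commit: stage becomes full


def consumer():
    for i in range(NUM_ITEMS):
        stage = i % NUM_STAGES
        full_bar[stage].acquire()       # c.wait: block until the stage is full
        consumed.append(buffers[stage])  # read data from the stage
        empty_bar[stage].release()      # c.release: stage becomes empty again


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(consumed)
```

The producer can run at most `NUM_STAGES` items ahead of the consumer, which is the property the multi-stage buffer is meant to provide.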

Example:

# Create pipeline with 5 stages
pipeline = PipelineAsync.create(
    barrier_storage=smem_ptr,       # smem pointer for array of mbarriers in shared memory
    num_stages=5,                   # number of pipeline stages
    producer_group=producer_warp,
    consumer_group=consumer_warp
)

# Producer side
producer = pipeline.make_pipeline_producer(producer_warp)
for i in range(num_iterations):
    producer.acquire()  # Wait for buffer to be empty
    # Write data to pipeline buffer
    producer.commit()   # Signal buffer is full
    producer.advance()  # Move index to next stage

# Consumer side
consumer = pipeline.make_pipeline_consumer(consumer_warp)
for i in range(num_iterations):
    consumer.wait()     # Wait for buffer to be full
    # Read data from pipeline buffer
    consumer.release()  # Signal buffer is empty
    consumer.advance()  # Move index to next stage
sync_object_array_full: SyncObjectArray#
sync_object_array_empty: SyncObjectArray#
num_stages: int#
producer_mask: cutlass.cutlass_dsl.Int32 | None#
consumer_mask: cutlass.cutlass_dsl.Int32 | None#
static _make_sync_object_array(
barrier_storage: cutlass.cute.typing.Pointer,
num_stages: int,
agent: tuple[PipelineOp, CooperativeGroup],
tx_count: int = 0,
) SyncObjectArray#

Returns a SyncObjectArray corresponding to an agent’s PipelineOp.

static create(
barrier_storage: cutlass.cute.typing.Pointer,
num_stages: int,
producer_group: CooperativeGroup,
consumer_group: CooperativeGroup,
producer_mask: cutlass.cutlass_dsl.Int32 | None = None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None = None,
)#

Creates and initializes a new PipelineAsync instance.

This helper function computes necessary attributes and returns an instance of PipelineAsync with the specified configuration for producer and consumer synchronization.

Parameters:
  • barrier_storage (cute.Pointer) – Pointer to the shared memory address for this pipeline’s mbarriers

  • num_stages (int) – Number of buffer stages for this pipeline

  • producer_group (CooperativeGroup) – CooperativeGroup for the producer agent

  • consumer_group (CooperativeGroup) – CooperativeGroup for the consumer agent

  • producer_mask (Int32, optional) – Mask for signaling arrives for the producer agent, defaults to None

  • consumer_mask (Int32, optional) – Mask for signaling arrives for the consumer agent, defaults to None

Returns:

A new PipelineAsync instance

Return type:

PipelineAsync

Raises:

ValueError – If barrier_storage is not a cute.Pointer instance

producer_acquire(
state: PipelineState,
try_acquire_token: cutlass.cutlass_dsl.Boolean | None = None,
)#
producer_try_acquire(
state: PipelineState,
)#
producer_commit(
state: PipelineState,
)#
consumer_wait(
state: PipelineState,
try_wait_token: cutlass.cutlass_dsl.Boolean | None = None,
)#
consumer_try_wait(
state: PipelineState,
)#
consumer_release(
state: PipelineState,
)#
producer_get_barrier(
state: PipelineState,
) cutlass.cute.typing.Pointer#
producer_tail(
state: PipelineState,
)#

Make sure the last used buffer empty signal is visible to producer. Producer tail is usually executed by producer before exit, to avoid dangling mbarrier arrive signals after kernel exit.

Parameters:

state (PipelineState) – The pipeline state that points to next useful buffer

__init__(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
) None#
class cutlass.utils.PipelineTmaAsync(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
is_signalling_thread: cutlass.cutlass_dsl.Boolean,
)#

Bases: PipelineAsync

PipelineTmaAsync is used for TMA producers and AsyncThread consumers (e.g. Hopper mainloops).

is_signalling_thread: cutlass.cutlass_dsl.Boolean#
static init_empty_barrier_arrive_signal(
cta_layout_vmnk: cutlass.cute.typing.Layout,
tidx: cutlass.cutlass_dsl.Int32,
)#

Initialize the empty barrier arrive signal. This function returns the destination CTA rank and a boolean indicating whether the signalling thread is the same as the current thread.

static create(
barrier_storage: cutlass.cute.typing.Pointer,
num_stages: int,
producer_group: CooperativeGroup,
consumer_group: CooperativeGroup,
tx_count: int,
cta_layout_vmnk: cutlass.cute.typing.Layout | None = None,
tidx: cutlass.cutlass_dsl.Int32 | None = None,
)#

This helper function computes any necessary attributes and returns an instance of PipelineTmaAsync.

Parameters:
  • barrier_storage (cute.Pointer) – Pointer to the smem address for this pipeline’s mbarriers

  • num_stages (int) – Number of buffer stages for this pipeline

  • producer_group (CooperativeGroup) – CooperativeGroup for the producer agent

  • consumer_group (CooperativeGroup) – CooperativeGroup for the consumer agent

  • tx_count (int) – Number of bytes expected to be written to the transaction barrier for one stage

  • cta_layout_vmnk (cute.Layout | None) – Layout of the cluster shape

  • tidx (Int32 | None) – Thread index to consumer async threads

producer_acquire(
state: PipelineState,
try_acquire_token: cutlass.cutlass_dsl.Boolean | None = None,
)#

TMA producer acquire conditionally waits on buffer empty and sets the transaction barrier.

producer_commit(
state: PipelineState,
)#

TMA producer commit is a noop since TMA instruction itself updates the transaction count.

consumer_release(
state: PipelineState,
)#

TMA consumer release conditionally signals the empty buffer to the producer.

__init__(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
is_signalling_thread: cutlass.cutlass_dsl.Boolean,
) None#
class cutlass.utils.PipelineTmaUmma(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
is_leader_cta: bool,
cta_group: CtaGroup,
)#

Bases: PipelineAsync

PipelineTmaUmma is used for TMA producers and UMMA consumers (e.g. Blackwell mainloops).

is_leader_cta: bool#
cta_group: CtaGroup#
static _compute_mcast_arrival_mask(
cta_layout_vmnk: cutlass.cute.typing.Layout,
)#

Computes a mask for signaling arrivals to multicasting threadblocks.

static _compute_is_leader_cta(
cta_layout_vmnk: cutlass.cute.typing.Layout,
)#

Computes leader threadblocks for 2CTA kernels. For 1CTA, all threadblocks are leaders.

static create(
barrier_storage: cutlass.cute.typing.Pointer,
num_stages: int,
producer_group: CooperativeGroup,
consumer_group: CooperativeGroup,
tx_count: int,
cta_layout_vmnk: cutlass.cute.typing.Layout | None = None,
)#

This helper function computes any necessary attributes and returns an instance of PipelineTmaUmma.

Parameters:
  • barrier_storage (cute.Pointer) – Pointer to the smem address for this pipeline’s mbarriers

  • num_stages (int) – Number of buffer stages for this pipeline

  • producer_group (CooperativeGroup) – CooperativeGroup for the producer agent

  • consumer_group (CooperativeGroup) – CooperativeGroup for the consumer agent

  • tx_count (int) – Number of bytes expected to be written to the transaction barrier for one stage

  • cta_layout_vmnk (cute.Layout | None) – Layout of the cluster shape

consumer_release(
state: PipelineState,
)#

UMMA consumer release signals that the buffer is empty. cta_group needs to be provided.

producer_acquire(
state: PipelineState,
try_acquire_token: cutlass.cutlass_dsl.Boolean | None = None,
)#

TMA producer acquire conditionally waits on buffer empty and sets the transaction barrier for leader threadblocks.

producer_commit(
state: PipelineState,
)#

TMA producer commit is a NOP. The transaction barrier signals the commit upon completion of the TMA.

__init__(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
is_leader_cta: bool,
cta_group: CtaGroup,
) None#
class cutlass.utils.PipelineTmaMultiConsumersAsync(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
is_leader_cta: bool,
sync_object_array_empty_umma: SyncObjectArray,
sync_object_array_empty_async: SyncObjectArray,
cta_group: CtaGroup,
)#

Bases: PipelineAsync

PipelineTmaMultiConsumersAsync is used for TMA producers and UMMA+Async consumers.

is_leader_cta: bool#
sync_object_array_empty_umma: SyncObjectArray#
sync_object_array_empty_async: SyncObjectArray#
cta_group: CtaGroup#
static create(
barrier_storage: cutlass.cute.typing.Pointer,
num_stages: int,
producer_group: CooperativeGroup,
consumer_group_umma: CooperativeGroup,
consumer_group_async: CooperativeGroup,
tx_count: int,
cta_layout_vmnk: cutlass.cute.typing.Layout | None = None,
)#

This helper function computes any necessary attributes and returns an instance of PipelineTmaMultiConsumersAsync.

Parameters:
  • barrier_storage (cute.Pointer) – Pointer to the smem address for this pipeline’s mbarriers

  • num_stages (int) – Number of buffer stages for this pipeline

  • producer_group (CooperativeGroup) – CooperativeGroup for the producer agent

  • consumer_group_umma (CooperativeGroup) – CooperativeGroup for the UMMA consumer agent

  • consumer_group_async (CooperativeGroup) – CooperativeGroup for the AsyncThread consumer agent

  • tx_count (int) – Number of bytes expected to be written to the transaction barrier for one stage

  • cta_layout_vmnk (cute.Layout | None) – Layout of the cluster shape

producer_acquire(
state: PipelineState,
try_acquire_token: cutlass.cutlass_dsl.Boolean | None = None,
)#

TMA producer acquire waits on buffer empty and sets the transaction barrier for leader threadblocks.

producer_commit(
state: PipelineState,
)#

TMA producer commit is a noop since TMA instruction itself updates the transaction count.

consumer_release(
state: PipelineState,
op_type: PipelineOp,
)#
__init__(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
is_leader_cta: bool,
sync_object_array_empty_umma: SyncObjectArray,
sync_object_array_empty_async: SyncObjectArray,
cta_group: CtaGroup,
) None#
class cutlass.utils.PipelineAsyncUmma(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
cta_group: CtaGroup,
)#

Bases: PipelineAsync

PipelineAsyncUmma is used for AsyncThread producers and UMMA consumers (e.g. Blackwell input fusion pipelines).

cta_group: CtaGroup#
static _compute_leading_cta_rank(cta_v_size)#

Computes the leading CTA rank.

static _compute_is_leader_cta(
cta_layout_vmnk: cutlass.cute.typing.Layout,
)#

Computes leader threadblocks for 2CTA kernels. For 1CTA, all threadblocks are leaders.

static _compute_peer_cta_mask(
cta_layout_vmnk: cutlass.cute.typing.Layout,
)#

Computes a mask for signaling arrivals to multicasting threadblocks.

static create(
barrier_storage: cutlass.cute.typing.Pointer,
num_stages: int,
producer_group: CooperativeGroup,
consumer_group: CooperativeGroup,
cta_layout_vmnk: cutlass.cute.typing.Layout | None = None,
)#

This helper function computes any necessary attributes and returns an instance of PipelineAsyncUmma.

Parameters:
  • barrier_storage (cute.Pointer) – Pointer to the smem address for this pipeline’s mbarriers

  • num_stages (int) – Number of buffer stages for this pipeline

  • producer_group (CooperativeGroup) – CooperativeGroup for the producer agent

  • consumer_group (CooperativeGroup) – CooperativeGroup for the consumer agent

  • cta_layout_vmnk (cute.Layout | None) – Layout of the cluster shape

consumer_release(
state: PipelineState,
)#

UMMA consumer release signals that the buffer is empty. cta_group needs to be provided.

__init__(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
cta_group: CtaGroup,
) None#
class cutlass.utils.PipelineUmmaAsync(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
cta_group: CtaGroup,
)#

Bases: PipelineAsync

PipelineUmmaAsync is used for UMMA producers and AsyncThread consumers (e.g. Blackwell accumulator pipelines).

cta_group: CtaGroup#
static _compute_tmem_sync_mask(
cta_layout_vmnk: cutlass.cute.typing.Layout,
)#

Computes a mask to signal completion of tmem buffers for 2CTA kernels.

static _compute_peer_cta_rank()#

Computes a mask to signal release of tmem buffers for 2CTA kernels.

static create(
barrier_storage: cutlass.cute.typing.Pointer,
num_stages: int,
producer_group: CooperativeGroup,
consumer_group: CooperativeGroup,
cta_layout_vmnk: cutlass.cute.typing.Layout | None = None,
)#

This helper function computes any necessary attributes and returns an instance of PipelineUmmaAsync.

Parameters:
  • barrier_storage (cute.Pointer) – Pointer to the smem address for this pipeline’s mbarriers

  • num_stages (int) – Number of buffer stages for this pipeline

  • producer_group (CooperativeGroup) – CooperativeGroup for the producer agent

  • consumer_group (CooperativeGroup) – CooperativeGroup for the consumer agent

  • cta_layout_vmnk (cute.Layout | None) – Layout of the cluster shape

producer_commit(
state: PipelineState,
)#

UMMA producer commit signals that the buffer is full. cta_group needs to be provided.

producer_tail(
state: PipelineState,
)#

Make sure the last used buffer empty signal is visible to producer. Producer tail is usually executed by producer before exit, to avoid dangling mbarrier arrive signals after kernel exit.

Parameters:

state (PipelineState) – The pipeline state that points to next useful buffer

__init__(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
cta_group: CtaGroup,
) None#
class cutlass.utils.PipelineTmaStore(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
)#

Bases: PipelineAsync

PipelineTmaStore is used for synchronizing TMA stores in the epilogue. It does not use mbarriers.

static create(
num_stages: int,
producer_group: CooperativeGroup,
)#

This helper function computes any necessary attributes and returns an instance of PipelineTmaStore.

Parameters:
  • num_stages (int) – Number of buffer stages for this pipeline

  • producer_group (CooperativeGroup) – CooperativeGroup for the producer agent

producer_acquire()#
producer_commit()#
consumer_wait()#
consumer_release()#
producer_tail()#

Make sure the last used buffer empty signal is visible to producer. Producer tail is usually executed by producer before exit, to avoid dangling mbarrier arrive signals after kernel exit.

Parameters:

state (PipelineState) – The pipeline state that points to next useful buffer

__init__(
sync_object_array_full: SyncObjectArray,
sync_object_array_empty: SyncObjectArray,
num_stages: int,
producer_mask: cutlass.cutlass_dsl.Int32 | None,
consumer_mask: cutlass.cutlass_dsl.Int32 | None,
) None#
class cutlass.utils.PipelineOp(value)#

Bases: Enum

PipelineOp assigns an operation to an agent corresponding to a specific hardware feature.

AsyncThread = 1#
TCGen05Mma = 2#
TmaLoad = 3#
TmaStore = 4#
Composite = 5#