cpasync submodule
- class cutlass.cute.nvgpu.cpasync.LoadCacheMode(value)
Bases: Enum
An enumeration for the possible cache modes of a non-bulk cp.async instruction. See the PTX documentation.
- class cutlass.cute.nvgpu.cpasync.CopyG2SOp(cache_mode: LoadCacheMode = LoadCacheMode.always)
Bases: CopyOp
Non-bulk asynchronous GMEM to SMEM Copy Operation. See the PTX documentation.
- __init__(cache_mode: LoadCacheMode = LoadCacheMode.always) None
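As a hedged usage sketch (assuming a @cute.jit-traced context; the element type and vector width below are illustrative choices, not API requirements), this Operation is typically wrapped into a Copy Atom with cute.make_copy_atom:

import cutlass
import cutlass.cute as cute
from cutlass.cute.nvgpu import cpasync

# Non-bulk cp.async GMEM->SMEM Copy Atom for fp16 elements, moving
# 128 bits per copy instruction; cache mode defaults to LoadCacheMode.always.
atom = cute.make_copy_atom(
    cpasync.CopyG2SOp(),
    cutlass.Float16,
    num_bits_per_copy=128,
)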
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp(cta_group: CtaGroup = CtaGroup.ONE)
Bases: TmaCopyOp
Bulk tensor asynchronous GMEM to SMEM Copy Operation using the TMA unit. See the PTX documentation. This Operation uses TMA in the .tile mode.
- __init__(cta_group: CtaGroup = CtaGroup.ONE) None
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(cta_group: CtaGroup = CtaGroup.ONE)
Bases: TmaCopyOp
Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit. See the PTX documentation. This Operation uses TMA in the .tile mode.
- __init__(cta_group: CtaGroup = CtaGroup.ONE) None
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp
Bases: TmaCopyOp
Bulk tensor asynchronous SMEM to GMEM Copy Operation using the TMA unit. See the PTX documentation. This Operation uses TMA in the .tile mode.
- __init__() None
- class cutlass.cute.nvgpu.cpasync.CopyReduceBulkTensorTileS2GOp(reduction_kind: ReductionOp = ReductionOp.ADD)
Bases: TmaCopyOp
Bulk tensor asynchronous SMEM to GMEM Reduction Operation using the TMA unit. See the PTX documentation. This Operation uses TMA in the .tile mode.
- __init__(reduction_kind: ReductionOp = ReductionOp.ADD) None
- class cutlass.cute.nvgpu.cpasync.CopyDsmemStoreOp
Bases: CopyOp
Asynchronous Store operation to DSMEM (distributed shared memory) with explicit synchronization. See the PTX documentation.
- __init__() None
- cutlass.cute.nvgpu.cpasync.make_tiled_tma_atom(
    op: CopyBulkTensorTileG2SOp | CopyBulkTensorTileG2SMulticastOp | CopyBulkTensorTileS2GOp | CopyReduceBulkTensorTileS2GOp,
    gmem_tensor: cutlass.cute.typing.Tensor,
    smem_layout: cutlass.cute.typing.Layout | cutlass.cute.typing.ComposedLayout,
    cta_tiler: cutlass.cute.typing.Tiler,
    num_multicast: int = 1,
    *,
    internal_type: Type[cutlass.cute.typing.Numeric] | None = None,
    loc=None,
    ip=None,
)
Makes a TMA Copy Atom in the .tile mode to copy tiles of a GMEM tensor to/from an SMEM buffer with the given Layout.
Given
- a GMEM tensor,
- an SMEM layout,
- a CTA-level Tiler,
this function figures out the bulk tensor asynchronous copy instruction to use with the maximum “TMA vector length” to copy tiles of the GMEM tensor to/from an SMEM buffer with the provided layout while maintaining consistency with the provided Tiler.
This function returns two results:
- the Copy Atom
- a TMA tensor that maps logical coordinates of the GMEM tensor to coordinates consumed by the TMA unit. TMA tensors contain basis stride elements that enable their associated layout to compute coordinates. Like other CuTe tensors, TMA tensors can be partitioned.
- Parameters:
op (TMAOp) – The TMA Copy Operation to construct an Atom for
gmem_tensor (Tensor) – The GMEM tensor involved in the Copy
smem_layout (Union[Layout, ComposedLayout]) – The SMEM layout to construct the Copy Atom with, either with or without the stage mode
cta_tiler (Tiler) – The CTA Tiler to use
num_multicast (int) – The multicast factor
internal_type (Type[Numeric]) – Optional internal data type to use when the tensor data type is not supported by the TMA unit
- Returns:
The TMA Copy Atom together with the associated TMA tensor
- Return type:
Tuple[atom.CopyAtom, Tensor]
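A minimal sketch of constructing a TMA Copy Atom, assuming a @cute.jit-traced context in which gmem_tensor already exists; the SMEM layout and CTA tiler shown are illustrative values:

import cutlass.cute as cute
from cutlass.cute.nvgpu import cpasync

# Hypothetical 128x64 tile staged in SMEM, row-major within the tile.
smem_layout = cute.make_layout((128, 64), stride=(64, 1))
cta_tiler = (128, 64)

# Returns the Copy Atom plus a TMA tensor over the GMEM coordinate domain.
tma_atom, tma_tensor = cpasync.make_tiled_tma_atom(
    cpasync.CopyBulkTensorTileG2SOp(),
    gmem_tensor,
    smem_layout,
    cta_tiler,
)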
- cutlass.cute.nvgpu.cpasync.tma_partition(
    atom: CopyAtom,
    cta_coord: cutlass.cute.typing.Coord,
    cta_layout: cutlass.cute.typing.Layout,
    smem_tensor: cutlass.cute.typing.Tensor,
    gmem_tensor: cutlass.cute.typing.Tensor,
    *,
    loc=None,
    ip=None,
)
Tiles the GMEM and SMEM tensors for the provided TMA Copy Atom.
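A sketch of partitioning for the single-CTA (no multicast) case, assuming tma_atom comes from make_tiled_tma_atom and that sA (the SMEM tile) and gA (the tiled TMA tensor) were built elsewhere; the mode grouping is a common convention, not a requirement:

import cutlass.cute as cute
from cutlass.cute.nvgpu import cpasync

tAsA, tAgA = cpasync.tma_partition(
    tma_atom,
    0,                           # this CTA's coordinate in the multicast group
    cute.make_layout(1),         # trivial layout: a single participating CTA
    cute.group_modes(sA, 0, 2),  # SMEM tensor with the tile modes grouped
    cute.group_modes(gA, 0, 2),  # TMA tensor with the tile modes grouped
)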
- cutlass.cute.nvgpu.cpasync.create_tma_multicast_mask(
    cta_layout_vmnk: cutlass.cute.typing.Layout,
    cta_coord_vmnk: cutlass.cute.typing.Coord,
    mcast_mode: int,
    *,
    loc=None,
    ip=None,
)
Computes a multicast mask for a TMA load Copy.
- Parameters:
cta_layout_vmnk (Layout) – The VMNK layout of the cluster
cta_coord_vmnk (Coord) – The VMNK coordinate of the current CTA
mcast_mode (int) – The tensor mode in which to multicast
- Returns:
The resulting mask
- Return type:
Int16
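A sketch, assuming cta_layout_vmnk and cta_coord_vmnk describe the cluster and this CTA's position in it; the mode choice is illustrative (multicasting an A-operand load across the N mode of a (V, M, N, K) cluster):

from cutlass.cute.nvgpu import cpasync

# mcast_mode selects the cluster mode to multicast in; 2 is the N mode here.
a_mcast_mask = cpasync.create_tma_multicast_mask(
    cta_layout_vmnk, cta_coord_vmnk, mcast_mode=2
)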
- cutlass.cute.nvgpu.cpasync.prefetch_descriptor(
    tma_atom: CopyAtom,
    *,
    loc=None,
    ip=None,
)
Prefetches the TMA descriptor associated with the TMA Atom.
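For example (a sketch; the single-thread guard is assumed to be handled by surrounding kernel code), the prefetch is typically issued once, early in the kernel, so the descriptor is cache-resident before the first copy:

from cutlass.cute.nvgpu import cpasync

# Usually executed by one elected thread per CTA rather than by all threads.
cpasync.prefetch_descriptor(tma_atom)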
- cutlass.cute.nvgpu.cpasync.copy_tensormap(
    tma_atom: CopyAtom,
    tensormap_ptr: cutlass.cute.typing.Pointer,
    *,
    loc=None,
    ip=None,
)
Copies the tensormap held by a TMA Copy Atom to the memory location pointed to by the provided pointer.
- Parameters:
tma_atom (CopyAtom) – The TMA Copy Atom
tensormap_ptr (Pointer) – The pointer to the memory location to copy the tensormap to
- cutlass.cute.nvgpu.cpasync.update_tma_descriptor(
    tma_atom: CopyAtom,
    gmem_tensor: cutlass.cute.typing.Tensor,
    tma_desc_ptr: cutlass.cute.typing.Pointer,
    *,
    loc=None,
    ip=None,
)
Updates the TMA descriptor in the memory location pointed to by the provided pointer using information from a TMA Copy Atom and the provided GMEM tensor.
Specifically, the following fields of the TMA descriptor will be updated:
- the GMEM tensor base address
- the GMEM tensor shape
- the GMEM tensor stride
Other fields of the TMA descriptor are left unchanged.
- Parameters:
tma_atom (CopyAtom) – The TMA Copy Atom
gmem_tensor (Tensor) – The GMEM tensor
tma_desc_ptr (Pointer) – The pointer to the memory location of the descriptor to update
- cutlass.cute.nvgpu.cpasync.fence_tma_desc_acquire(
    tma_desc_ptr: cutlass.cute.typing.Pointer,
    *,
    loc=None,
    ip=None,
)
See the PTX documentation.
- cutlass.cute.nvgpu.cpasync.cp_fence_tma_desc_release(
    tma_desc_global_ptr: cutlass.cute.typing.Pointer,
    tma_desc_shared_ptr: cutlass.cute.typing.Pointer,
    *,
    loc=None,
    ip=None,
)
See the PTX documentation.
- cutlass.cute.nvgpu.cpasync.fence_tma_desc_release(*, loc=None, ip=None) None
See the PTX documentation.
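Putting the tensormap utilities together, a hedged sketch of one plausible update flow (e.g., switching a grouped GEMM to the next group's GMEM tensor); the buffer names, pointer provenance, and exact fencing discipline below are assumptions to validate against the PTX documentation, not a prescribed recipe:

from cutlass.cute.nvgpu import cpasync

# 1. Stage the Atom's tensormap into a writable buffer (pointer assumed given).
cpasync.copy_tensormap(tma_atom, tma_desc_gmem_ptr)

# 2. Rewrite the base address, shape, and stride fields for the new tensor.
cpasync.update_tma_descriptor(tma_atom, new_gmem_tensor, tma_desc_gmem_ptr)

# 3. Release-fence the updated descriptor, then acquire it before issuing
#    copies that consume the new descriptor.
cpasync.fence_tma_desc_release()
cpasync.fence_tma_desc_acquire(tma_desc_gmem_ptr)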