cpasync submodule#

class cutlass.cute.nvgpu.cpasync.LoadCacheMode(value)#

Bases: Enum

An enumeration for the possible cache modes of a non-bulk cp.async instruction.

See the PTX documentation.

class cutlass.cute.nvgpu.cpasync.CopyG2SOp(
cache_mode: LoadCacheMode = cutlass._mlir.dialects.cute_nvgpu.LoadCacheMode.always,
)#

Bases: CopyOp

Non-bulk asynchronous GMEM to SMEM Copy Operation.

See the PTX documentation.
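For illustration, a minimal sketch of constructing the op (requires the CUTLASS Python DSL; the claim that `always` caches at all levels is an interpretation of the PTX `cp.async.ca` variant, not stated by this API):

```python
from cutlass.cute.nvgpu import cpasync

# cp.async GMEM -> SMEM copy op; LoadCacheMode.always is the default and
# plausibly corresponds to caching at all levels (PTX cp.async.ca).
op = cpasync.CopyG2SOp(cache_mode=cpasync.LoadCacheMode.always)
```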

__init__(
cache_mode: LoadCacheMode = cutlass._mlir.dialects.cute_nvgpu.LoadCacheMode.always,
) None#
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
)#

Bases: TmaCopyOp

Bulk tensor asynchronous GMEM to SMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

cta_group: CtaGroup = 1#
__init__(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
) None#
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
)#

Bases: TmaCopyOp

Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

cta_group: CtaGroup = 1#
__init__(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
) None#
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp#

Bases: TmaCopyOp

Bulk tensor asynchronous SMEM to GMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

__init__() None#
class cutlass.cute.nvgpu.cpasync.CopyReduceBulkTensorTileS2GOp(
reduction_kind: cutlass._mlir.dialects.cute.ReductionOp = cutlass._mlir.dialects.cute.ReductionOp.ADD,
)#

Bases: TmaCopyOp

Bulk tensor asynchronous SMEM to GMEM Reduction Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

__init__(
reduction_kind: cutlass._mlir.dialects.cute.ReductionOp = cutlass._mlir.dialects.cute.ReductionOp.ADD,
) None#
class cutlass.cute.nvgpu.cpasync.CopyDsmemStoreOp#

Bases: CopyOp

Asynchronous Store operation to DSMEM with explicit synchronization.

See the PTX documentation.

__init__() None#
cutlass.cute.nvgpu.cpasync.make_tiled_tma_atom(
op: CopyBulkTensorTileG2SOp | CopyBulkTensorTileG2SMulticastOp | CopyBulkTensorTileS2GOp | CopyReduceBulkTensorTileS2GOp,
gmem_tensor: cutlass.cute.typing.Tensor,
smem_layout_: cutlass.cute.typing.Layout | cutlass.cute.typing.ComposedLayout,
cta_tiler: cutlass.cute.typing.Tiler,
num_multicast: int = 1,
*,
internal_type: Type[cutlass.cute.typing.Numeric] | None = None,
loc=None,
ip=None,
) Tuple[CopyAtom, cutlass.cute.typing.Tensor]#

Makes a TMA Copy Atom in the .tile mode to copy tiles of a GMEM tensor to/from an SMEM buffer with the given Layout.

Given

  • a GMEM tensor

  • a SMEM layout

  • a CTA-level Tiler

this function selects the bulk tensor asynchronous copy instruction with the maximum “TMA vector length” to copy tiles of the GMEM tensor to/from an SMEM buffer with the provided layout, while maintaining consistency with the provided Tiler.

This function returns two results:

  1. the Copy Atom

  2. a TMA tensor that maps logical coordinates of the GMEM tensor to coordinates consumed by the TMA unit. TMA tensors contain basis stride elements that enable their associated layout to compute coordinates. Like other CuTe tensors, TMA tensors can be partitioned.

Parameters:
  • op (TMAOp) – The TMA Copy Operation for which to construct an Atom

  • gmem_tensor (Tensor) – The GMEM tensor involved in the Copy

  • smem_layout_ (Union[Layout, ComposedLayout]) – The SMEM layout used to construct the Copy Atom, with or without a stage mode

  • cta_tiler (Tiler) – The CTA Tiler to use

  • num_multicast (int) – The multicast factor

  • internal_type (Type[Numeric]) – Optional internal data type to use when the tensor data type is not supported by the TMA unit

Returns:

A TMA Copy Atom associated with the TMA tensor

Return type:

Tuple[atom.CopyAtom, Tensor]
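A sketch of typical host-side usage, assuming `gmem_tensor`, `smem_layout`, and `cta_tiler` have been created elsewhere with the CUTLASS Python DSL (not runnable standalone; a TMA-capable GPU is required at kernel compile time):

```python
from cutlass.cute.nvgpu import cpasync

# Build a non-multicast G2S TMA atom plus the TMA coordinate tensor.
tma_atom, tma_tensor = cpasync.make_tiled_tma_atom(
    cpasync.CopyBulkTensorTileG2SOp(),
    gmem_tensor,   # GMEM tensor to be tiled (created elsewhere)
    smem_layout,   # per-stage SMEM layout
    cta_tiler,     # CTA-level tiler, e.g. (128, 64)
)
```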

cutlass.cute.nvgpu.cpasync.tma_partition(
atom: CopyAtom,
cta_coord: cutlass.cute.typing.Coord,
cta_layout: cutlass.cute.typing.Layout,
smem_tensor: cutlass.cute.typing.Tensor,
gmem_tensor: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) Tuple[cutlass.cute.typing.Tensor, cutlass.cute.typing.Tensor]#

Tiles the GMEM and SMEM tensors for the provided TMA Copy Atom.
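A sketch of obtaining the partitioned tensor pair inside a kernel; the names `sA`, `gA`, `cta_coord`, and `cta_layout` are assumptions created elsewhere, and grouping the tile modes with `cute.group_modes` follows the pattern used in the CUTLASS repository examples:

```python
import cutlass.cute as cute
from cutlass.cute.nvgpu import cpasync

# Partition SMEM and GMEM tensors for the TMA atom; the grouped leading
# modes identify the tile to be copied per stage / per CTA coordinate.
tAsA, tAgA = cpasync.tma_partition(
    tma_atom,
    cta_coord,                     # CTA coordinate in the multicast layout
    cta_layout,                    # layout of the multicasting CTAs
    cute.group_modes(sA, 0, 2),    # SMEM tensor, tile modes grouped
    cute.group_modes(gA, 0, 2),    # TMA (GMEM) tensor, tile modes grouped
)
```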

cutlass.cute.nvgpu.cpasync.create_tma_multicast_mask(
cta_layout_vmnk: cutlass.cute.typing.Layout,
cta_coord_vmnk: cutlass.cute.typing.Coord,
mcast_mode: int,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Int16#

Computes a multicast mask for a TMA load Copy.

Parameters:
  • cta_layout_vmnk (Layout) – The VMNK layout of the cluster

  • cta_coord_vmnk (Coord) – The VMNK coordinate of the current CTA

  • mcast_mode (int) – The tensor mode in which to multicast

Returns:

The resulting mask

Return type:

Int16
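The returned Int16 is a bitmask over CTA ranks in the cluster: bit i is set when the CTA of linear rank i participates in the multicast, i.e. every CTA whose coordinate matches cta_coord_vmnk in all modes except mcast_mode. The idea can be illustrated in plain Python (an illustrative re-implementation, not the library's code; it assumes a compact layout with mode 0 fastest-varying, whereas the real function uses the strides of cta_layout_vmnk):

```python
def multicast_mask(cluster_shape, cta_coord, mcast_mode):
    """Bit i of the result is set when the CTA of linear rank i receives
    the multicast: the coordinate is varied along `mcast_mode` only."""
    mask = 0
    coord = list(cta_coord)
    for k in range(cluster_shape[mcast_mode]):
        coord[mcast_mode] = k
        # Linear rank with mode 0 fastest-varying (illustrative choice).
        rank, stride = 0, 1
        for mode, extent in zip(coord, cluster_shape):
            rank += mode * stride
            stride *= extent
        mask |= 1 << rank
    return mask

# A 2x2 cluster, CTA at coordinate (0, 1), multicasting along mode 0:
# participating CTAs are (0, 1) and (1, 1), i.e. ranks 2 and 3.
print(bin(multicast_mask((2, 2), (0, 1), 0)))  # 0b1100
```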

cutlass.cute.nvgpu.cpasync.prefetch_descriptor(
tma_atom: CopyAtom,
*,
loc=None,
ip=None,
) None#

Prefetches the TMA descriptor associated with the TMA Atom.

cutlass.cute.nvgpu.cpasync.copy_tensormap(
tma_atom: CopyAtom,
tensormap_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

Copies the tensormap held by a TMA Copy Atom to the memory location pointed to by the provided pointer.

Parameters:
  • tma_atom (CopyAtom) – The TMA Copy Atom

  • tensormap_ptr (Pointer) – The pointer to the memory location to copy the tensormap to

cutlass.cute.nvgpu.cpasync.update_tma_descriptor(
tma_atom: CopyAtom,
gmem_tensor: cutlass.cute.typing.Tensor,
tma_desc_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

Updates the TMA descriptor in the memory location pointed to by the provided pointer using information from a TMA Copy Atom and the provided GMEM tensor.

Specifically, the following fields of the TMA descriptor will be updated:

  1. the GMEM tensor base address

  2. the GMEM tensor shape

  3. the GMEM tensor stride

Other fields of the TMA descriptor are left unchanged.

Parameters:
  • tma_atom (CopyAtom) – The TMA Copy Atom

  • gmem_tensor (Tensor) – The GMEM tensor

  • tma_desc_ptr (Pointer) – The pointer to the memory location of the descriptor to update
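A sketch of how copy_tensormap and update_tma_descriptor compose, e.g. when one descriptor is reused across the groups of a grouped GEMM. The names `tma_atom`, `group_gmem_tensor`, and `tensormap_ptr` are assumptions created elsewhere, and the fencing shown is one plausible placement; the exact release/acquire protocol depends on which thread updates and which consumes the descriptor:

```python
from cutlass.cute.nvgpu import cpasync

# Seed a GMEM workspace with the atom's initial tensormap, then patch the
# fields that change per group: base address, shape, and stride.
cpasync.copy_tensormap(tma_atom, tensormap_ptr)
cpasync.update_tma_descriptor(tma_atom, group_gmem_tensor, tensormap_ptr)
# Make the updated descriptor visible to the TMA unit before issuing copies.
cpasync.fence_tma_desc_acquire(tensormap_ptr)
```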

cutlass.cute.nvgpu.cpasync.fence_tma_desc_acquire(
tma_desc_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

Acquire fence on the TMA descriptor pointed to by tma_desc_ptr, ensuring that prior updates to the descriptor are observed by subsequent TMA operations. See the PTX documentation.

cutlass.cute.nvgpu.cpasync.cp_fence_tma_desc_release(
tma_desc_global_ptr: cutlass.cute.typing.Pointer,
tma_desc_shared_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

Copies a TMA descriptor from SMEM to GMEM with release fence semantics, making the descriptor visible to subsequent acquiring consumers. See the PTX documentation.

cutlass.cute.nvgpu.cpasync.fence_tma_desc_release(*, loc=None, ip=None) None#

Release fence ordering prior modifications of a TMA descriptor before it is consumed. See the PTX documentation.