cpasync submodule#

class cutlass.cute.nvgpu.cpasync.LoadCacheMode(value)#

Bases: Enum

An enumeration for the possible cache modes of a non-bulk cp.async instruction.

See the PTX documentation.

class cutlass.cute.nvgpu.cpasync.CopyG2SOp(
cache_mode: LoadCacheMode = cutlass._mlir.dialects.cute_nvgpu.LoadCacheMode.always,
)#

Bases: CopyOp

Non-bulk asynchronous GMEM to SMEM Copy Operation.

See the PTX documentation.

__init__(
cache_mode: LoadCacheMode = cutlass._mlir.dialects.cute_nvgpu.LoadCacheMode.always,
) None#
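For example (an illustrative, non-runnable sketch; the member name `ALWAYS` is an assumption based on the MLIR enumerator `always` shown in the signature above):

```python
from cutlass.cute.nvgpu import cpasync

# Default construction: cache at all levels (cp.async.ca)
op = cpasync.CopyG2SOp()

# Equivalent explicit form; the enumerator name is assumed here
op = cpasync.CopyG2SOp(cache_mode=cpasync.LoadCacheMode.ALWAYS)
```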
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
)#

Bases: CopyOp

Bulk tensor asynchronous GMEM to SMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

cta_group: CtaGroup = 1#
admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
__init__(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
) None#
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
)#

Bases: CopyOp

Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

cta_group: CtaGroup = 1#
admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
__init__(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
) None#
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp#

Bases: CopyOp

Bulk tensor asynchronous SMEM to GMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
__init__() None#
cutlass.cute.nvgpu.cpasync.make_tma_tile_atom(
op: CopyBulkTensorTileG2SOp | CopyBulkTensorTileG2SMulticastOp | CopyBulkTensorTileS2GOp,
gmem_tensor: cutlass.cute.typing.Tensor,
smem_layout: cutlass.cute.typing.Layout,
cta_tiler: cutlass.cute.typing.Tiler,
num_multicast: int = 1,
*,
internal_type: Type[cutlass.cute.typing.Numeric] | None = None,
loc=None,
ip=None,
) Tuple[CopyAtom, cutlass.cute.typing.Tensor]#

Makes a TMA Copy Atom in the .tile mode to copy tiles of a GMEM tensor to/from an SMEM buffer with the given Layout.

Given

  • a GMEM tensor

  • a SMEM layout

  • a CTA-level Tiler

this function determines the bulk tensor asynchronous copy instruction to use, with the maximum “TMA vector length”, to copy tiles of the GMEM tensor to/from an SMEM buffer with the provided layout and consistent with the provided Tiler.

This function returns two results:

  1. the Copy Atom

  2. the so-called TMA tensor used to map logical coordinates of the GMEM tensor to coordinates that the TMA unit can consume. TMA tensors have so-called basis stride elements so that their layouts produce the coordinates expected by the TMA unit. Otherwise, TMA tensors can be partitioned like any other CuTe tensor using the CuTe algebra.

Parameters:
  • op (Union[CopyBulkTensorTileG2SOp, CopyBulkTensorTileG2SMulticastOp, CopyBulkTensorTileS2GOp]) – The Copy Operation to construct an Atom for

  • gmem_tensor (Tensor) – The GMEM tensor involved in the Copy

  • smem_layout (Layout) – The SMEM layout to construct the Copy Atom for

  • cta_tiler (Tiler) – The CTA Tiler to use

  • num_multicast (int) – The multicast factor

  • internal_type (Type[Numeric]) – An optional parameter for the internal data type to use when the actual data type is not supported by the TMA unit

Returns:

A Copy Atom for this Operation and the associated TMA tensor

Return type:

Tuple[core.CopyAtom, Tensor]
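As a pseudocode sketch of host-side usage (the names `mA`, `smem_layout`, and `cta_tiler` are placeholders for values built elsewhere with the CuTe DSL):

```python
# Pseudocode sketch: build a TMA tile-mode load atom for a GMEM tensor mA.
op = cpasync.CopyBulkTensorTileG2SOp()
tma_atom, tma_tensor = cpasync.make_tma_tile_atom(
    op,
    mA,           # GMEM tensor (placeholder)
    smem_layout,  # SMEM layout of one tile (placeholder)
    cta_tiler,    # CTA-level tiler (placeholder)
)
# tma_tensor stands in for mA in subsequent tiling/partitioning: its basis
# strides produce the coordinates the TMA unit consumes.
```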

cutlass.cute.nvgpu.cpasync.tma_partition(
atom: CopyAtom,
cta_coord: cutlass.cute.typing.Coord,
cta_layout: cutlass.cute.typing.Layout,
smem_tensor: cutlass.cute.typing.Tensor,
gmem_tensor: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) Tuple[cutlass.cute.typing.Tensor, cutlass.cute.typing.Tensor]#

Tiles the GMEM and SMEM tensors for the provided TMA Copy Atom.
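A pseudocode sketch of device-side use, assuming `tma_atom` was built with make_tma_tile_atom and `gA`/`sA` are the per-CTA GMEM/SMEM tiles; all names, and the `tma_bar_ptr` keyword of `cute.copy`, are placeholders/assumptions here:

```python
# Pseudocode sketch: partition per-CTA tiles for the TMA atom, then issue the copy.
tAsA, tAgA = cpasync.tma_partition(
    tma_atom,
    cta_coord,   # this CTA's coordinate (placeholder)
    cta_layout,  # layout of the CTAs participating in the copy (placeholder)
    sA,          # SMEM tile (placeholder)
    gA,          # GMEM tile of the TMA tensor (placeholder)
)
cute.copy(tma_atom, tAgA, tAsA, tma_bar_ptr=mbar_ptr)  # mbar_ptr: mbarrier (placeholder)
```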

cutlass.cute.nvgpu.cpasync.create_tma_multicast_mask(
cta_layout_vmnk: cutlass.cute.typing.Layout,
cta_coord_vmnk: cutlass.cute.typing.Coord,
mcast_mode: int,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Int16#

Computes a multicast mask for a TMA load Copy.

Parameters:
  • cta_layout_vmnk (Layout) – The VMNK layout of the cluster

  • cta_coord_vmnk (Coord) – The VMNK coordinate of the current CTA

  • mcast_mode (int) – The tensor mode in which to multicast

Returns:

The resulting mask

Return type:

Int16
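The computation can be illustrated with a plain-Python analogue (this is not the DSL function, which operates on CuTe layouts inside a kernel; here the cluster layout is modeled as shape/stride tuples with rank = sum(coord[i] * stride[i])):

```python
def crd2idx(coord, stride):
    """Map a multi-dimensional coordinate to a linear CTA rank."""
    return sum(c * s for c, s in zip(coord, stride))

def multicast_mask(shape, stride, coord, mcast_mode):
    """Set one bit per CTA that shares `coord` in all modes except
    `mcast_mode`, which varies over its full extent."""
    mask = 0
    peer = list(coord)
    for i in range(shape[mcast_mode]):
        peer[mcast_mode] = i
        mask |= 1 << crd2idx(peer, stride)
    return mask

# Example: a (1, 2, 2, 1) cluster with column-major strides; the CTA at
# VMNK coordinate (0, 0, 1, 0) multicasts along the M mode (mode 1).
shape, stride = (1, 2, 2, 1), (1, 1, 2, 4)
print(bin(multicast_mask(shape, stride, (0, 0, 1, 0), 1)))  # 0b1100: CTAs 2 and 3
```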

cutlass.cute.nvgpu.cpasync.prefetch_descriptor(
tma_atom: CopyAtom,
*,
loc=None,
ip=None,
) None#

Prefetches the TMA descriptor associated with the TMA Atom.

cutlass.cute.nvgpu.cpasync.copy_tensormap(
tma_atom: CopyAtom,
tensormap_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

Copies the tensormap held by a TMA Copy Atom to the memory location pointed to by the provided pointer.

Parameters:
  • tma_atom (CopyAtom) – The TMA Copy Atom

  • tensormap_ptr (Pointer) – The pointer to the memory location to copy the tensormap to

cutlass.cute.nvgpu.cpasync.update_tma_descriptor(
tma_atom: CopyAtom,
gmem_tensor: cutlass.cute.typing.Tensor,
tma_desc_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

Updates the TMA descriptor in the memory location pointed to by the provided pointer using information from a TMA Copy Atom and the provided GMEM tensor.

Specifically, the following fields of the TMA descriptor will be updated:

  1. the GMEM tensor base address

  2. the GMEM tensor shape

  3. the GMEM tensor stride

Other fields of the TMA descriptor are left unchanged.

Parameters:
  • tma_atom (CopyAtom) – The TMA Copy Atom

  • gmem_tensor (Tensor) – The GMEM tensor

  • tma_desc_ptr (Pointer) – The pointer to the memory location of the descriptor to update
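A pseudocode sketch of a typical update flow (e.g. a grouped GEMM where each group reads a different GMEM tensor); `tensormap_smem_ptr` and `tensormap_gmem_ptr` are placeholder pointers to staging locations, and the ordering shown is an assumption based on the fence functions documented below:

```python
# Pseudocode sketch: retarget a TMA descriptor at a new GMEM tensor.
cpasync.copy_tensormap(tma_atom, tensormap_smem_ptr)       # stage the atom's tensormap
cpasync.update_tma_descriptor(tma_atom, new_gmem_tensor, tensormap_smem_ptr)
cpasync.cp_fence_tma_desc_release(tensormap_gmem_ptr, tensormap_smem_ptr)
# ...later, before the first TMA instruction that consumes the updated descriptor:
cpasync.fence_tma_desc_acquire(tensormap_gmem_ptr)
```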

cutlass.cute.nvgpu.cpasync.fence_tma_desc_acquire(
tma_desc_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

See the PTX documentation.

cutlass.cute.nvgpu.cpasync.cp_fence_tma_desc_release(
tma_desc_global_ptr: cutlass.cute.typing.Pointer,
tma_desc_shared_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

See the PTX documentation.

cutlass.cute.nvgpu.cpasync.fence_tma_desc_release(*, loc=None, ip=None) None#

See the PTX documentation.