cpasync submodule#

class cutlass.cute.nvgpu.cpasync.LoadCacheMode(value)#

Bases: Enum

An enumeration for the possible cache modes of a non-bulk cp.async instruction.

See the PTX documentation.

class cutlass.cute.nvgpu.cpasync.CopyG2SOp(
cache_mode: LoadCacheMode = cutlass._mlir.dialects.cute_nvgpu.LoadCacheMode.always,
)#

Bases: CopyOp

Non-bulk asynchronous GMEM to SMEM Copy Operation.

See the PTX documentation.

__init__(
cache_mode: LoadCacheMode = cutlass._mlir.dialects.cute_nvgpu.LoadCacheMode.always,
) None#
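For example (an illustrative, non-runnable sketch; the member name `ALWAYS` is an assumption based on the MLIR enumerator `always` shown in the signature above):

```python
from cutlass.cute.nvgpu import cpasync

# Default construction: cache at all levels (cp.async.ca)
op = cpasync.CopyG2SOp()

# Equivalent explicit form; the enumerator name is assumed here
op = cpasync.CopyG2SOp(cache_mode=cpasync.LoadCacheMode.ALWAYS)
```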
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
)#

Bases: CopyOp

Bulk tensor asynchronous GMEM to SMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

cta_group: CtaGroup = 1#
admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
__init__(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
) None#
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
)#

Bases: CopyOp

Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

cta_group: CtaGroup = 1#
admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
__init__(
cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>,
) None#
class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp#

Bases: CopyOp

Bulk tensor asynchronous SMEM to GMEM Copy Operation using the TMA unit.

See the PTX documentation. This Operation uses TMA in the .tile mode.

admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
__init__() None#
cutlass.cute.nvgpu.cpasync.make_tma_tile_atom(
op: CopyBulkTensorTileG2SOp | CopyBulkTensorTileG2SMulticastOp | CopyBulkTensorTileS2GOp,
gmem_tensor: cutlass.cute.typing.Tensor,
smem_layout: cutlass.cute.typing.Layout,
cta_tiler: cutlass.cute.typing.Tiler,
num_multicast: int = 1,
*,
internal_type: Type[cutlass.cute.typing.Numeric] | None = None,
loc=None,
ip=None,
) Tuple[CopyAtom, cutlass.cute.typing.Tensor]#

Makes a TMA Copy Atom in the .tile mode to copy tiles of a GMEM tensor to/from an SMEM buffer with the given Layout.

Given

  • a GMEM tensor

  • a SMEM layout

  • a CTA-level Tiler

this function determines the bulk tensor asynchronous copy instruction to use, with the maximum “TMA vector length”, to copy tiles of the GMEM tensor to/from an SMEM buffer with the provided layout and consistent with the provided Tiler.

This function returns two results:

  1. the Copy Atom

  2. the so-called TMA tensor used to map logical coordinates of the GMEM tensor to coordinates that the TMA unit can consume. TMA tensors have so-called basis stride elements so that their layouts produce the coordinates expected by the TMA unit. Otherwise, TMA tensors can be partitioned like any other CuTe tensor using the CuTe algebra.

Parameters:
  • op (Union[CopyBulkTensorTileG2SOp, CopyBulkTensorTileG2SMulticastOp, CopyBulkTensorTileS2GOp]) – The Copy Operation to construct an Atom for

  • gmem_tensor (Tensor) – The GMEM tensor involved in the Copy

  • smem_layout (Layout) – The SMEM layout to construct the Copy Atom for

  • cta_tiler (Tiler) – The CTA Tiler to use

  • num_multicast (int) – The multicast factor

  • internal_type (Type[Numeric]) – An optional parameter for the internal data type to use when the actual data type is not supported by the TMA unit

Returns:

A Copy Atom for this Operation and the associated TMA tensor

Return type:

Tuple[core.CopyAtom, Tensor]
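As a pseudocode sketch of host-side usage (the names `mA`, `smem_layout`, and `cta_tiler` are placeholders for values built elsewhere with the CuTe DSL):

```python
# Pseudocode sketch: build a TMA tile-mode load atom for a GMEM tensor mA.
op = cpasync.CopyBulkTensorTileG2SOp()
tma_atom, tma_tensor = cpasync.make_tma_tile_atom(
    op,
    mA,           # GMEM tensor (placeholder)
    smem_layout,  # SMEM layout of one tile (placeholder)
    cta_tiler,    # CTA-level tiler (placeholder)
)
# tma_tensor stands in for mA in subsequent tiling/partitioning: its basis
# strides produce the coordinates the TMA unit consumes.
```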

cutlass.cute.nvgpu.cpasync.tma_partition(
atom: CopyAtom,
cta_coord: cutlass.cute.typing.Coord,
cta_layout: cutlass.cute.typing.Layout,
smem_tensor: cutlass.cute.typing.Tensor,
gmem_tensor: cutlass.cute.typing.Tensor,
*,
loc=None,
ip=None,
) Tuple[cutlass.cute.typing.Tensor, cutlass.cute.typing.Tensor]#

Tiles the GMEM and SMEM tensors for the provided TMA Copy Atom.
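A pseudocode sketch of device-side use, assuming `tma_atom` was built with make_tma_tile_atom and `gA`/`sA` are the per-CTA GMEM/SMEM tiles; all names, and the `tma_bar_ptr` keyword of `cute.copy`, are placeholders/assumptions here:

```python
# Pseudocode sketch: partition per-CTA tiles for the TMA atom, then issue the copy.
tAsA, tAgA = cpasync.tma_partition(
    tma_atom,
    cta_coord,   # this CTA's coordinate (placeholder)
    cta_layout,  # layout of the CTAs participating in the copy (placeholder)
    sA,          # SMEM tile (placeholder)
    gA,          # GMEM tile of the TMA tensor (placeholder)
)
cute.copy(tma_atom, tAgA, tAsA, tma_bar_ptr=mbar_ptr)  # mbar_ptr: mbarrier (placeholder)
```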

cutlass.cute.nvgpu.cpasync.create_tma_multicast_mask(
cta_layout_vmnk: cutlass.cute.typing.Layout,
cta_coord_vmnk: cutlass.cute.typing.Coord,
mcast_mode: int,
*,
loc=None,
ip=None,
) cutlass.cute.typing.Int16#

Computes a multicast mask for a TMA load Copy.

Parameters:
  • cta_layout_vmnk (Layout) – The VMNK layout of the cluster

  • cta_coord_vmnk (Coord) – The VMNK coordinate of the current CTA

  • mcast_mode (int) – The tensor mode in which to multicast

Returns:

The resulting mask

Return type:

Int16
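The computation can be illustrated with a plain-Python analogue (this is not the DSL function, which operates on CuTe layouts inside a kernel; here the cluster layout is modeled as shape/stride tuples with rank = sum(coord[i] * stride[i])):

```python
def crd2idx(coord, stride):
    """Map a multi-dimensional coordinate to a linear CTA rank."""
    return sum(c * s for c, s in zip(coord, stride))

def multicast_mask(shape, stride, coord, mcast_mode):
    """Set one bit per CTA that shares `coord` in all modes except
    `mcast_mode`, which varies over its full extent."""
    mask = 0
    peer = list(coord)
    for i in range(shape[mcast_mode]):
        peer[mcast_mode] = i
        mask |= 1 << crd2idx(peer, stride)
    return mask

# Example: a (1, 2, 2, 1) cluster with column-major strides; the CTA at
# VMNK coordinate (0, 0, 1, 0) multicasts along the M mode (mode 1).
shape, stride = (1, 2, 2, 1), (1, 1, 2, 4)
print(bin(multicast_mask(shape, stride, (0, 0, 1, 0), 1)))  # 0b1100: CTAs 2 and 3
```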

cutlass.cute.nvgpu.cpasync.prefetch_descriptor(
tma_atom: CopyAtom,
*,
loc=None,
ip=None,
) None#

Prefetches the TMA descriptor associated with the TMA Atom.

cutlass.cute.nvgpu.cpasync.copy_tensormap(
tma_atom: CopyAtom,
tensormap_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

Copies the tensormap held by a TMA Copy Atom to the memory location pointed to by the provided pointer.

Parameters:
  • tma_atom (CopyAtom) – The TMA Copy Atom

  • tensormap_ptr (Pointer) – The pointer to the memory location to copy the tensormap to

cutlass.cute.nvgpu.cpasync.update_tma_descriptor(
tma_atom: CopyAtom,
gmem_tensor: cutlass.cute.typing.Tensor,
tma_desc_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

Updates the TMA descriptor in the memory location pointed to by the provided pointer using information from a TMA Copy Atom and the provided GMEM tensor.

Specifically, the following fields of the TMA descriptor will be updated:

  1. the GMEM tensor base address

  2. the GMEM tensor shape

  3. the GMEM tensor stride

Other fields of the TMA descriptor are left unchanged.

Parameters:
  • tma_atom (CopyAtom) – The TMA Copy Atom

  • gmem_tensor (Tensor) – The GMEM tensor

  • tma_desc_ptr (Pointer) – The pointer to the memory location of the descriptor to update
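A pseudocode sketch of a typical update flow (e.g. a grouped GEMM where each group reads a different GMEM tensor); `tensormap_smem_ptr` and `tensormap_gmem_ptr` are placeholder pointers to staging locations, and the ordering shown is an assumption based on the fence functions documented below:

```python
# Pseudocode sketch: retarget a TMA descriptor at a new GMEM tensor.
cpasync.copy_tensormap(tma_atom, tensormap_smem_ptr)       # stage the atom's tensormap
cpasync.update_tma_descriptor(tma_atom, new_gmem_tensor, tensormap_smem_ptr)
cpasync.cp_fence_tma_desc_release(tensormap_gmem_ptr, tensormap_smem_ptr)
# ...later, before the first TMA instruction that consumes the updated descriptor:
cpasync.fence_tma_desc_acquire(tensormap_gmem_ptr)
```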

cutlass.cute.nvgpu.cpasync.fence_tma_desc_acquire(
tma_desc_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

See the PTX documentation.

cutlass.cute.nvgpu.cpasync.cp_fence_tma_desc_release(
tma_desc_global_ptr: cutlass.cute.typing.Pointer,
tma_desc_shared_ptr: cutlass.cute.typing.Pointer,
*,
loc=None,
ip=None,
) None#

See the PTX documentation.

cutlass.cute.nvgpu.cpasync.fence_tma_desc_release(*, loc=None, ip=None) None#

See the PTX documentation.