cpasync submodule#
- class cutlass.cute.nvgpu.cpasync.LoadCacheMode(value)#
Bases: Enum
An enumeration of the possible cache modes for a non-bulk cp.async instruction. See the PTX documentation.
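As an illustration of what these cache modes mean at the PTX level, here is a stand-in enum (the member names other than `always` are assumptions; the authoritative members live in `cutlass._mlir.dialects.cute_nvgpu`):

```python
from enum import Enum

class LoadCacheModeSketch(Enum):
    """Illustrative stand-in for LoadCacheMode (names other than ALWAYS
    are assumptions): the PTX-level cache hints of non-bulk cp.async."""
    ALWAYS = "ca"  # cp.async.ca: cache at all levels (L1 and L2)
    GLOBAL = "cg"  # cp.async.cg: cache at global level only (L2, bypassing L1)

# The default in the CopyG2SOp signature corresponds to the .ca variant:
print(LoadCacheModeSketch.ALWAYS.value)
```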
- class cutlass.cute.nvgpu.cpasync.CopyG2SOp(cache_mode: LoadCacheMode = cutlass._mlir.dialects.cute_nvgpu.LoadCacheMode.always)#
Bases: CopyOp
Non-bulk asynchronous GMEM to SMEM Copy Operation.
See the PTX documentation.
- __init__(cache_mode: LoadCacheMode = cutlass._mlir.dialects.cute_nvgpu.LoadCacheMode.always)#
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp(cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>)#
Bases: CopyOp
Bulk tensor asynchronous GMEM to SMEM Copy Operation using the TMA unit.
See the PTX documentation. This Operation uses TMA in the .tile mode.
- admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
- __init__(cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>)#
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>)#
Bases: CopyOp
Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit.
See the PTX documentation. This Operation uses TMA in the .tile mode.
- admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
- __init__(cta_group: ~cutlass.cute.nvgpu.tcgen05.mma.CtaGroup = <CtaGroup.ONE>)#
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp#
Bases: CopyOp
Bulk tensor asynchronous SMEM to GMEM Copy Operation using the TMA unit.
See the PTX documentation. This Operation uses TMA in the .tile mode.
- admissible_archs = ['sm_90', 'sm_90a', 'sm_100a']#
- __init__() None #
- cutlass.cute.nvgpu.cpasync.make_tma_tile_atom(
- op: CopyBulkTensorTileG2SOp | CopyBulkTensorTileG2SMulticastOp | CopyBulkTensorTileS2GOp,
- gmem_tensor: cutlass.cute.typing.Tensor,
- smem_layout: cutlass.cute.typing.Layout,
- cta_tiler: cutlass.cute.typing.Tiler,
- num_multicast: int = 1,
- *,
- internal_type: Type[cutlass.cute.typing.Numeric] | None = None,
- loc=None,
- ip=None,
)#
Makes a TMA Copy Atom in the .tile mode to copy tiles of a GMEM tensor to/from an SMEM buffer with the given Layout.
Given
- a GMEM tensor
- an SMEM layout
- a CTA-level Tiler
this function figures out the bulk tensor asynchronous copy instruction to use with the maximum “TMA vector length” to copy tiles of the GMEM tensor to/from an SMEM buffer with the provided layout and consistent with the provided Tiler.
This function returns two results:
- the Copy Atom
- the so-called TMA tensor used to map logical coordinates of the GMEM tensor to coordinates that the TMA unit can consume. TMA tensors have so-called basis stride elements so that the associated layout can output coordinates instead of linear indices. Otherwise, TMA tensors can be partitioned similarly to any other CuTe tensor using the algebra.
- Parameters:
op (Union[CopyBulkTensorTileG2SOp, CopyBulkTensorTileG2SMulticastOp, CopyBulkTensorTileS2GOp]) – The Copy Operation to construct an Atom for
gmem_tensor (Tensor) – The GMEM tensor involved in the Copy
smem_layout (Layout) – The SMEM layout to construct the Copy Atom for
cta_tiler (Tiler) – The CTA Tiler to use
num_multicast (int) – The multicast factor
internal_type (Type[Numeric]) – An optional parameter for the internal data type to use when the actual data type is not supported by the TMA unit
- Returns:
A Copy Atom for this Operation and the associated TMA tensor
- Return type:
Tuple[core.CopyAtom, Tensor]
- cutlass.cute.nvgpu.cpasync.tma_partition(
- atom: CopyAtom,
- cta_coord: cutlass.cute.typing.Coord,
- cta_layout: cutlass.cute.typing.Layout,
- smem_tensor: cutlass.cute.typing.Tensor,
- gmem_tensor: cutlass.cute.typing.Tensor,
- *,
- loc=None,
- ip=None,
)#
Tiles the GMEM and SMEM tensors for the provided TMA Copy Atom.
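Taken together, make_tma_tile_atom and tma_partition are typically used as in the following sketch (pseudocode; the return convention of tma_partition and the final copy call are assumptions, not verbatim API):

```
# Build the Atom and the TMA coordinate tensor from the GMEM tensor
op = cpasync.CopyBulkTensorTileG2SOp()
tma_atom, tma_tensor = cpasync.make_tma_tile_atom(
    op, gmem_tensor, smem_layout, cta_tiler)

# Per-CTA partitioning of the SMEM buffer and the TMA tensor
smem_part, gmem_part = cpasync.tma_partition(
    tma_atom, cta_coord, cta_layout, smem_tensor, tma_tensor)

# Issue the asynchronous bulk tensor copy for this CTA's tile
cute.copy(tma_atom, gmem_part, smem_part)
```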
- cutlass.cute.nvgpu.cpasync.create_tma_multicast_mask(
- cta_layout_vmnk: cutlass.cute.typing.Layout,
- cta_coord_vmnk: cutlass.cute.typing.Coord,
- mcast_mode: int,
- *,
- loc=None,
- ip=None,
)#
Computes a multicast mask for a TMA load Copy.
- Parameters:
cta_layout_vmnk (Layout) – The VMNK layout of the cluster
cta_coord_vmnk (Coord) – The VMNK coordinate of the current CTA
mcast_mode (int) – The tensor mode in which to multicast
- Returns:
The resulting mask
- Return type:
Int16
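The mask computation can be sketched in plain Python (a conceptual model, not the library's implementation; it assumes the cluster layout is CuTe's default column-major layout):

```python
def multicast_mask(cluster_shape, cta_coord, mcast_mode):
    """Conceptual sketch: set one bit per CTA rank that shares every
    coordinate with `cta_coord` except along `mcast_mode`."""
    # Column-major rank computation, as CuTe layouts default to.
    strides, s = [], 1
    for extent in cluster_shape:
        strides.append(s)
        s *= extent
    mask = 0
    for i in range(cluster_shape[mcast_mode]):
        coord = list(cta_coord)
        coord[mcast_mode] = i
        rank = sum(c * d for c, d in zip(coord, strides))
        mask |= 1 << rank
    return mask

# A (1, 2, 2, 1) VMNK cluster, CTA at (0, 1, 0, 0), multicasting along mode 2 (N):
print(hex(multicast_mask((1, 2, 2, 1), (0, 1, 0, 0), 2)))  # -> 0xa
```

Each set bit selects one CTA of the cluster as a multicast destination, which is why the result fits in an Int16 (clusters have at most 16 CTAs).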
- cutlass.cute.nvgpu.cpasync.prefetch_descriptor(
- tma_atom: CopyAtom,
- *,
- loc=None,
- ip=None,
)#
Prefetches the TMA descriptor associated with the TMA Atom.
- cutlass.cute.nvgpu.cpasync.copy_tensormap(
- tma_atom: CopyAtom,
- tensormap_ptr: cutlass.cute.typing.Pointer,
- *,
- loc=None,
- ip=None,
)#
Copies the tensormap held by a TMA Copy Atom to the memory location pointed to by the provided pointer.
- Parameters:
tma_atom (CopyAtom) – The TMA Copy Atom
tensormap_ptr (Pointer) – The pointer to the memory location to copy the tensormap to
- cutlass.cute.nvgpu.cpasync.update_tma_descriptor(
- tma_atom: CopyAtom,
- gmem_tensor: cutlass.cute.typing.Tensor,
- tma_desc_ptr: cutlass.cute.typing.Pointer,
- *,
- loc=None,
- ip=None,
)#
Updates the TMA descriptor in the memory location pointed to by the provided pointer using information from a TMA Copy Atom and the provided GMEM tensor.
Specifically, the following fields of the TMA descriptor will be updated:
- the GMEM tensor base address
- the GMEM tensor shape
- the GMEM tensor stride
Other fields of the TMA descriptor are left unchanged.
- Parameters:
tma_atom (CopyAtom) – The TMA Copy Atom
gmem_tensor (Tensor) – The GMEM tensor
tma_desc_ptr (Pointer) – The pointer to the memory location of the descriptor to update
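Conceptually, the update rewrites exactly three fields and preserves all others; a plain-Python sketch (dict with hypothetical field names, not the real descriptor layout):

```python
def update_descriptor_fields(desc, base_address, shape, stride):
    """Conceptual sketch: only the GMEM base address, shape, and stride
    are rewritten; every other field of the descriptor is left untouched."""
    updated = dict(desc)
    updated["global_address"] = base_address
    updated["global_shape"] = shape
    updated["global_stride"] = stride
    return updated

desc = {"global_address": 0x1000, "global_shape": (128, 64),
        "global_stride": (1, 128), "swizzle": "128B", "elem_type": "f16"}
new = update_descriptor_fields(desc, 0x2000, (256, 64), (1, 256))
print(new["swizzle"])  # fields outside the three above are unchanged
```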
- cutlass.cute.nvgpu.cpasync.fence_tma_desc_acquire(
- tma_desc_ptr: cutlass.cute.typing.Pointer,
- *,
- loc=None,
- ip=None,
)#
See the PTX documentation.
- cutlass.cute.nvgpu.cpasync.cp_fence_tma_desc_release(
- tma_desc_global_ptr: cutlass.cute.typing.Pointer,
- tma_desc_shared_ptr: cutlass.cute.typing.Pointer,
- *,
- loc=None,
- ip=None,
)#
See the PTX documentation.
- cutlass.cute.nvgpu.cpasync.fence_tma_desc_release(*, loc=None, ip=None) None #
See the PTX documentation.
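A plausible tensormap-update flow pieced together from the entry points above (pseudocode; the ordering follows the usual release/acquire discipline and is an assumption, not a verbatim recipe):

```
# 1. Stage a private copy of the Atom's tensormap in global memory
cpasync.copy_tensormap(tma_atom, tensormap_ptr)

# 2. Rewrite the GMEM base address / shape / stride for the new tensor
cpasync.update_tma_descriptor(tma_atom, new_gmem_tensor, tensormap_ptr)

# 3. Order the descriptor writes, then acquire before the TMA unit consumes it
cpasync.fence_tma_desc_release()
cpasync.fence_tma_desc_acquire(tensormap_ptr)
```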