cpasync submodule
- class cutlass.cute.nvgpu.cpasync.LoadCacheMode(value)
Bases: Enum
An enumeration for the possible cache modes of a non-bulk cp.async instruction. See the PTX documentation.
- class cutlass.cute.nvgpu.cpasync.CopyG2SOp(cache_mode: LoadCacheMode = LoadCacheMode.always)
Bases: CopyOp
Non-bulk asynchronous GMEM to SMEM Copy Operation. See the PTX documentation.
- __init__(cache_mode: LoadCacheMode = LoadCacheMode.always) None
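As a hedged usage sketch (assuming a @cute.jit-traced context; the element type and vector width below are illustrative choices, not API requirements), this Operation is typically wrapped into a Copy Atom with cute.make_copy_atom:

import cutlass
import cutlass.cute as cute
from cutlass.cute.nvgpu import cpasync

# Non-bulk cp.async GMEM->SMEM Copy Atom for fp16 elements, moving
# 128 bits per copy instruction; cache mode defaults to LoadCacheMode.always.
atom = cute.make_copy_atom(
    cpasync.CopyG2SOp(),
    cutlass.Float16,
    num_bits_per_copy=128,
)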
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SOp(cta_group: CtaGroup = CtaGroup.ONE)
Bases: TmaCopyOp
Bulk tensor asynchronous GMEM to SMEM Copy Operation using the TMA unit. See the PTX documentation. This Operation uses TMA in the .tile mode.
- __init__(cta_group: CtaGroup = CtaGroup.ONE) None
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileG2SMulticastOp(cta_group: CtaGroup = CtaGroup.ONE)
Bases: TmaCopyOp
Bulk tensor asynchronous multicast GMEM to SMEM Copy Operation using the TMA unit. See the PTX documentation. This Operation uses TMA in the .tile mode.
- __init__(cta_group: CtaGroup = CtaGroup.ONE) None
- class cutlass.cute.nvgpu.cpasync.CopyBulkTensorTileS2GOp
Bases: TmaCopyOp
Bulk tensor asynchronous SMEM to GMEM Copy Operation using the TMA unit. See the PTX documentation. This Operation uses TMA in the .tile mode.
- __init__() None
- class cutlass.cute.nvgpu.cpasync.CopyReduceBulkTensorTileS2GOp(reduction_kind: ReductionOp = ReductionOp.ADD)
Bases: TmaCopyOp
Bulk tensor asynchronous SMEM to GMEM Reduction Operation using the TMA unit. See the PTX documentation. This Operation uses TMA in the .tile mode.
- __init__(reduction_kind: ReductionOp = ReductionOp.ADD) None
- class cutlass.cute.nvgpu.cpasync.CopyDsmemStoreOp
Bases: CopyOp
Asynchronous Store operation to DSMEM (distributed shared memory) with explicit synchronization. See the PTX documentation.
- __init__() None
- cutlass.cute.nvgpu.cpasync.make_tiled_tma_atom(
    op: CopyBulkTensorTileG2SOp | CopyBulkTensorTileG2SMulticastOp | CopyBulkTensorTileS2GOp | CopyReduceBulkTensorTileS2GOp,
    gmem_tensor: cutlass.cute.typing.Tensor,
    smem_layout: cutlass.cute.typing.Layout | cutlass.cute.typing.ComposedLayout,
    cta_tiler: cutlass.cute.typing.Tiler,
    num_multicast: int = 1,
    *,
    internal_type: Type[cutlass.cute.typing.Numeric] | None = None,
    loc=None,
    ip=None,
)
Makes a TMA Copy Atom in the .tile mode to copy tiles of a GMEM tensor to/from an SMEM buffer with the given Layout.
Given
- a GMEM tensor,
- an SMEM layout,
- a CTA-level Tiler,
this function figures out the bulk tensor asynchronous copy instruction to use with the maximum “TMA vector length” to copy tiles of the GMEM tensor to/from an SMEM buffer with the provided layout while maintaining consistency with the provided Tiler.
This function returns two results:
- the Copy Atom
- a TMA tensor that maps logical coordinates of the GMEM tensor to coordinates consumed by the TMA unit. TMA tensors contain basis stride elements that enable their associated layout to compute coordinates. Like other CuTe tensors, TMA tensors can be partitioned.
- Parameters:
op (TMAOp) – The TMA Copy Operation to construct an Atom for
gmem_tensor (Tensor) – The GMEM tensor involved in the Copy
smem_layout (Union[Layout, ComposedLayout]) – The SMEM layout to construct the Copy Atom with, either with or without the stage mode
cta_tiler (Tiler) – The CTA Tiler to use
num_multicast (int) – The multicast factor
internal_type (Type[Numeric]) – Optional internal data type to use when the tensor data type is not supported by the TMA unit
- Returns:
The TMA Copy Atom together with the associated TMA tensor
- Return type:
Tuple[atom.CopyAtom, Tensor]
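A minimal sketch of constructing a TMA Copy Atom, assuming a @cute.jit-traced context in which gmem_tensor already exists; the SMEM layout and CTA tiler shown are illustrative values:

import cutlass.cute as cute
from cutlass.cute.nvgpu import cpasync

# Hypothetical 128x64 tile staged in SMEM, row-major within the tile.
smem_layout = cute.make_layout((128, 64), stride=(64, 1))
cta_tiler = (128, 64)

# Returns the Copy Atom plus a TMA tensor over the GMEM coordinate domain.
tma_atom, tma_tensor = cpasync.make_tiled_tma_atom(
    cpasync.CopyBulkTensorTileG2SOp(),
    gmem_tensor,
    smem_layout,
    cta_tiler,
)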
- cutlass.cute.nvgpu.cpasync.tma_partition(
    atom: CopyAtom,
    cta_coord: cutlass.cute.typing.Coord,
    cta_layout: cutlass.cute.typing.Layout,
    smem_tensor: cutlass.cute.typing.Tensor,
    gmem_tensor: cutlass.cute.typing.Tensor,
    *,
    loc=None,
    ip=None,
)
Tiles the GMEM and SMEM tensors for the provided TMA Copy Atom.
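A sketch of partitioning for the single-CTA (no multicast) case, assuming tma_atom comes from make_tiled_tma_atom and that sA (the SMEM tile) and gA (the tiled TMA tensor) were built elsewhere; the mode grouping is a common convention, not a requirement:

import cutlass.cute as cute
from cutlass.cute.nvgpu import cpasync

tAsA, tAgA = cpasync.tma_partition(
    tma_atom,
    0,                           # this CTA's coordinate in the multicast group
    cute.make_layout(1),         # trivial layout: a single participating CTA
    cute.group_modes(sA, 0, 2),  # SMEM tensor with the tile modes grouped
    cute.group_modes(gA, 0, 2),  # TMA tensor with the tile modes grouped
)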
- cutlass.cute.nvgpu.cpasync.create_tma_multicast_mask(
    cta_layout_vmnk: cutlass.cute.typing.Layout,
    cta_coord_vmnk: cutlass.cute.typing.Coord,
    mcast_mode: int,
    *,
    loc=None,
    ip=None,
)
Computes a multicast mask for a TMA load Copy.
- Parameters:
cta_layout_vmnk (Layout) – The VMNK layout of the cluster
cta_coord_vmnk (Coord) – The VMNK coordinate of the current CTA
mcast_mode (int) – The tensor mode in which to multicast
- Returns:
The resulting mask
- Return type:
Int16
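A sketch, assuming cta_layout_vmnk and cta_coord_vmnk describe the cluster and this CTA's position in it; the mode choice is illustrative (multicasting an A-operand load across the N mode of a (V, M, N, K) cluster):

from cutlass.cute.nvgpu import cpasync

# mcast_mode selects the cluster mode to multicast in; 2 is the N mode here.
a_mcast_mask = cpasync.create_tma_multicast_mask(
    cta_layout_vmnk, cta_coord_vmnk, mcast_mode=2
)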
- cutlass.cute.nvgpu.cpasync.prefetch_descriptor(
    tma_atom: CopyAtom,
    *,
    loc=None,
    ip=None,
)
Prefetches the TMA descriptor associated with the TMA Atom.
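For example (a sketch; the single-thread guard is assumed to be handled by surrounding kernel code), the prefetch is typically issued once, early in the kernel, so the descriptor is cache-resident before the first copy:

from cutlass.cute.nvgpu import cpasync

# Usually executed by one elected thread per CTA rather than by all threads.
cpasync.prefetch_descriptor(tma_atom)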
- cutlass.cute.nvgpu.cpasync.copy_tensormap(
    tma_atom: CopyAtom,
    tensormap_ptr: cutlass.cute.typing.Pointer,
    *,
    loc=None,
    ip=None,
)
Copies the tensormap held by a TMA Copy Atom to the memory location pointed to by the provided pointer.
- Parameters:
tma_atom (CopyAtom) – The TMA Copy Atom
tensormap_ptr (Pointer) – The pointer to the memory location to copy the tensormap to
- cutlass.cute.nvgpu.cpasync.update_tma_descriptor(
    tma_atom: CopyAtom,
    gmem_tensor: cutlass.cute.typing.Tensor,
    tma_desc_ptr: cutlass.cute.typing.Pointer,
    *,
    loc=None,
    ip=None,
)
Updates the TMA descriptor in the memory location pointed to by the provided pointer using information from a TMA Copy Atom and the provided GMEM tensor.
Specifically, the following fields of the TMA descriptor will be updated:
- the GMEM tensor base address
- the GMEM tensor shape
- the GMEM tensor stride
Other fields of the TMA descriptor are left unchanged.
- Parameters:
tma_atom (CopyAtom) – The TMA Copy Atom
gmem_tensor (Tensor) – The GMEM tensor
tma_desc_ptr (Pointer) – The pointer to the memory location of the descriptor to update
- cutlass.cute.nvgpu.cpasync.fence_tma_desc_acquire(
    tma_desc_ptr: cutlass.cute.typing.Pointer,
    *,
    loc=None,
    ip=None,
)
See the PTX documentation.
- cutlass.cute.nvgpu.cpasync.cp_fence_tma_desc_release(
    tma_desc_global_ptr: cutlass.cute.typing.Pointer,
    tma_desc_shared_ptr: cutlass.cute.typing.Pointer,
    *,
    loc=None,
    ip=None,
)
See the PTX documentation.
- cutlass.cute.nvgpu.cpasync.fence_tma_desc_release(*, loc=None, ip=None) None
See the PTX documentation.
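Putting the tensormap utilities together, a hedged sketch of one plausible update flow (e.g., switching a grouped GEMM to the next group's GMEM tensor); the buffer names, pointer provenance, and exact fencing discipline below are assumptions to validate against the PTX documentation, not a prescribed recipe:

from cutlass.cute.nvgpu import cpasync

# 1. Stage the Atom's tensormap into a writable buffer (pointer assumed given).
cpasync.copy_tensormap(tma_atom, tma_desc_gmem_ptr)

# 2. Rewrite the base address, shape, and stride fields for the new tensor.
cpasync.update_tma_descriptor(tma_atom, new_gmem_tensor, tma_desc_gmem_ptr)

# 3. Release-fence the updated descriptor, then acquire it before issuing
#    copies that consume the new descriptor.
cpasync.fence_tma_desc_release()
cpasync.fence_tma_desc_acquire(tma_desc_gmem_ptr)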