Changelog for CuTe DSL API changes
4.2.0 (2025-09-15)
- Added back cute.make_tiled_copy at the community's request.
- Added support for explicit and implicit broadcasting in TensorSSA.
  - cutlass.cute.TensorSSA: supports broadcast_to and implicit broadcasting for binary operations.
- Added support for printing TensorSSA values in cutlass.cute.print_tensor.
- Updated cute.gemm to support all dispatch patterns and improved checks for illegal inputs.
- Introduced automatic kernel smem usage calculation for launch config.
- Introduced per-op fast-math control for math ops (e.g. exp, exp2, log2, log).
- Introduced CopyReduceBulkTensorTileS2GOp in tcgen05/copy.py to support TMA Reduce.
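The broadcasting entry above does not spell out the shape rule. As a rough illustration only, the sketch below computes a broadcast result shape for flat shapes, assuming NumPy-style right-aligned rules. That assumption is not stated in this changelog, and the helper name broadcast_shape is hypothetical, not part of the CuTe DSL.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast result shape of flat shapes a and b.

    Assumes NumPy-style rules: shapes are aligned at the trailing
    dimension, and a dimension of size 1 stretches to match the other.
    """
    result = []
    # Walk both shapes from the trailing dimension; missing dims count as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcast-compatible")
    return tuple(reversed(result))


print(broadcast_shape((4, 1), (1, 3)))  # (4, 3)
print(broadcast_shape((8, 4), (4,)))    # (8, 4)
```

Under these assumed rules, an explicit broadcast_to corresponds to naming the target shape up front, while implicit broadcasting derives it from the two operands of a binary operation.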
4.1.0 (2025-07-16)
- for loop
  - Python built-in range now always generates code and executes at runtime.
  - cutlass.range is an advanced range with kernel-code-level unrolling and pipelining control.
  - Deprecated cutlass.range_dynamic; replace it with range or cutlass.range.
  - Experimental: added pipelining control for compiler-generated software pipeline code.
- while/if
  - while/if now generates code and executes at runtime by default, unless cutlass.const_expr is specified for the predicate.
  - Deprecated cutlass.dynamic_expr; please remove it.
- Rename mbarrier functions to reduce ambiguity.
- Modify SyncObject API (MbarrierArray, NamedBarrier, TmaStoreFence) to match std::barrier.
- Change pipeline create function to take only keyword arguments, and make barrier_storage optional.
- Introduce cutlass.cute.arch.get_dyn_smem_size API to get runtime dynamic shared memory size.
- Various API support for SM100 BlockScaled Gemm
  - Introduce BlockScaled MmaOps in tcgen05/mma.py, and provide a make_blockscaled_trivial_tiled_mma function in blackwell_helpers.py to help construct a BlockScaled TiledMma.
  - Introduce S2T CopyOps in tcgen05/copy.py.
  - Introduce BlockScaled layout utilities in blockscaled_layout.py for creating the required scale factor layouts in global memory, shared memory and tensor memory.
- cutlass.cute.compile now supports compilation options. Refer to JIT compilation options for more details.
- cutlass.cute.testing.assert_ now works for device JIT functions. Specify --enable-device-assertions as a compilation option to enable it.
- cutlass.cute.make_tiled_copy is now deprecated. Please use cutlass.cute.make_tiled_copy_tv instead.
- Shared memory capacity query
  - Introduce cutlass.utils.get_smem_capacity_in_bytes for querying the shared memory capacity.
  - <arch>_utils.SMEM_CAPACITY["<arch_str>"] is now deprecated.
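When migrating existing kernels, it can help to locate uses of the names deprecated in this release. The following is a minimal sketch of a text scanner built only from the deprecations listed above; the function name find_deprecated and the sample snippet are hypothetical. cutlass.cute.make_tiled_copy is left out on purpose, since the 4.2.0 notes say it was added back.

```python
import re

# Deprecated names and the replacements stated in the 4.1.0 notes.
DEPRECATED = {
    r"\bcutlass\.range_dynamic\b": "use range or cutlass.range",
    r"\bcutlass\.dynamic_expr\b": "remove it",
    r"\bSMEM_CAPACITY\s*\[": "use cutlass.utils.get_smem_capacity_in_bytes",
}


def find_deprecated(source):
    """Return (line_number, matched_text, advice) for each deprecated use."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, advice in DEPRECATED.items():
            match = re.search(pattern, line)
            if match:
                hits.append((lineno, match.group(0), advice))
    return hits


# Hypothetical kernel snippet, used only to exercise the scanner.
# arch_utils and "arch_str" are placeholders, not real module or key names.
sample = """\
for i in cutlass.range_dynamic(n):
    if cutlass.dynamic_expr(i < k):
        pass
cap = arch_utils.SMEM_CAPACITY["arch_str"]
"""

for lineno, text, advice in find_deprecated(sample):
    print(f"line {lineno}: {text} -> {advice}")
```

A plain regular-expression scan like this can miss aliased imports and can flag matches inside comments or strings, so its output is a starting point for review rather than a complete audit.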
4.0.0 (2025-06-03)
- Fixed API mismatch in class cute.runtime.Pointer: changed element_type to dtype to match typing.Pointer.
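Code that has to run against both sides of this rename could read the attribute defensively. The sketch below is one possible approach, not an official compatibility helper: pointer_dtype is a hypothetical name, and the two stand-in classes exist only so the example runs without the CuTe DSL installed.

```python
def pointer_dtype(ptr):
    """Return the element data type of a pointer-like object.

    Prefers the attribute name `dtype` introduced by the rename and
    falls back to the older `element_type` name.
    """
    try:
        return ptr.dtype
    except AttributeError:
        return ptr.element_type


# Stand-in classes for illustration only; these are not CuTe DSL types.
class NewStylePointer:
    dtype = "Float16"


class OldStylePointer:
    element_type = "Float16"


print(pointer_dtype(NewStylePointer()), pointer_dtype(OldStylePointer()))
```

Once the older releases no longer need to be supported, direct access to dtype is simpler and the helper can be dropped.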