Changelog for CuTe DSL API changes
4.3.0 (2025-10-20)
Debuggability improvements:
Supported source location tracking for DSL APIs
Supported dumping PTX and CUBIN
Removed deprecated cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"] and cutlass.utils.ampere_helpers
Supported calling nested functions without capturing variables inside dynamic control flow
Replaced usage of cute.arch.barrier in examples with corresponding APIs in pipeline
Use pipeline.sync for simple cases like synchronizing the whole CTA
Use pipeline.NamedBarrier to customize barriers with different participating threads and barrier id
Added new APIs repeat and repeat_as_tuple
Added new API make_rmem_tensor to create a tensor in register memory (replaces make_fragment with better naming)
Added new API make_rmem_tensor_like, which creates an rmem tensor from a tensor using the same shape with compact col-major strides
Added TmemAllocator for allocating tensor memory
Updated SmemAllocator.allocate to support allocation of a single scalar value
Fixed TensorSSA.reduce to support a static value as the initial value
Updated docstrings for the following APIs to be more concise and easier to understand: make_layout_tv, is_static, PipelineAsync, SmemAllocator
Fixed documentation for pipeline, utils and cute.math (cute.math is part of top level documentation)
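The "compact col-major strides" mentioned for make_rmem_tensor_like can be illustrated without the DSL. The sketch below is plain Python, not CuTe DSL code: compact_col_major_strides and linear_offset are hypothetical helper names introduced only for this illustration, and the sketch assumes a flat tuple of integer extents (nested shapes are not handled).

```python
from typing import Tuple


def compact_col_major_strides(shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Return compact column-major strides for a flat integer shape.

    Hypothetical helper for illustration; not part of the CuTe DSL.
    The leftmost mode is contiguous (stride 1), and each later mode's
    stride is the product of the extents of all modes to its left.
    """
    strides = []
    running = 1
    for extent in shape:
        strides.append(running)
        running *= extent
    return tuple(strides)


def linear_offset(coord: Tuple[int, ...], strides: Tuple[int, ...]) -> int:
    """Map a coordinate to a linear offset as the inner product with the strides."""
    return sum(c * s for c, s in zip(coord, strides))


shape = (4, 8, 2)
strides = compact_col_major_strides(shape)
print(strides)                            # (1, 4, 32)
print(linear_offset((3, 7, 1), strides))  # 63, the last element of 4 * 8 * 2 = 64
```

Because the strides are compact, the largest coordinate maps to offset size - 1, so the tensor occupies exactly as many elements as its shape describes, regardless of the strides of the tensor it was created from.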
4.2.0 (2025-09-10)
Added back cute.make_tiled_copy per the request from community
Added support for explicit and implicit broadcast in TensorSSA
cutlass.cute.TensorSSA: support broadcast_to and implicit broadcasting for binary operations.
Supported printing TensorSSA value in cutlass.cute.print_tensor
Updated cute.gemm to support all dispatch patterns and improved checks for illegal inputs
Introduced automatic kernel smem usage calculation for launch config.
Introduced per op fast-math control for math ops (e.g. exp, exp2, log2, log)
Introduced CopyReduceBulkTensorTileS2GOp in tcgen05/copy.py to support TMA Reduce.
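As a rough picture of what implicit broadcasting for a binary operation involves, the sketch below computes a result shape from two operand shapes. It is plain Python, not CuTe DSL code, and it assumes NumPy-style right-aligned rules (extents must match or one of them must be 1). The changelog does not state which rules TensorSSA follows, so treat both that assumption and the hypothetical helper name broadcast_shapes as illustrative only.

```python
from typing import Tuple


def broadcast_shapes(a: Tuple[int, ...], b: Tuple[int, ...]) -> Tuple[int, ...]:
    """Compute a broadcast result shape under NumPy-style rules.

    Hypothetical helper for illustration; not part of the CuTe DSL.
    Shapes are aligned at the trailing dimension, and a missing
    leading dimension is treated as extent 1.
    """
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(reversed(result))


print(broadcast_shapes((4, 1), (1, 8)))  # (4, 8)
print(broadcast_shapes((8,), (4, 8)))    # (4, 8)
```

An explicit broadcast, as with broadcast_to, names the target shape up front, whereas the implicit form derives it from the two operands of the binary operation.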
4.1.0 (2025-07-16)
for loop
Python built-in range now always generates code and executes at runtime
cutlass.range is an advanced range with kernel code level unrolling and pipelining control
Deprecated cutlass.range_dynamic, please replace with range or cutlass.range
Experimental: Added pipelining control for compiler generated software pipeline code
while/if
while/if now by default generates code and executes at runtime unless cutlass.const_expr is specified for the predicate
Deprecated cutlass.dynamic_expr, please remove it
Rename mbarrier functions to reduce ambiguity
Modify SyncObject API (MbarrierArray, NamedBarrier, TmaStoreFence) to match std::barrier
Change pipeline create function to take only keyword arguments, and make barrier_storage optional.
Introduce cutlass.cute.arch.get_dyn_smem_size API to get runtime dynamic shared memory size.
Various API Support for SM100 BlockScaled Gemm
Introduce BlockScaled MmaOps in tcgen05/mma.py, and provide a make_blockscaled_trivial_tiled_mma function in blackwell_helpers.py to help construct a BlockScaled TiledMma.
Introduce S2T CopyOps in tcgen05/copy.py.
Introduce BlockScaled layout utilities in blockscaled_layout.py for creating the required scale factor layouts in global memory, shared memory and tensor memory.
cutlass.cute.compile now supports compilation options. Refer to JIT compilation options for more details.
cutlass.cute.testing.assert_ now works for device JIT functions. Specify --enable-device-assertions as a compilation option to enable it.
cutlass.cute.make_tiled_copy is now deprecated. Please use cutlass.cute.make_tiled_copy_tv instead.
Shared memory capacity query
Introduce cutlass.utils.get_smem_capacity_in_bytes for querying the shared memory capacity.
<arch>_utils.SMEM_CAPACITY["<arch_str>"] is now deprecated.
4.0.0 (2025-06-03)
Fixed API mismatch in class cute.runtime.Pointer: change element_type to dtype to match typing.Pointer