Changelog for CuTe DSL API changes

4.3.0 (2025-10-20)

Debuggability improvements:
- Added support for source location tracking for DSL APIs
- Added support for dumping PTX and CUBIN
Removed deprecated cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"] and cutlass.utils.ampere_helpers
Added support for calling nested functions that do not capture variables inside dynamic control flow
Replaced usage of cute.arch.barrier in examples with corresponding APIs in pipeline
- Use pipeline.sync for simple cases like synchronizing the whole CTA
- Use pipeline.NamedBarrier to customize barriers with different participating threads and barrier IDs
Added new APIs repeat and repeat_as_tuple
Added new API make_rmem_tensor to create a tensor in register memory (replaces make_fragment with a clearer name)
Added new API make_rmem_tensor_like, which creates an rmem tensor with the same shape as a given tensor and compact col-major strides
Added TmemAllocator for allocating tensor memory
Updated SmemAllocator.allocate to support allocation of a single scalar value
Fixed TensorSSA.reduce to support a static value as the initial value
Updated docstrings for the following APIs to be more concise and easier to understand:
- make_layout_tv
- is_static
- PipelineAsync
- SmemAllocator
Fixed documentation for pipeline, utils and cute.math (cute.math is part of top level documentation)
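
The "compact col-major strides" mentioned for make_rmem_tensor_like above can be illustrated with a small plain-Python sketch. It does not call the CuTe DSL; the helper names compact_col_major_strides and linear_offset are hypothetical, and the sketch assumes a flat (non-nested) shape in which the leftmost mode varies fastest.

```python
def compact_col_major_strides(shape):
    """Return compact col-major strides for a flat shape.

    Hypothetical helper for illustration only, not a CuTe DSL API.
    The leftmost mode gets stride 1; each later stride is the product
    of all extents to its left, so the layout has no gaps.
    """
    strides = []
    running = 1
    for extent in shape:
        strides.append(running)
        running *= extent
    return tuple(strides)


def linear_offset(coord, strides):
    """Map a coordinate to a linear offset using the given strides."""
    return sum(c * s for c, s in zip(coord, strides))


shape = (4, 8, 2)
strides = compact_col_major_strides(shape)
print(strides)                            # (1, 4, 32)
print(linear_offset((3, 7, 1), strides))  # 63, the last element of 64
```

Because the strides are compact, the largest coordinate maps to offset size - 1, so the tensor occupies exactly as many elements as its shape holds.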

4.2.0 (2025-09-10)

Restored cute.make_tiled_copy at the request of the community
Added support for explicit and implicit broadcast in TensorSSA
- cutlass.cute.TensorSSA: support broadcast_to and implicit broadcasting for binary operations.
Added support for printing TensorSSA values in cutlass.cute.print_tensor
Updated cute.gemm to support all dispatch patterns and improved checks for illegal inputs
Introduced automatic kernel smem usage calculation for launch config.
Introduced per-op fast-math control for math ops (e.g. exp, exp2, log2, log)
Introduced CopyReduceBulkTensorTileS2GOp in tcgen05/copy.py to support TMA Reduce.
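
As a rough illustration of what implicit broadcasting in a binary operation means for shapes, the following plain-Python sketch computes the result shape of two equal-rank operands. It does not use TensorSSA; the function name broadcast_shape is hypothetical, and the rule that a size-1 mode stretches to match the other operand is an assumption borrowed from NumPy-style broadcasting, not something this changelog specifies.

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of two equal-rank shapes.

    Hypothetical helper for illustration only, not a CuTe DSL API.
    Assumed rule: in each mode the extents must match, or one of
    them must be 1, in which case it stretches to the other extent.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-rank shapes")
    result = []
    for extent_a, extent_b in zip(a, b):
        if extent_a != extent_b and extent_a != 1 and extent_b != 1:
            raise ValueError(f"cannot broadcast {a} with {b}")
        result.append(max(extent_a, extent_b))
    return tuple(result)


print(broadcast_shape((4, 1), (1, 8)))  # (4, 8)
print(broadcast_shape((4, 8), (4, 8)))  # (4, 8)
```

An explicit broadcast_to corresponds to the same check with the target shape given up front instead of being derived from the second operand.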

4.1.0 (2025-07-16)

for loop
- Python built-in range now always generates code and executes at runtime
- cutlass.range is an advanced range with kernel-code-level unrolling and pipelining control
- Deprecated cutlass.range_dynamic; replace it with range or cutlass.range
- Experimental: added pipelining control for compiler-generated software pipeline code
while/if
- while/if now generate code and execute at runtime by default, unless cutlass.const_expr is specified for the predicate
- Deprecated cutlass.dynamic_expr; remove any uses of it
Rename mbarrier functions to reduce ambiguity
Modify SyncObject API (MbarrierArray, NamedBarrier, TmaStoreFence) to match std::barrier
Change pipeline create function to take only keyword arguments, and make barrier_storage optional.
Introduce the cutlass.cute.arch.get_dyn_smem_size API to get the runtime dynamic shared memory size.
Various API Support for SM100 BlockScaled Gemm
- Introduce BlockScaled MmaOps in tcgen05/mma.py, and provide a make_blockscaled_trivial_tiled_mma function in blackwell_helpers.py to help construct a BlockScaled TiledMma.
- Introduce S2T CopyOps in tcgen05/copy.py.
- Introduce BlockScaled layout utilities in blockscaled_layout.py for creating the required scale factor layouts in global memory, shared memory and tensor memory.
cutlass.cute.compile now supports compilation options. Refer to JIT compilation options for more details.
cutlass.cute.testing.assert_ now works for device JIT functions. Specify --enable-assertions as a compilation option to enable it.
cutlass.cute.make_tiled_copy is now deprecated. Please use cutlass.cute.make_tiled_copy_tv instead.
Shared memory capacity query
- Introduce cutlass.utils.get_smem_capacity_in_bytes for querying the shared memory capacity.
- <arch>_utils.SMEM_CAPACITY["<arch_str>"] is now deprecated.

4.0.0 (2025-06-03)

Fixed API mismatch in class cute.runtime.Pointer: changed element_type to dtype to match typing.Pointer