Changelog for CuTe DSL API changes
4.3.0 (2025-10-07)
- Debuggability improvements:
  - Supported source location tracking for DSL APIs
  - Supported dumping PTX and SASS code
- Remove deprecated cutlass.<arch>_utils.SMEM_CAPACITY["<arch_str>"] and cutlass.utils.ampere_helpers
- Support calling nested functions without capturing variables inside dynamic control flow
- Replace usage of cute.arch.barrier in examples with corresponding APIs in pipeline
  - Use pipeline.sync for simple cases like synchronizing the whole CTA
  - Use pipeline.NamedBarrier to customize barriers with different participating threads and barrier id
- Added new APIs repeat and repeat_as_tuple
- Added new API make_rmem_tensor to replace make_fragment with better naming
- Added new API make_rmem_tensor_like, which creates an rmem tensor from a tensor using the same shape with compact col-major strides
- Added TmemAllocator for allocating tensor memory
- Updated SmemAllocator.allocate to support allocation of a single scalar value
- Fixed TensorSSA.reduce to support a static value as the initial value
- Updated docstrings for the following APIs to be more concise and easier to understand:
  - make_layout_tv
  - is_static
  - PipelineAsync
  - SmemAllocator
- Fixed documentation for pipeline, utils and cute.math
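To make the phrase "compact col-major strides" in the make_rmem_tensor_like entry concrete, here is a small pure-Python sketch of the arithmetic. The helper name compact_col_major_strides is hypothetical and not part of the CuTe DSL, and the sketch assumes a flat (non-nested) shape.

```python
from itertools import accumulate
from operator import mul


def compact_col_major_strides(shape):
    """Hypothetical helper (not a CuTe DSL API).

    Returns the strides of a compact column-major layout for a flat shape:
    the first mode has stride 1, and each later stride is the product of
    all earlier extents.
    """
    # accumulate(..., initial=1) yields 1, s0, s0*s1, ..., total size;
    # the last value is the total size, so it is dropped.
    return tuple(accumulate(shape, mul, initial=1))[:-1]


strides = compact_col_major_strides((4, 8, 2))
print(strides)
```

For a shape of (4, 8, 2), the leftmost mode is contiguous and the strides grow left to right.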
4.2.0 (2025-09-10)
- Added back cute.make_tiled_copy per the request from the community
- Added support for explicit and implicit broadcast in TensorSSA
  - cutlass.cute.TensorSSA: support broadcast_to and implicit broadcasting for binary operations.
- Supported printing TensorSSA value in cutlass.cute.print_tensor
- Updated cute.gemm to support all dispatch patterns and improved checks for illegal inputs
- Introduced automatic kernel smem usage calculation for launch config.
- Introduced per op fast-math control for math ops (e.g. exp, exp2, log2, log)
- Introduced CopyReduceBulkTensorTileS2GOp in tcgen05/copy.py to support TMA Reduce.
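The TensorSSA broadcast entry above distinguishes explicit broadcast (broadcast_to) from implicit broadcast in binary operations. The pure-Python sketch below shows how a result shape could be derived, assuming NumPy-style rules (shapes aligned from the right, an extent of 1 stretches to match). Whether TensorSSA follows exactly these rules is an assumption, and broadcast_shape is a hypothetical helper, not a CuTe DSL API.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Hypothetical helper: result shape of broadcasting shape a against
    shape b under NumPy-style rules (assumed, not confirmed for TensorSSA)."""
    out = []
    # Walk both shapes from the trailing dimension; missing dims count as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {a} with {b}")
        # The extent that is not 1 wins; equal extents pass through.
        out.append(x if y == 1 else y)
    return tuple(reversed(out))


print(broadcast_shape((4, 1), (3,)))
```

Under these assumed rules, an explicit broadcast picks the target shape up front, while an implicit broadcast derives it from the two operands of the binary operation.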
4.1.0 (2025-07-16)
- for loop
  - Python built-in range now always generates code and executes at runtime
  - cutlass.range is an advanced range with kernel code level unrolling and pipelining control
  - Deprecated cutlass.range_dynamic, please replace with range or cutlass.range
  - Experimental: Added pipelining control for compiler generated software pipeline code
- while/if
  - while/if now by default generates code and executes at runtime unless cutlass.const_expr is specified for the predicate
  - Deprecated cutlass.dynamic_expr, please remove it
- Rename mbarrier functions to reduce ambiguity
- Modify SyncObject API (MbarrierArray, NamedBarrier, TmaStoreFence) to match std::barrier
- Change pipeline create function to take only keyword arguments, and make barrier_storage optional.
- Introduce cutlass.cute.arch.get_dyn_smem_size API to get runtime dynamic shared memory size.
- Various API Support for SM100 BlockScaled Gemm
  - Introduce BlockScaled MmaOps in tcgen05/mma.py, and provide a make_blockscaled_trivial_tiled_mma function in blackwell_helpers.py to help construct a BlockScaled TiledMma.
  - Introduce S2T CopyOps in tcgen05/copy.py.
  - Introduce BlockScaled layout utilities in blockscaled_layout.py for creating the required scale factor layouts in global memory, shared memory and tensor memory.
- cutlass.cute.compile now supports compilation options. Refer to JIT compilation options for more details.
- cutlass.cute.testing.assert_ now works for device JIT functions. Specify --enable-device-assertions as a compilation option to enable it.
- cutlass.cute.make_tiled_copy is now deprecated. Please use cutlass.cute.make_tiled_copy_tv instead.
- Shared memory capacity query
  - Introduce cutlass.utils.get_smem_capacity_in_bytes for querying the shared memory capacity.
  - <arch>_utils.SMEM_CAPACITY["<arch_str>"] is now deprecated.
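The 4.1.0 change that makes the pipeline create function keyword-only affects call sites that passed arguments positionally. The sketch below shows the general Python pattern with a hypothetical stand-in class. ExamplePipeline and its num_stages parameter are illustrative assumptions and do not reproduce the actual signatures in cutlass.pipeline; only the keyword-only behavior and the optional barrier_storage come from the entry above.

```python
from typing import Optional


class ExamplePipeline:
    """Hypothetical stand-in, not a class from cutlass.pipeline."""

    def __init__(self, num_stages: int, barrier_storage: Optional[object]):
        self.num_stages = num_stages
        self.barrier_storage = barrier_storage

    @staticmethod
    def create(*, num_stages: int, barrier_storage: Optional[object] = None):
        # The bare `*` forces every argument to be passed by keyword.
        # barrier_storage defaults to None, so callers may omit it.
        return ExamplePipeline(num_stages, barrier_storage)


# Keyword call works, with or without barrier_storage.
p = ExamplePipeline.create(num_stages=2)

# A positional call is rejected by Python itself.
try:
    ExamplePipeline.create(2)
except TypeError as err:
    print("positional call rejected:", type(err).__name__)
```

Code written against earlier releases that passed these arguments positionally would need each argument named explicitly.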
4.0.0 (2025-06-03)
- Fixed API mismatch in class cute.runtime.Pointer: change element_type to dtype to match typing.Pointer