Changelog for CuTe DSL API changes
4.1.0 (2025-07-16)
- for loop
  - Python built-in range now always generates IR and executes at runtime.
  - cutlass.range is an advanced range with IR-level unrolling and pipelining control.
  - Deprecated cutlass.range_dynamic; please replace it with range or cutlass.range.
  - Experimental: added pipelining control for compiler-generated software pipeline code.
- while/if
  - while/if now generates IR and executes at runtime by default, unless cutlass.const_expr is specified for the predicate.
  - Deprecated cutlass.dynamic_expr; please remove it.
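
The for loop and while/if entries above turn on one distinction: control flow that is evaluated while the program is being traced versus control flow that is emitted into the IR and runs later. The following is a minimal toy model of that distinction in plain Python. It does not use or reproduce the CuTe DSL; ToyTracer, trace_unrolled, trace_runtime_loop, and trace_if are hypothetical names invented for this sketch, and the "IR" is just a list of strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTracer:
    """Hypothetical stand-in for a compiler front end; records 'IR' as strings."""
    ops: List[str] = field(default_factory=list)

    def emit(self, op: str) -> None:
        self.ops.append(op)


def trace_unrolled(tracer: ToyTracer, n: int) -> None:
    # Trace-time loop: Python itself iterates while tracing, so the body is
    # recorded n times and no loop construct appears in the recorded program.
    for i in range(n):
        tracer.emit(f"add acc, {i}")


def trace_runtime_loop(tracer: ToyTracer, n: int) -> None:
    # Runtime loop: one loop op is recorded and the body appears once,
    # regardless of the trip count.
    tracer.emit(f"for i in 0..{n}:")
    tracer.emit("  add acc, i")


def trace_if(tracer: ToyTracer, pred: bool, is_const: bool) -> None:
    if is_const:
        # Predicate known while tracing: only the taken branch is recorded.
        if pred:
            tracer.emit("then-branch")
    else:
        # Predicate decided at runtime: the branch itself is recorded.
        tracer.emit("if pred:")
        tracer.emit("  then-branch")


unrolled = ToyTracer()
trace_unrolled(unrolled, 4)

looped = ToyTracer()
trace_runtime_loop(looped, 4)

const_false = ToyTracer()
trace_if(const_false, pred=False, is_const=True)

dynamic = ToyTracer()
trace_if(dynamic, pred=False, is_const=False)
```

In this toy model, the unrolled trace grows with the trip count while the runtime loop stays at a fixed size, and a constant false predicate leaves nothing in the recorded program. How the real DSL lowers these constructs is not specified by this changelog.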
- Rename mbarrier functions to reduce ambiguity.
- Modify SyncObject API (MbarrierArray, NamedBarrier, TmaStoreFence) to match std::barrier.
- Change pipeline create function to take only keyword arguments, and make barrier_storage optional.
- Introduce cutlass.cute.arch.get_dyn_smem_size API to get the runtime dynamic shared memory size.
- Various API support for SM100 BlockScaled Gemm
  - Introduce BlockScaled MmaOps in tcgen05/mma.py, and provide a make_blockscaled_trivial_tiled_mma function in blackwell_helpers.py to help construct a BlockScaled TiledMma.
  - Introduce S2T CopyOps in tcgen05/copy.py.
  - Introduce BlockScaled layout utilities in blockscaled_layout.py for creating the required scale factor layouts in global memory, shared memory, and tensor memory.
- cutlass.cute.compile now supports compilation options. Refer to JIT compilation options for more details.
- cutlass.cute.testing.assert_ now works for device JIT functions. Specify --enable-device-assertions as a compilation option to enable it.
- cutlass.cute.make_tiled_copy is now deprecated. Please use cutlass.cute.make_tiled_copy_tv instead.
- Shared memory capacity query
  - Introduce cutlass.utils.get_smem_capacity_in_bytes for querying the shared memory capacity.
  - <arch>_utils.SMEM_CAPACITY["<arch_str>"] is now deprecated.
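
The pipeline create change listed above (keyword-only arguments, with barrier_storage optional) affects how call sites must be written. The sketch below shows the general Python mechanism only. ToyPipeline and its num_stages parameter are hypothetical stand-ins chosen for illustration, not the actual CuTe DSL pipeline classes or their full parameter list.

```python
from typing import Optional


class ToyPipeline:
    """Hypothetical stand-in used only to illustrate the calling convention."""

    def __init__(self, num_stages: int, barrier_storage: Optional[object]):
        self.num_stages = num_stages
        self.barrier_storage = barrier_storage

    @staticmethod
    def create(*, num_stages: int,
               barrier_storage: Optional[object] = None) -> "ToyPipeline":
        # The bare `*` makes every following parameter keyword-only.
        # barrier_storage defaults to None, so callers may omit it.
        return ToyPipeline(num_stages, barrier_storage)


# Keyword call without barrier_storage is accepted.
pipe = ToyPipeline.create(num_stages=4)

# A positional call is rejected by Python with a TypeError.
try:
    ToyPipeline.create(4)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

With a signature of this shape, existing call sites that pass arguments positionally fail immediately with a TypeError, so they are easy to find when migrating.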
4.0.0 (2025-06-03)
- Fixed API mismatch in class cute.runtime.Pointer: change element_type to dtype to match typing.Pointer.