Changelog for CuTe DSL API changes
4.2.0 (2025-09-15)
- Added back `cute.make_tiled_copy` at the community's request.
- Added support for explicit and implicit broadcast in `TensorSSA`.
  - `cutlass.cute.TensorSSA`: supports `broadcast_to` and implicit broadcasting for binary operations.
- Supported printing a `TensorSSA` value in `cutlass.cute.print_tensor`.
- Updated `cute.gemm` to support all dispatch patterns and improved checks for illegal inputs.
- Introduced automatic kernel smem usage calculation for launch config.
- Introduced per-op fast-math control for math ops (e.g. `exp`, `exp2`, `log2`, `log`).
- Introduced `CopyReduceBulkTensorTileS2GOp` in tcgen05/copy.py to support TMA Reduce.
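
The `TensorSSA` broadcast entry above does not spell out the shape rules. As a rough mental model only, the sketch below computes a result shape under NumPy-style broadcasting rules in plain Python. It assumes `TensorSSA` follows comparable rules, which this changelog does not confirm, and the helper `broadcast_shape` is hypothetical and not part of CuTe DSL.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the NumPy-style broadcast of two shapes, or raise ValueError.

    Hypothetical helper for illustration; not a CuTe DSL API.
    """
    result = []
    # Walk both shapes from the trailing dimension; a missing dimension counts as 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == b or b == 1:
            result.append(a)
        elif a == 1:
            result.append(b)
        else:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(reversed(result))


# Implicit broadcast of a (4, 1) operand against a (3,) operand.
print(broadcast_shape((4, 1), (3,)))
```

Under this model, an explicit `broadcast_to` corresponds to expanding one operand to the computed shape up front, while implicit broadcasting applies the same rule inside a binary operation.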
4.1.0 (2025-07-16)
- for loop
  - Python built-in `range` now always generates code and executes at runtime.
  - `cutlass.range` is an advanced `range` with kernel-code-level unrolling and pipelining control.
  - Deprecated `cutlass.range_dynamic`; please replace it with `range` or `cutlass.range`.
  - Experimental: added `pipelining` control for compiler-generated software pipeline code.
- while/if
  - `while`/`if` now by default generates code and executes at runtime unless `cutlass.const_expr` is specified for the predicate.
  - Deprecated `cutlass.dynamic_expr`; please remove it.
- Rename mbarrier functions to reduce ambiguity.
- Modify SyncObject API (`MbarrierArray`, `NamedBarrier`, `TmaStoreFence`) to match `std::barrier`.
- Change pipeline `create` function to take only keyword arguments, and make `barrier_storage` optional.
- Introduce `cutlass.cute.arch.get_dyn_smem_size` API to get runtime dynamic shared memory size.
- Various API support for SM100 BlockScaled Gemm
  - Introduce BlockScaled MmaOps in tcgen05/mma.py, and provide a `make_blockscaled_trivial_tiled_mma` function in blackwell_helpers.py to help construct a BlockScaled TiledMma.
  - Introduce S2T CopyOps in tcgen05/copy.py.
  - Introduce BlockScaled layout utilities in blockscaled_layout.py for creating the required scale factor layouts in global memory, shared memory and tensor memory.
- `cutlass.cute.compile` now supports compilation options. Refer to JIT compilation options for more details.
- `cutlass.cute.testing.assert_` now works for device JIT functions. Specify `--enable-device-assertions` as a compilation option to enable it.
- `cutlass.cute.make_tiled_copy` is now deprecated. Please use `cutlass.cute.make_tiled_copy_tv` instead.
- Shared memory capacity query
  - Introduce `cutlass.utils.get_smem_capacity_in_bytes` for querying the shared memory capacity.
  - `<arch>_utils.SMEM_CAPACITY["<arch_str>"]` is now deprecated.
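
For readers updating call sites after the pipeline `create` change listed above, the following plain-Python sketch shows what a keyword-only signature with an optional `barrier_storage` implies for callers. `ExamplePipeline` and its `num_stages` parameter are hypothetical stand-ins used for illustration only. They are not CUTLASS types, and the real `create` functions take different parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExamplePipeline:
    """Hypothetical stand-in for a pipeline class; not a CUTLASS type."""

    num_stages: int
    barrier_storage: Optional[object] = None

    @staticmethod
    def create(*, num_stages: int,
               barrier_storage: Optional[object] = None) -> "ExamplePipeline":
        # The bare `*` makes every parameter keyword-only, so positional
        # calls fail instead of silently binding to the wrong parameter.
        return ExamplePipeline(num_stages=num_stages,
                               barrier_storage=barrier_storage)


# Keyword call; barrier_storage is omitted because it is optional.
pipe = ExamplePipeline.create(num_stages=4)

# A positional call is rejected with a TypeError.
try:
    ExamplePipeline.create(4)
except TypeError as exc:
    print("positional call rejected:", exc)
```

In practice this means existing positional call sites need each argument named explicitly, and `barrier_storage` can be dropped where it was only passed to satisfy the old signature.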
4.0.0 (2025-06-03)
- Fixed API mismatch in class `cute.runtime.Pointer`: change `element_type` to `dtype` to match `typing.Pointer`.