13. Release Notes
13.1. Known Issues
The programming model is missing a section on a cross-tile block kernel such as split-k.
The bytecode section does not provide the exact encoding of each operation; expect this to be introduced in a future release.
The semi-formal memory model section is written but does not provide detailed examples of how to use it.
Atomics are currently limited in Tile IR and will be expanded in a future release.
On sm_120, declaring an f16 constant, converting it to an FP8 type, and then printing the result can cause a compiler crash.
13.2. Changelog
13.2.1. Spec 13.3 (2026-05-27)
Supported Architectures
Added support for Hopper (sm_90) architecture.
New Operations
Added cuda_tile.alloca operation for automatic memory allocation.
Added cuda_tile.atomic_red_view_tko operation for view-based atomic reduction on global memory.
Added cuda_tile.make_gather_scatter_view operation to create a gather/scatter view from a tensor view.
Added cuda_tile.make_strided_view operation to create a strided view from a tensor view.
Added cuda_tile.mmaf_scaled operation for floating-point matrix-multiply-accumulate with scaled inputs on sm_100 and above.
Added cuda_tile.pack operation to pack a tile into a byte array.
Added cuda_tile.unpack operation to unpack a byte array into a tile.
Updated Operations
Added bf16 support to the ADDF mode of cuda_tile.atomic_rmw_tko.
Added constant attribute to cuda_tile.global to mark globals as immutable/read-only.
Added default key support for optimization hints on cuda_tile.entry as a fallback when no target-specific hint is given.
Added fast_acc attribute to cuda_tile.mmaf for faster but less precise FP8 MMA accumulation on Hopper GPUs.
Added i4 type support to cuda_tile.exti, cuda_tile.trunci, cuda_tile.pack, and cuda_tile.unpack.
Added num_worker_warps_per_cta optimization hint to cuda_tile.entry.
Added producer attribute to cuda_tile.module for identifying the generating tool.
Added rounding_mode attribute to cuda_tile.exp (approx and full modes).
Added symbol_visibility attribute (public/private) to cuda_tile.global.
Changed index types on cuda_tile.load_view_tko and cuda_tile.store_view_tko from scalar-only to support 1D tensor indices (for gather/scatter).
Features
Added f4E2M1FN (4-bit floating-point) type.
Added gather_scatter_view type for gather/scatter access patterns over tensor views.
Added i4 (4-bit integer) type for quantization support. i4 tiles must be converted to a supported integer type for use in operations.
Added strided_view type for strided tile views with configurable traversal strides.
Bugfixes
Fixed a bug where cuda_tile.atomic_rmw_tko with FADD on f16 could produce incorrect behavior because memory scope and memory ordering semantics were dropped during lowering.
Improved Documentation
Added 4-bit memory layout documentation to the tensor view type.
Improved overflow and undefined-behavior documentation for cuda_tile.ftoi and cuda_tile.itof.