13. Release Notes#

13.1. Known Issues#

  • The programming model is missing a section on cross-tile-block kernels such as split-k.

  • The bytecode section does not provide the exact encoding of each operation; expect this to be introduced in a future release.

  • The semi-formal memory model section is written but does not provide detailed examples of how to use it.

  • Atomics are currently limited in Tile IR and will be expanded in a future release.

  • On sm_120, declaring an f16 constant, converting it to an FP8 type, and then printing the result can cause a compiler crash.
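Since the programming model does not yet cover split-k, the following is a minimal sketch of the general idea in plain Python, not in Tile IR. It assumes the usual meaning of the term: the reduction (K) dimension of a matrix multiply is divided into chunks, each chunk produces a partial result (as a separate tile block would), and the partial results are summed. The names `matmul_partial` and `split_k_matmul` are hypothetical and do not correspond to any Tile IR operation.

```python
def matmul_partial(a, b, k_start, k_end):
    """Multiply columns k_start..k_end-1 of a by the matching rows of b."""
    m, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for k in range(k_start, k_end):
                out[i][j] += a[i][k] * b[k][j]
    return out


def split_k_matmul(a, b, num_splits):
    """Compute a @ b by splitting the K dimension into num_splits chunks.

    Each chunk stands in for the work of one tile block; the final loop
    stands in for the cross-block reduction (for example, an atomic add).
    """
    k_total = len(b)
    chunk = (k_total + num_splits - 1) // num_splits  # ceiling division
    m, n = len(a), len(b[0])
    result = [[0.0] * n for _ in range(m)]
    for s in range(num_splits):
        k_start = s * chunk
        k_end = min(k_start + chunk, k_total)
        partial = matmul_partial(a, b, k_start, k_end)
        for i in range(m):
            for j in range(n):
                result[i][j] += partial[i][j]
    return result
```

In a real kernel the partial products would be computed concurrently and combined through global memory, which is why the topic interacts with the atomics and memory model items listed above.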

13.2. Changelog#

13.2.1. Spec 13.3 (2026-05-27)#

Supported Architectures#

  • Added support for Hopper (sm_90) architecture.

New Operations#

  • Added cuda_tile.alloca operation for automatic memory allocation.

  • Added cuda_tile.atomic_red_view_tko operation for view-based atomic reduction on global memory.

  • Added cuda_tile.make_gather_scatter_view operation to create a gather/scatter view from a tensor view.

  • Added cuda_tile.make_strided_view operation to create a strided view from a tensor view.

  • Added cuda_tile.mmaf_scaled operation for floating-point matrix-multiply-accumulate with scaled inputs on sm_100 and above.

  • Added cuda_tile.pack operation to pack a tile into a byte array.

  • Added cuda_tile.unpack operation to unpack a byte array into a tile.
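As a rough illustration of what packing a tile into a byte array means for sub-byte types, the sketch below packs signed 4-bit integers two per byte in plain Python. It is not the definition of cuda_tile.pack or cuda_tile.unpack: the function names are hypothetical, and the nibble order (first element in the low four bits) is an assumption made for the example. The authoritative layout is the 4-bit memory layout documented with the tensor view type.

```python
def pack_i4(values):
    """Pack signed 4-bit integers (-8..7) into bytes, two values per byte.

    Assumption for this sketch: the first value of each pair goes in the
    low nibble and the second in the high nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("need an even number of 4-bit values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in a signed 4-bit integer")
        # Masking with 0xF yields the two's-complement nibble.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_i4(data):
    """Inverse of pack_i4: expand each byte into two signed 4-bit integers."""
    values = []
    for byte in data:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 represent the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values
```

Packing and then unpacking returns the original values, which is the round-trip property one would expect of a pack/unpack pair.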

Updated Operations#

  • Added bf16 support to the ADDF mode of cuda_tile.atomic_rmw_tko.

  • Added constant attribute to cuda_tile.global to mark globals as immutable/read-only.

  • Added default key support for optimization hints on cuda_tile.entry as a fallback when no target-specific hint is given.

  • Added fast_acc attribute to cuda_tile.mmaf for faster but less precise FP8 MMA accumulation on Hopper GPUs.

  • Added i4 type support to cuda_tile.exti, cuda_tile.trunci, cuda_tile.pack, and cuda_tile.unpack.

  • Added num_worker_warps_per_cta optimization hint to cuda_tile.entry.

  • Added producer attribute to cuda_tile.module for identifying the generating tool.

  • Added rounding_mode attribute to cuda_tile.exp (approx and full modes).

  • Added symbol_visibility attribute (public/private) to cuda_tile.global.

  • Changed the index types accepted by cuda_tile.load_view_tko and cuda_tile.store_view_tko from scalar-only to also allow 1D tensor indices (for gather/scatter).

Features#

  • Added f4E2M1FN (4-bit floating-point) type.

  • Added gather_scatter_view type for gather/scatter access patterns over tensor views.

  • Added i4 (4-bit integer) type for quantization support. i4 tiles must be converted to a supported integer type for use in operations.

  • Added strided_view type for strided tile views with configurable traversal strides.
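To make the 4-bit floating-point type concrete, the sketch below decodes a 4-bit pattern into a Python float. It assumes the commonly used E2M1 encoding that the type name suggests: one sign bit, two exponent bits with a bias of 1, one mantissa bit, and no infinity or NaN encodings. The function name `decode_f4e2m1fn` is hypothetical; consult the type section of the specification for the authoritative definition.

```python
def decode_f4e2m1fn(bits):
    """Decode a 4-bit E2M1 pattern (0..15) to a float.

    Assumed bit layout, most significant bit first:
    sign (1 bit), exponent (2 bits, bias 1), mantissa (1 bit).
    """
    if not 0 <= bits <= 0xF:
        raise ValueError("expected a 4-bit value")
    sign = -1.0 if bits & 0x8 else 1.0
    exponent = (bits >> 1) & 0x3
    mantissa = bits & 0x1
    if exponent == 0:
        # Subnormal: no implicit leading one, so the values are 0 and 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading one, exponent bias of 1.
        magnitude = 2.0 ** (exponent - 1) * (1.0 + 0.5 * mantissa)
    return sign * magnitude
```

Under these assumptions the non-negative representable values are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, with the same magnitudes available as negatives.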

Bugfixes#

  • Fixed a bug where cuda_tile.atomic_rmw_tko with FADD on f16 could behave incorrectly because the memory scope and memory ordering semantics were dropped during lowering.

Improved Documentation#

  • Added 4-bit memory layout documentation to the tensor view type.

  • Improved overflow and undefined-behavior documentation for cuda_tile.ftoi and cuda_tile.itof.