13. Release Notes

13.1. Known Issues

The programming model is missing a section on cross-tile-block kernels such as split-K.
The bytecode section does not yet provide the exact encoding of each operation; this is expected to be introduced in a future release.
The semi-formal memory model section is written but does not provide detailed examples of how to use it.
Atomics are currently limited in Tile IR and will be expanded in a future release.
On sm_120, declaring an f16 constant, converting it to an FP8 type, and then printing the result can cause a compiler crash.
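
Since the programming model does not yet cover split-K, the general idea can be sketched outside Tile IR. The following is plain Python, not Tile IR, and every name in it is hypothetical: the K dimension of a matrix multiply is divided into chunks, each chunk yields a partial result (on a GPU, one per tile block), and the partial results are then reduced into the output.

```python
# Illustrative only: plain Python, not Tile IR. All names are hypothetical.

def matmul_partial(a, b, k_start, k_end):
    """Partial product of a (M x K) and b (K x N) over the K range [k_start, k_end)."""
    m, n = len(a), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(k_start, k_end))
    return out


def split_k_matmul(a, b, num_splits):
    """Split the K dimension into num_splits chunks and reduce the partial results."""
    k = len(b)
    chunk = -(-k // num_splits)  # ceiling division
    m, n = len(a), len(b[0])
    acc = [[0.0] * n for _ in range(m)]
    for s in range(num_splits):
        k_start = s * chunk
        k_end = min(k, k_start + chunk)
        # On a GPU, each partial result would come from a different tile block.
        partial = matmul_partial(a, b, k_start, k_end)
        # Cross-block reduction step (for example an atomic add or a second pass).
        for i in range(m):
            for j in range(n):
                acc[i][j] += partial[i][j]
    return acc
```

The reduction step is the part that crosses tile blocks, which is why it depends on the atomics and memory model topics listed above.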

Added cuda_tile.alloca operation for automatic memory allocation.
Added cuda_tile.atomic_red_view_tko operation for view-based atomic reduction on global memory.
Added cuda_tile.make_gather_scatter_view operation to create a gather/scatter view from a tensor view.
Added cuda_tile.make_strided_view operation to create a strided view from a tensor view.
Added cuda_tile.mmaf_scaled operation for floating-point matrix-multiply-accumulate with scaled inputs on sm_100 and above.
Added cuda_tile.pack operation to pack a tile into a byte array.
Added cuda_tile.unpack operation to unpack a byte array into a tile.
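
As a rough mental model of the pack/unpack pair, the sketch below round-trips a small 2D tile through a byte array in plain Python. It is not the Tile IR operation: the helper names are hypothetical, and the row-major flattening and little-endian byte order are assumptions made only for this illustration.

```python
import struct


def pack_tile(tile, fmt="f"):
    """Flatten a 2D tile row by row and reinterpret its elements as raw bytes."""
    flat = [x for row in tile for x in row]
    # "<" selects little-endian; this byte order is an assumption of the sketch.
    return struct.pack(f"<{len(flat)}{fmt}", *flat)


def unpack_tile(data, rows, cols, fmt="f"):
    """Rebuild a rows x cols tile from bytes produced by pack_tile."""
    flat = struct.unpack(f"<{rows * cols}{fmt}", data)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


tile = [[1.0, 2.5], [-3.0, 4.0]]
data = pack_tile(tile)               # 4 elements x 4 bytes (f32)
restored = unpack_tile(data, 2, 2)   # same values as the original tile
```

The element count and element width together determine the byte array length, so the shape has to be supplied again when unpacking.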

Added bf16 support to the ADDF mode of cuda_tile.atomic_rmw_tko.
Added constant attribute to cuda_tile.global to mark globals as immutable/read-only.
Added default key support for optimization hints on cuda_tile.entry as a fallback when no target-specific hint is given.
Added fast_acc attribute to cuda_tile.mmaf for faster but less precise FP8 MMA accumulation on Hopper GPUs.
Added i4 type support to cuda_tile.exti, cuda_tile.trunci, cuda_tile.pack, and cuda_tile.unpack.
Added num_worker_warps_per_cta optimization hint to cuda_tile.entry.
Added producer attribute to cuda_tile.module for identifying the generating tool.
Added rounding_mode attribute to cuda_tile.exp (approx and full modes).
Added symbol_visibility attribute (public/private) to cuda_tile.global.
Changed the index types on cuda_tile.load_view_tko and cuda_tile.store_view_tko to accept 1D tensor indices in addition to scalar indices (for gather/scatter).

Added f4E2M1FN (4-bit floating-point) type.
Added gather_scatter_view type for gather/scatter access patterns over tensor views.
Added i4 (4-bit integer) type for quantization support. i4 tiles must be converted to a supported integer type for use in operations.
Added strided_view type for strided tile views with configurable traversal strides.
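
To make the 4-bit floating-point type more concrete, here is a minimal Python sketch that decodes one 4-bit pattern. It assumes f4E2M1FN follows the usual E2M1 convention: 1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit, finite values only. The function name is hypothetical and the code is not part of Tile IR.

```python
def decode_f4e2m1fn(nibble):
    """Decode a 4-bit E2M1 pattern (0..15) into a Python float."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("expected a 4-bit value")
    sign = -1.0 if nibble & 0x8 else 1.0
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    if exp == 0:
        # Subnormal range: the only magnitudes are 0 and 0.5.
        magnitude = 0.5 * man
    else:
        # Normal range with an assumed exponent bias of 1.
        magnitude = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * magnitude


positive_values = [decode_f4e2m1fn(n) for n in range(8)]
```

Under these assumptions the type can represent only eight magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6), each with either sign.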

Fixed a bug where cuda_tile.atomic_rmw_tko with FADD on f16 could behave incorrectly because its memory scope and memory ordering semantics were dropped during lowering.

Added 4-bit memory layout documentation to the tensor view type.
Improved overflow and undefined-behavior documentation for cuda_tile.ftoi and cuda_tile.itof.
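
For readers new to 4-bit storage, the Python sketch below shows one way two signed i4 values can share a byte, with truncation to 4 bits on the way in and sign extension on the way out. The nibble order (first element in the low nibble) is an assumption of this sketch, and the helper names are hypothetical; the tensor view type documentation is the authority on the actual layout.

```python
def sign_extend_i4(nibble):
    """Interpret the low 4 bits as a two's-complement value in [-8, 7]."""
    nibble &= 0xF
    return nibble - 16 if nibble & 0x8 else nibble


def trunc_to_i4(value):
    """Keep only the low 4 bits of an integer."""
    return value & 0xF


def pack_i4(values):
    """Pack pairs of i4 values into bytes, first element in the low nibble (assumed)."""
    if len(values) % 2:
        raise ValueError("expected an even number of i4 values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append(trunc_to_i4(lo) | (trunc_to_i4(hi) << 4))
    return bytes(out)


def unpack_i4(data):
    """Inverse of pack_i4: two sign-extended values per byte."""
    values = []
    for byte in data:
        values.append(sign_extend_i4(byte))
        values.append(sign_extend_i4(byte >> 4))
    return values


packed = pack_i4([-8, 7, -1, 0])
```

Widening to a larger integer type before doing arithmetic mirrors the note that i4 tiles must be converted to a supported integer type for use in operations.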