Release Notes#
1.4.0 (2026-05-26)#
This release highlights compatibility with CTK 13.3 which introduces Hopper(sm_90) GPU support, block-scaled MMA, new float dtypes, pack/unpack operations, atomic store operations, load and store with advanced indexing, tiled view with gapped or overlapped tile access, and support for large arrays and scalars. It also adds support for Python 3.14 including free-threading, star “*” expression, and frozen dataclasses, as well as integration with JAX.
CTK 13.3 features#
Support Hopper (sm_90 family) GPUs.
Add
float8_e8m0fnuandfloat4_e2m1fnrestricted float dtype.Add
ct.mma_scaled()operation for block-scaled matrix multiply-accumulate.Add
ct.pack_to_bytes()operation that flattens a tile and reinterprets its raw bytes as a 1D uint8 tile;ct.unpack_from_bytes()is the inverse ofct.pack_to_bytes().Add
ct.load_advanced_indexing()andct.store_advanced_indexing()for gathering/scattering along one dimension while slicing on other dimensions.Add
atomic_store_addand more atomic methods onTiledViewfor performing element-wise atomic read-modify-write operations on a tiled view at a given tile index.Add support for
Array.tiled_view()withtraversal_stepsfor creating a tiled space with overlapped or spaced tiles.ByTargetnow accepts adefaultvalue that applies to all architectures (e.g.ByTarget(sm_100=8, default=2)), allowing generated TileIR bytecode to be independent of the GPU architecture.Add
ct.astile()for creating a tile from a scalar (yielding a 0-d tile) or a (possibly nested) tuple of scalars whose nesting determines the tile’s shape.Support large arrays (>2B elements):
New
ct.IndexedWithInt64annotation for array kernel parameters whose shape or stride values exceed the range of a 32-bit integer. Arrays without the annotation continue to useint32for shape and stride.New
ct.ScalarInt64annotation that forces a scalar integer kernel parameter to be inferred asint64instead of the defaultint32.
Add
use_fast_accoption toct.mma()to enable fast accumulation mode for FP8 inputs (float8_e4m3fn,float8_e5m2) on Hopper GPUs.Add a new kernel hint
num_worker_warpstoct.kernel.ct.atomic_add()now supportsbfloat16operands on Hopper (sm_90) and newer architectures.Optional
rounding_modeparameter forct.exp()(supportsRoundingMode.FULLandRoundingMode.APPROXfor f32).
Python features#
Add Python 3.14 and Python 3.14t (free-threading) support.
Add support for frozen dataclasses.
Add support for passing a tuple as a starred function argument, e.g.
foo(a, *b, c).Add support for variadic parameters in user-defined functions, e.g.
def foo(*args).
Enhancements#
ct.extract()now raises a compile-timeTileTypeErrorwhen a constant index is out of bounds for the tile grid. Dynamic indices are unaffected.ct.floordiv()and the//operator now support floating-point operands.
Ecosystem#
Add support for JAX via
ct.jax.cutile_call().
1.3.0 (2026-04-20)#
Features#
Add API for ahead-of-time compilation and export via
compilation.export_kernel().
See the Compilation and Export section for more details.Add API for autotuning via
tune.exhaustive_search()and the following helpers:Add API
kernel.replace_hints()to get a new kernel with updated hints.Add API function
compiler_timeout()for temporarily setting the timeout on the tileiras compiler.
See the Autotuning section for more details.
Add API
Array.tiled_view()to create a tiled view of an array with a fixed tile shape and padding mode.
Enhancements#
Add support for specifying
memory_orderandmemory_scopeoncuda.tile.loadandcuda.tile.storeoperations.Improve
print()to handle tuple and nested fstring.
Bug Fixes#
Fix a bug where restricted float dtype with simple reduce and scan did not raise proper
TileTypeError.
ABI Changes#
Change kernel ABI convention to omit parameters annotated with
cuda.tile.Constant.
1.2.0 (2026-03-05)#
CTK 13.2 features#
Support Ampere and Ada (sm80 family) GPUs.
Support
pip install cuda-tile[tileiras]to usetileirasfrom Python environment without system-wide CTK installation.Add
ct.atan2(y, x)operation for computing the arctangent of y/x.Add optional
rounding_modeparameter forct.tanh(), supportingRoundingMode.FULLandRoundingMode.APPROX.Compiling FP8 operations for sm80 family GPUs will raise
TileUnsupportedFeatureError.Setting
opt_level=0onct.kernelis no longer required forct.printf()andct.print().
Features#
Add
ct.static_iterkeyword that enables compile-timeforloops.Add
ct.static_assertkeyword that can be used to assert that a condition is true at compile time.Add
ct.static_evalkeyword that enables compile-time evaluation using the host Python interpreter.Add
ct.scan()for custom scan.Add
ct.isnan().Add
print()andct.print()that supports python-style print and f-strings.Add optional
maskparameter toct.gather()andct.scatter()for custom boolean masking.Operator
+can now be used to concatenate tuples.Support unpacking nested tuples (e.g.,
a, (b, c) = t) and using square brackets for unpacking (e.g.,[a, b] = 1, 2).Add bytecode-to-cubin disk cache to avoid recompilation of unchanged kernels. Controlled by
CUDA_TILE_CACHE_DIRandCUDA_TILE_CACHE_SIZE.
Bug Fixes#
Fix a bug where
nan != nanreturns False.Fix “potentially undefined variable
$retval” error when a helper function returns after awhileloop that contains no early return.Fix the missing column indicator in error messages when the underlined text is only one character wide.
Add a missing check for unpacking a tuple with too many values. For example,
a, b = 1, 2, 3now raises an error instead of silently discarding the extra value.Fix a bug where the promoted dtype of uint16 and uint64 was incorrectly set to uint32.
Enhancements#
Erase the distinction between scalars and zero-dimensional tiles. They are now completely interchangeable.
~xfor const booleanxwill raise a TypeError to prevent inconsistent results compared to~xon a boolean Tile.Add
TileUnsupportedFeatureErrorto the public API.
1.1.0 (2026-01-30)#
Features#
Add support for nested functions and lambdas.
Add support for custom reduction via
ct.reduce().Add
Array.slice(axis, start, stop)to create a view of an array sliced along a single axis. The result shares memory with the original array (no data copy).
Bug Fixes#
Fix reductions with multiple axes specified in non-increasing order.
Fix a bug when pattern matching (FusedMultiplyAdd) attempts to remove a value that is used by the new operation.
Enhancements#
Allow assignments with type annotations. Type annotations are ignored.
Support constructors of built-in numeric types (bool, int, float), e.g.,
float('inf').Lift the ban on recursive helper function calls. Instead, add a limit on recursion depth. Add a new exception class
TileRecursionError, thrown at compile time when the recursion limit is reached during function call inlining.Improve error messages for type mismatches in control flow statements.
Relax type checking rules for variables that are assigned a different type depending on the branch taken: it is now only an error if the variable is used afterwards.
Stricter rules for potentially-undefined variable detection: if a variable is first assigned inside a
forloop, and then used after the loop, it is now an error because the loop may take zero iterations, resulting in a use of an undefined variable.Include a full cuTile traceback in error messages. Improve formatting of code locations; include function names, remove unnecessary characters to reduce line lengths.
Delay the loading of CUDA driver until kernel launch.
Expose the
TileErrorbase class in the public API.Add
ct.abs()for completeness.
1.0.1 (2025-12-18)#
Bug Fixes#
Fix a bug in hash function that resulted in potential performance regression for kernels with many specializations.
Fix a bug where an if statement within a loop can trigger an internal compiler error.
Fix SliceType
__eq__comparison logic.
Enhancements#
Improve error message for
ct.cat().Support
is not Nonecomparison.
1.0.0 (2025-12-02)#
Initial release.