10. Stability

Tile IR provides a set of guarantees regarding portability, stability, and compatibility to ensure predictable behavior across different platforms, toolchains, and hardware targets. These guarantees are documented below.

Definitions:

Stability: An unchanging property of a program or interface.
Portability: A property of a program to be transferred to a different hardware target or toolchain version with the same behavior.
Compatibility: A property of a program to be executed on a different platform or toolchain with the same behavior.
Toolchain: Either the compiler and CTK version used to perform ahead-of-time compilation, or the driver and CTK version used to perform JIT compilation of a Tile IR program.
Pure: A property of an operation indicating that it has no observable side effects: its result depends only on its operands, and executing it does not read from or write to memory or interact with the environment. A pure operation can be freely reordered, duplicated, or eliminated by the compiler when its result is unused, and identical pure operations with the same operands always produce the same result. This matches the semantics of MLIR’s Pure trait.

10.1. Platform & Compatibility Guarantees

Bytecode Stability: The Tile IR bytecode format ensures that programs can be loaded and interpreted by all conforming drivers (see Binary Format).

Program Portability: A program conforming to Tile IR vX.Y is syntactically portable to any platform that advertises support for vX.Y or newer.

Portability does not imply availability of target-specific features on all targets: if a program uses a feature that the selected hardware target does not support, the compiler will either diagnose the incompatibility or apply a lowering that preserves the semantics defined by the specification.
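
The version rule above can be sketched as an ordered comparison. This is only an illustration: the helper names `parse_version` and `is_syntactically_portable` are hypothetical and not part of any Tile IR API, and the sketch assumes versions are written as `vX.Y` and ordered as (major, minor) pairs.

```python
def parse_version(text):
    """Turn a string such as 'v13.2' into the tuple (13, 2).

    Hypothetical helper: the 'vX.Y' spelling follows the prose above.
    """
    major, minor = text.lstrip("v").split(".")
    return int(major), int(minor)


def is_syntactically_portable(program_version, platform_version):
    """Return True if the platform advertises the program's version or newer.

    Assumption: "newer" means a larger (major, minor) pair. This says nothing
    about target-specific features, which the compiler may diagnose or lower.
    """
    return parse_version(platform_version) >= parse_version(program_version)


if __name__ == "__main__":
    print(is_syntactically_portable("v13.1", "v13.2"))
    print(is_syntactically_portable("v13.3", "v13.2"))
```

Comparing integer pairs rather than strings matters once a minor version reaches two digits: as strings, "13.10" would sort before "13.9".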

CUDA Compatibility: Tile IR respects the CUDA platform’s forward and backward minor-version compatibility rules for toolchain and driver integration (see CUDA Minor Version Compatibility Rules).

10.2. Supported Architectures

Tile IR bytecode programs are portable across all supported architectures. A single bytecode file can be compiled to any supported target or JIT-compiled by the driver at load time.

For ahead-of-time compilation, the target architecture is specified using the --gpu-name flag with the compute capability (CC) of a supported NVIDIA GPU architecture (e.g., tileiras --gpu-name sm_80). For JIT compilation, the driver automatically selects the architecture of the target device. The following table lists the architectures supported by Tile IR and the corresponding --gpu-name values.

Supported Architectures
Family	Compute Capability	Example GPUs	Since
Ampere	sm_80	A100, A30	Tile IR 13.2
Ampere	sm_86	A40, RTX 3090	Tile IR 13.2
Ampere	sm_87, sm_88	Jetson Orin	Tile IR 13.2
Ada	sm_89	L40, RTX 4090	Tile IR 13.2
Hopper	sm_90	H100, H200	Tile IR 13.3
Blackwell	sm_100	B200	Tile IR 13.1
Blackwell	sm_120	RTX 5090, RTX PRO 6000	Tile IR 13.1
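
As a convenience, the table can be transcribed into a small lookup, for example to check a --gpu-name value before invoking the compiler. The dictionary and the function `is_target_supported` are hypothetical names used for illustration, and the sketch assumes that a target listed since a given release stays supported in later releases.

```python
# Data transcribed from the Supported Architectures table.
# Maps each --gpu-name value to (family, first Tile IR release listing it).
SUPPORTED_ARCHITECTURES = {
    "sm_80": ("Ampere", (13, 2)),
    "sm_86": ("Ampere", (13, 2)),
    "sm_87": ("Ampere", (13, 2)),
    "sm_88": ("Ampere", (13, 2)),
    "sm_89": ("Ada", (13, 2)),
    "sm_90": ("Hopper", (13, 3)),
    "sm_100": ("Blackwell", (13, 1)),
    "sm_120": ("Blackwell", (13, 1)),
}


def is_target_supported(gpu_name, release):
    """Return True if `gpu_name` is listed for Tile IR `release` (major, minor).

    Assumption: support is never withdrawn once a target has been listed.
    """
    entry = SUPPORTED_ARCHITECTURES.get(gpu_name)
    if entry is None:
        return False  # not in the table at all
    _family, since = entry
    return release >= since


if __name__ == "__main__":
    for name in ("sm_80", "sm_90", "sm_100"):
        print(name, is_target_supported(name, (13, 2)))
```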

10.3. Feature Availability & Emulation

Target-specific Features: Tile IR may introduce new target-specific features (e.g., new datatypes, new operations) over time.

Availability: A feature introduced in vX.Y becomes usable on a hardware target starting with the first platform release that declares support for it.
Fallback: If a program uses a feature unsupported by the selected hardware target, the compiler will either diagnose the incompatibility or apply a lowering (emulation) that preserves semantics as defined by the specification.

Note that certain types have more restricted usage than others. See Element Types for details.

Warning

During the 13.x release cycle, support for existing hardware targets is still being brought up, so new features may appear on older targets. This “cold start” period is an exception; normally, new features will only appear in new targets.

Note

Today the only target-specific features are specific datatypes.

Hardware Support Matrix
Data Type	Ampere	Ada	Hopper	Blackwell
i1, i8, i16, i32, i64	Supported	Supported	Supported	Supported
f16, f32, f64	Supported	Supported	Supported	Supported
bf16, tf32	Supported	Supported	Supported	Supported
f8E4M3FN, f8E5M2	Not Supported	Not Supported	Supported	Supported
f4E2M1FN	Not Supported	Not Supported	Not Supported	Supported
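
A few rows of the matrix can be expressed as a lookup to show how the Fallback rule applies. The names `NATIVE_SUPPORT` and `classify` are hypothetical, and only a subset of data types is included; the sketch does not model which of the two fallback behaviors a real compiler picks.

```python
FAMILIES = ("Ampere", "Ada", "Hopper", "Blackwell")

# Subset of the Hardware Support Matrix: data type -> families marked "Supported".
NATIVE_SUPPORT = {
    "i32": set(FAMILIES),
    "f16": set(FAMILIES),
    "bf16": set(FAMILIES),
    "f8E5M2": {"Hopper", "Blackwell"},
    "f4E2M1FN": {"Blackwell"},
}


def classify(dtype, family):
    """Return 'supported' or 'not supported' for a (data type, family) pair.

    For 'not supported', the Fallback rule applies: the compiler either
    diagnoses the incompatibility or applies a semantics-preserving lowering.
    """
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family}")
    if dtype not in NATIVE_SUPPORT:
        raise KeyError(f"data type not in this sketch: {dtype}")
    return "supported" if family in NATIVE_SUPPORT[dtype] else "not supported"


if __name__ == "__main__":
    print(classify("f8E5M2", "Ada"))
    print(classify("f8E5M2", "Hopper"))
    print(classify("f4E2M1FN", "Hopper"))
```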

Emulation

To maintain portability, Tile IR may emulate operations on hardware targets that lack native support. For example, 16-bit floating-point operations may be emulated using 32-bit instructions if the target does not support them natively.
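
The idea of emulating a 16-bit floating-point operation with 32-bit arithmetic can be sketched in plain Python using the standard `struct` module, whose `e` and `f` formats round to IEEE binary16 and binary32. This illustrates the general technique only; it is not a description of how the Tile IR compiler lowers any operation, and the function names are hypothetical.

```python
import struct


def round_to_f16(x):
    """Round a Python float to the nearest IEEE binary16 value.

    struct raises OverflowError if x is outside the binary16 range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_f32(x):
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def emulated_f16_add(a, b):
    """Add two values as f16 by widening, adding in f32, then narrowing."""
    a16, b16 = round_to_f16(a), round_to_f16(b)  # inputs are f16 values
    wide = round_to_f32(a16 + b16)               # the add, at 32-bit precision
    return round_to_f16(wide)                    # narrow the result to f16


if __name__ == "__main__":
    # 2**-10 is one f16 unit in the last place at 1.0, so it survives the add.
    print(emulated_f16_add(1.0, 2**-10))
    # 2**-12 is below half a unit in the last place, so the sum rounds to 1.0.
    print(emulated_f16_add(1.0, 2**-12))
```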

10.4. Execution & Numerical Guarantees

Execution Determinism: For a fixed toolchain, configuration, and hardware target, compilation and execution are deterministic within a single tile-block thread.

Version Changes: Using a different toolchain version may produce a different program and thus different results; this is expected behavior, not non-determinism.

Numerical Stability: Tile IR does not guarantee bit‑identical numerical results across different toolchain versions, configurations, or targets, except where explicitly documented.

Scope: Stability guarantees are scoped to specific versions and targets.
Updates: Changes are not retroactive; compiling/executing with an earlier toolchain retains the guarantees published for that version.

Floating-point Semantics: Floating‑point operations follow applicable IEEE semantics for the order in which they are actually evaluated.

Transformations: Compiler transformations (e.g., reordering) can change numeric results across versions.
Precision: Operations like MMA may have weaker or no guarantees of bit-identical numerical results unless explicitly documented. See MMA Numerical Stability below for details.
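
The reason reordering matters is that floating-point addition is not associative: each individual operation is correctly rounded, yet two evaluation orders of the same expression can round differently. A minimal sketch with Python floats (IEEE binary64), using hypothetical function names:

```python
def sum_left_to_right(a, b, c):
    """Evaluate (a + b) + c."""
    return (a + b) + c


def sum_right_to_left(a, b, c):
    """Evaluate a + (b + c)."""
    return a + (b + c)


if __name__ == "__main__":
    a, b, c = 1e17, -1e17, 1.0
    # a and b cancel exactly first, so c survives.
    print(sum_left_to_right(a, b, c))   # 1.0
    # b + c rounds back to -1e17 (the spacing of doubles near 1e17 is 16),
    # so c is lost before a is added.
    print(sum_right_to_left(a, b, c))   # 0.0
```

Both results follow IEEE semantics for the order actually evaluated, which is why a compiler transformation that changes the order can change the result.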

MMA Numerical Stability

Matrix-multiply-accumulate operations (cuda_tile.mmaf, cuda_tile.mmaf_scaled, cuda_tile.mmai) are exempt from the bit-identity guarantees that apply to most floating-point operations. The compiler maps them onto the hardware Tensor Cores (TC) whenever it can, and the mapping is shaped by the surrounding program — tile sizes, target architecture, and (for FP8 mmaf on Hopper) the fast_acc attribute. Programs that change any of these inputs should expect numerical results to shift, even when the abstract computation is unchanged.

The following are the most common sources of MMA numerical variation that users encounter:

Tile size affects whether Tensor Cores are used. If a configured tile shape is too small or does not align with a supported TC shape, the compiler may emit a non-TC implementation (typically FFMA on the CUDA cores). Tensor Cores and FFMA accumulate in different orders and at different intermediate precisions, so the results are generally not bit-identical.
Tile size affects how Tensor Cores are invoked. Even when Tensor Cores are selected, multiple shapes and accumulation strategies (e.g., different M×N×K shapes, or different numbers of K splits per output tile) can implement the same abstract MMA. The compiler may choose differently as tile sizes change, which changes the order of partial-sum accumulation and therefore the final value.
Target architecture affects internal accumulation precision. Different generations of Tensor Cores may use different internal accumulation precisions for the same input/output element types. Compiling the same program for a different target can therefore produce different results. For FP8 MMA on Hopper, the fast_acc attribute on cuda_tile.mmaf selects between native Hopper FP8 accumulation (faster, with an intermediate precision wider than F16 but narrower than F32) and periodic promotion to higher precision (slower, more accurate); this choice is reflected directly in the numerical result.
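
The effect of accumulation order described above can be reproduced in miniature. The sketch below computes the same dot product with binary32 accumulation, once as a single sequential pass and once split into two K chunks whose partial sums are combined at the end. It is a plain-Python model with hypothetical names (`f32`, `dot_f32`, `k_splits`), not a model of any actual Tensor Core or compiler strategy.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_f32(a, b, k_splits=1):
    """Dot product with f32 accumulation, split into `k_splits` K chunks.

    Each product and each partial sum is rounded to f32 separately
    (no fused multiply-add is modeled).
    """
    if len(a) != len(b) or len(a) % k_splits != 0:
        raise ValueError("lengths must match and divide evenly by k_splits")
    chunk = len(a) // k_splits
    partials = []
    for s in range(k_splits):
        acc = 0.0
        for i in range(s * chunk, (s + 1) * chunk):
            acc = f32(acc + f32(a[i] * b[i]))
        partials.append(acc)
    total = 0.0
    for p in partials:  # combine the per-chunk partial sums
        total = f32(total + p)
    return total


if __name__ == "__main__":
    a = [1e8, 1.0, -1e8, 1.0]
    b = [1.0, 1.0, 1.0, 1.0]
    print(dot_f32(a, b, k_splits=1))  # 1.0
    print(dot_f32(a, b, k_splits=2))  # 0.0
```

The exact mathematical result is 2.0. Neither order recovers it, and the two orders disagree with each other, because values near 1e8 are spaced 8 apart in binary32 and each added 1.0 is absorbed at a different point.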

For a fixed input program, hardware target, and toolchain version (driver, CTK, and compiler), Tile IR produces the same MMA result on every run. Changing any of those three — modifying the program (including its tile shapes or fast_acc selection), targeting a different architecture, or upgrading the toolchain — may change the numerical result.