Execution Model#
Abstract Machine#
A tile kernel is executed by logical thread blocks organized in a 1D, 2D, or 3D grid.
Each block runs on a subset of a GPU defined by the underlying compiler implementation. Every block executes the body of the kernel: scalar operations run serially on a single thread, while array operations run collectively in parallel across all threads of the block.
Tile programs express block-level parallelism only; individual threads within a block are not exposed.
Explicit synchronization or communication within a block is not permitted, but is allowed between different blocks.
A block is the unit of execution and a tile is the unit of data; the two should not be confused. A single block may operate on multiple tiles with different shapes originating from different global arrays.
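The difference between a block and a tile can be illustrated with a host-side emulation in plain Python. This is only a sketch and not cuTile code: the names `run_grid`, `scale_block`, and `TILE_SIZE` are hypothetical, and the blocks run one after another here, whereas real blocks may run concurrently.

```python
# Plain-Python emulation of a 1D grid; this is not cuTile code.
TILE_SIZE = 4  # elements per tile (illustrative choice)


def scale_block(block_id, src, dst, factor):
    """Body that every block executes: load one tile, scale it, store it."""
    start = block_id * TILE_SIZE
    tile = src[start:start + TILE_SIZE]  # the tile is the unit of data
    dst[start:start + len(tile)] = [factor * x for x in tile]


def run_grid(num_blocks, body, *args):
    """Run the body once per block; the block is the unit of execution."""
    for block_id in range(num_blocks):
        body(block_id, *args)


src = list(range(10))
dst = [0] * len(src)
num_blocks = (len(src) + TILE_SIZE - 1) // TILE_SIZE  # ceiling division
run_grid(num_blocks, scale_block, src, dst, 2)
print(dst)
```

The last block handles a partial tile of two elements, which is why the slice length is taken from the loaded tile rather than from `TILE_SIZE`.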
Execution Spaces#
cuTile code runs on one or more targets. A target is an execution environment defined by its hardware resources and programming model.
The set of targets where a construct can be used is called its execution space. cuTile defines three execution spaces:
Host code — all CPU targets.
SIMT code — all CUDA SIMT targets. (Historically called device code; we avoid that term to prevent ambiguity.)
Tile code — all CUDA tile targets.
Some constructs span multiple execution spaces. For example, cdiv() is usable in both host code and tile code.
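A typical host-side use of such a construct is sizing a grid so that every element is covered by some tile. The sketch below assumes that cdiv() computes ceiling division, as its name suggests; `ceil_div` is a hypothetical pure-Python stand-in, not the library function.

```python
def ceil_div(a: int, b: int) -> int:
    """Hypothetical stand-in for cdiv(): smallest integer >= a / b, for b > 0."""
    if b <= 0:
        raise ValueError("divisor must be positive")
    return -(-a // b)


n_elements = 1000
tile_size = 128
grid = (ceil_div(n_elements, tile_size),)  # 1D grid with one block per tile
print(grid)  # (8,)
```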
A function whose decorator explicitly specifies its execution space is called an annotated function.
Tile Functions#
- class cuda.tile.function(func=None, /, *, host=False, tile=True)#
Tile functions are functions that are usable in tile code.
This decorator indicates what execution spaces a function can be called from. With no arguments, it denotes a tile-only function.
When an unannotated function is called by a tile function, tile shall be added to the unannotated function’s execution space. This process is recursive. No explicit annotation is required.
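The recursive rule can be modeled as a reachability computation over a call graph. The following is a conceptual sketch in plain Python, not the compiler's implementation; `propagate_tile_space`, the graph layout, and the function names are all hypothetical.

```python
def propagate_tile_space(call_graph, annotated):
    """Add 'tile' to every unannotated function reachable from a tile function.

    call_graph: dict mapping a function name to the names it calls.
    annotated:  dict mapping annotated function names to their execution spaces.
    """
    spaces = {name: set(s) for name, s in annotated.items()}
    worklist = [name for name, s in annotated.items() if "tile" in s]
    visited = set()
    while worklist:
        caller = worklist.pop()
        if caller in visited:
            continue
        visited.add(caller)
        for callee in call_graph.get(caller, ()):
            if callee not in annotated:
                # Unannotated callee: it becomes usable in tile code,
                # and its own callees are examined in turn.
                spaces.setdefault(callee, set()).add("tile")
                worklist.append(callee)
    return spaces


call_graph = {
    "tile_fn": ["helper"],
    "helper": ["leaf"],
    "host_fn": ["host_helper"],
}
annotated = {"tile_fn": {"tile"}, "host_fn": {"host"}}
spaces = propagate_tile_space(call_graph, annotated)
print(spaces)
```

Here `helper` and `leaf` pick up the tile execution space without any annotation, while `host_helper` is untouched because it is only reachable from host code.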
The types usable as parameters to a tile function are described in the data model.
Tile Kernels#
- class cuda.tile.kernel(function=None, /, **kwargs)#
A tile kernel is a function executed by each block in a grid.
Functions with this decorator are kernels.
Kernels are the entry points of tile code. Their execution space shall be only tile code; they cannot be called from host code.
Kernels cannot be called directly. Instead, use launch() to queue a kernel for execution over a grid.
The types usable as parameters to a kernel are described in the data model.
- Parameters:
num_ctas – Number of CTAs in a CGA. Must be a power of 2 between 1 and 16, inclusive. Default: None (auto).
occupancy – Expected number of active CTAs per SM, in the range [1, 32]. Default: None (auto).
opt_level – Optimization level, in the range [0, 3]. Default: 3.
num_worker_warps – Number of warps in the CUDA core warp groups in a warp-specialized kernel. The compiler may add warps (e.g., for asynchronous memory transfers) that are not counted here, so this value does not represent the total warp count. Tuning it is worthwhile when a warp-specialized kernel has high register pressure that other approaches cannot resolve; normalization-style kernels with large tiles are the canonical case. Must be either 4 or 8. Default: None (auto). Available since CTK 13.3; otherwise the option is ignored with a warning.
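The documented ranges can be summarized as a small checker. This is a sketch only: `validate_hints` is a hypothetical helper written for illustration, not part of the library, and it says nothing about how the compiler itself validates or reports bad values.

```python
def validate_hints(num_ctas=None, occupancy=None, opt_level=3,
                   num_worker_warps=None):
    """Hypothetical checker mirroring the documented ranges; None means auto."""
    if num_ctas is not None:
        is_power_of_two = num_ctas >= 1 and (num_ctas & (num_ctas - 1)) == 0
        if not (is_power_of_two and num_ctas <= 16):
            raise ValueError("num_ctas must be a power of 2 in [1, 16]")
    if occupancy is not None and not 1 <= occupancy <= 32:
        raise ValueError("occupancy must be in [1, 32]")
    if not 0 <= opt_level <= 3:
        raise ValueError("opt_level must be in [0, 3]")
    if num_worker_warps is not None and num_worker_warps not in (4, 8):
        raise ValueError("num_worker_warps must be 4 or 8")
    return {
        "num_ctas": num_ctas,
        "occupancy": occupancy,
        "opt_level": opt_level,
        "num_worker_warps": num_worker_warps,
    }


hints = validate_hints(num_ctas=4, occupancy=2)
print(hints)
```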
Target-specific values for the compiler options above can be provided using a ByTarget object.
- replace_hints(**hints)#
Return a new kernel with updated compiler hints.
Notes:
Because hints affect compilation, the returned object will have its own JIT cache.
Examples:
import cuda.tile as ct
import torch

torch.cuda.init()
stream = torch.cuda.current_stream()

@ct.kernel(occupancy=2)
def kernel():
    pass

# compile
ct.launch(stream, (1,), kernel, ())
# cache hit
ct.launch(stream, (1,), kernel, ())

new_kernel = kernel.replace_hints(occupancy=4)
# compile with new hints
ct.launch(stream, (1,), new_kernel, ())
# cache hit
ct.launch(stream, (1,), new_kernel, ())

torch.cuda.synchronize()
- cuda.tile.launch(stream, grid, kernel, kernel_args, /)#
Launch a cuTile kernel.
Python Subset#
Tile code supports a subset of the Python language. There is no Python runtime within tile code.
Only Python features explicitly listed in this document are supported. Many features, such as exceptions and coroutines, are not supported today.
Object Model & Lifetimes#
All objects created within tile code are immutable. Any operation that would conceptually modify an object instead creates and returns a new object. Attributes cannot be added to objects dynamically.
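Ordinary Python offers a close analogy in frozen dataclasses. The sketch below is plain host Python, not tile code, and `Point` is a hypothetical type used only to show the pattern: an "update" returns a new object, and attributes cannot be added after construction.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Point:
    x: int
    y: int


p = Point(1, 2)
q = replace(p, x=10)  # "modifying" p yields a new object; p is unchanged

try:
    p.z = 3  # adding an attribute dynamically is rejected
except FrozenInstanceError:
    attribute_added = False
else:
    attribute_added = True

print(p, q, attribute_added)
```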
Global arrays are views of global device memory. The memory can be read and written through a view, but the view object itself is immutable.
The caller of a kernel must ensure that:
Control Flow#
Python control flow statements (if, for, while, etc.) are usable in tile code
and can be arbitrarily nested.
Current limitations#
Tile code imposes additional restrictions on control flow:
step must be strictly positive. Negative-step ranges such as
range(10, 0, -1) are not supported today. Passing a negative step indirectly via a variable may cause undefined behavior.
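A descending loop can usually be rewritten with a positive step by counting upward and deriving the descending index. The sketch below is plain Python that only demonstrates the index arithmetic; the variable names are illustrative, and list operations such as `append` are used here for checking on the host, not as a claim about what tile code supports.

```python
start, stop = 10, 0

visited = []
for k in range(start - stop):  # positive step of 1, counts 0, 1, ..., 9
    i = start - k              # derived index runs 10, 9, ..., 1
    visited.append(i)

print(visited)
```

The derived index `i` takes the same values, in the same order, as `range(10, 0, -1)` would.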
Tile Parallelism#
When a block executes a function that takes tiles as parameters, it may parallelize evaluation across the block’s execution resources.
Constantness#
Constant Expressions & Objects#
Some facilities require parameters whose values are known at compile time. Constant expressions produce constant objects suitable for such parameters. Constant expressions are:
A literal object.
Integer arithmetic expressions where all the operands are literal objects.
A local object or parameter that is assigned from a literal object or constant expression.
A global object that is defined at the time of compilation or launch.
By default, numeric constants are loosely typed: integer constants have infinite precision and floating-point constants are stored in IEEE 754 double precision, until used in a context that requires a specific-width type.
A strictly typed constant is created by calling a dtype constructor,
e.g. ct.int16(5). Combining a strictly typed constant with a loosely typed
constant yields a strictly typed constant:
ct.int16(5) + 2 produces a strictly typed int16 constant 7.
Combining two strictly typed constants also produces a strictly typed constant,
with the regular type promotion rules applied.
For example, ct.int16(5) + ct.int32(7) produces a strictly typed int32
constant 12.
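The two rules above can be mimicked with a tiny model in plain Python. This is a sketch under stated assumptions, not cuTile's implementation: `StrictInt`, `int16`, and `int32` are hypothetical stand-ins, promotion between two signed integer types is assumed to pick the wider one, and overflow is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrictInt:
    """Hypothetical model of a strictly typed signed integer constant."""
    value: int
    bits: int

    def __add__(self, other):
        if isinstance(other, int):
            # A loosely typed operand adopts the strict operand's type.
            return StrictInt(self.value + other, self.bits)
        if isinstance(other, StrictInt):
            # Two strict operands: assume promotion to the wider type.
            return StrictInt(self.value + other.value,
                             max(self.bits, other.bits))
        return NotImplemented


def int16(v):
    return StrictInt(v, 16)


def int32(v):
    return StrictInt(v, 32)


mixed = int16(5) + 2            # strict + loose
promoted = int16(5) + int32(7)  # strict + strict
print(mixed, promoted)
```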
Constant Embedding#
If a kernel parameter is constant embedded, then:
Every use of the parameter behaves as if replaced by its literal value.
A distinct machine representation of the kernel is generated for each unique value of the parameter. Note: The kernel is compiled once per unique value, even if JIT caching is enabled.
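The per-value specialization described above can be modeled with a cache keyed by the constant. The following is a conceptual sketch in plain Python, not the library's JIT: `SpecializingKernel` and `scale_template` are hypothetical names, and "compiling" is simulated by building a closure.

```python
class SpecializingKernel:
    """Hypothetical model: one specialization per unique constant value."""

    def __init__(self, template):
        self.template = template
        self.cache = {}
        self.compile_count = 0

    def __call__(self, x, *, n):
        if n not in self.cache:
            self.compile_count += 1          # simulated compilation
            self.cache[n] = self.template(n)
        return self.cache[n](x)


def scale_template(n):
    """Build a function in which n behaves like a literal value."""
    def specialized(x):
        return x * n
    return specialized


kernel = SpecializingKernel(scale_template)
results = [kernel(3, n=2), kernel(4, n=2), kernel(3, n=5)]
print(results, kernel.compile_count)
```

Two distinct values of `n` produce two specializations, and the repeated call with `n=2` reuses the first one.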
Constant Type Hints#
import cuda.tile as ct

def needs_constant(x: ct.Constant):
    pass

def needs_constant_int(x: ct.Constant[int]):
    pass
- class cuda.tile.ConstantAnnotation#
A typing.Annotated metadata class indicating that an object shall be constant embedded.
If an object of this class is passed as a metadata argument to a typing.Annotated type hint on a parameter, then the parameter shall be constant embedded.
- cuda.tile.Constant#
A type hint indicating that a value shall be constant embedded. It can be used either with (Constant[int]) or without (Constant, meaning a constant of any type) an underlying type hint.
alias of Annotated[T, ConstantAnnotation()]
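How an alias of this shape carries metadata can be seen with the standard `typing` module alone. The sketch below does not use cuTile: `ConstantMarker`, the local `Constant` alias, `constant_params`, and `kernel_like` are hypothetical stand-ins, and the way the library actually inspects annotations is not specified here.

```python
from typing import Annotated, TypeVar, get_type_hints

T = TypeVar("T")


class ConstantMarker:
    """Hypothetical stand-in for a metadata class such as ConstantAnnotation."""


# Generic alias: usable bare (Constant) or with a type (Constant[int]).
Constant = Annotated[T, ConstantMarker()]


def constant_params(func):
    """Return the names of parameters whose hints carry a ConstantMarker."""
    hints = get_type_hints(func, include_extras=True)
    return [
        name
        for name, hint in hints.items()
        if any(isinstance(m, ConstantMarker)
               for m in getattr(hint, "__metadata__", ()))
    ]


def kernel_like(a, n: Constant[int], flag: Constant, scale: float):
    pass


marked = constant_params(kernel_like)
print(marked)
```

Passing `include_extras=True` matters: without it, `get_type_hints` strips the `Annotated` wrapper and the metadata is lost.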
- cuda.tile.IndexedWithInt64#
A type hint indicating that an array uses i64 for shape and stride.
Example:
@ct.kernel
def my_kernel(big: ct.IndexedWithInt64, small):
    # big.shape[i] is i64, small.shape[i] is i32
    ...

alias of Annotated[T, ArrayAnnotation(index_dtype=int64)]
- cuda.tile.ScalarInt64#
A type hint indicating that a scalar integer parameter uses int64.
By default, integer kernel parameters are inferred as int32. Use this annotation to force int64 inference for parameters that need the wider range.
Example:
@ct.kernel
def my_kernel(small_int, large_int: ct.ScalarInt64):
    # small_int is inferred as int32
    # large_int is inferred as int64
    ...

alias of Annotated[T, ScalarAnnotation(dtype=int64)]