Compilation and Export#

When a kernel function marked with @ct.kernel is launched using ct.launch(), it is specialized and compiled just in time (JIT) for the concrete launch arguments. It is also possible to compile a kernel ahead of time (AOT) and export it as a CUDA binary (cubin) file, or as a TileIR bytecode file.

While just-in-time compilation provides the convenience of automatic kernel specialization, ahead-of-time compilation requires the user to precisely describe the arguments for which the kernel is being compiled, including their types and additional constraints (assumptions) imposed on their values.

The main API entry point for ahead-of-time compilation is cuda.tile.compilation.export_kernel():

cuda.tile.compilation.export_kernel( kernel, signatures, output_file, *, gpu_code, output_format, bytecode_version=None, )#

Compile and export a kernel.

Parameters:

kernel (cuda.tile.kernel) – A kernel function to export.
signatures (Sequence[cuda.tile.compilation.KernelSignature]) – A non-empty list of signatures for which to compile the kernel.
output_file (IO | str | bytes | os.PathLike) – Either a filename or a binary file-like object to write the output to. To save the result in memory, you can pass an instance of the io.BytesIO standard library class.
gpu_code (str) – Name of the target GPU for which to compile the kernel (e.g., “sm_100”).
output_format (str) – Set to “cubin” to export a CUDA binary file, or “tileir_bytecode” to export a TileIR bytecode file.
bytecode_version (str | None) – Set to None to automatically detect the latest TileIR bytecode version supported by the compiler (default). Otherwise, it must be a string of the form “major.minor” that specifies the version of the TileIR bytecode to use (e.g., “13.1”).
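As a minimal sketch of the call described above, the helper below exports a kernel into memory by passing an io.BytesIO object as output_file. The helper name export_to_bytes and its default values are hypothetical; the kernel and the signatures are assumed to be created elsewhere. The import is deferred so that defining the helper does not itself require cuda.tile to be installed.

```python
import io


def export_to_bytes(kernel, signatures, gpu_code="sm_100", output_format="cubin"):
    """Hypothetical helper: export `kernel` and return the result as bytes."""
    # Deferred import: cuda.tile is only needed when the helper is called.
    from cuda.tile.compilation import export_kernel

    buffer = io.BytesIO()
    export_kernel(
        kernel,
        signatures,                    # non-empty sequence of KernelSignature
        buffer,                        # file-like object instead of a filename
        gpu_code=gpu_code,
        output_format=output_format,   # "cubin" or "tileir_bytecode"
        # bytecode_version keeps its default (None): latest supported version.
    )
    return buffer.getvalue()
```

Passing a filename instead of the buffer would write the same output to disk.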

Kernel Signatures#

There are two ways to construct a KernelSignature object for use with export_kernel(). The recommended way is to do it explicitly, by instantiating a KernelSignature object and providing a list of manually constructed ParameterConstraint objects.

Alternatively, one may use KernelSignature.from_kernel_args() to obtain the signature that would be used if the kernel were compiled just in time for the given example arguments. While convenient, this approach may introduce unintended assumptions about kernel parameters. For example, if the base address of an example array argument happens to be divisible by 16, an assumption may be made that it will always be so. Launching the exported kernel with an array that doesn’t satisfy this assumption would then result in undefined behavior. It is therefore recommended to limit the use of this approach to testing or prototyping.

class cuda.tile.compilation.KernelSignature(parameters, calling_convention, symbol=None)#

Signature of a compiled kernel.

Parameters:

parameters (Sequence[ParameterConstraint | bool | int | float]) – For each parameter of the kernel’s Python function, a corresponding ParameterConstraint instance. If a parameter is marked with ct.Constant, the corresponding constraint must be a ConstantConstraint or a bool, int or float value that specifies the compile-time constant. Otherwise, it must be either a ScalarConstraint, ArrayConstraint or ListConstraint. Passing a bool, int or float value as a constraint is convenience notation that is equivalent to passing an instance of ConstantConstraint that wraps said value.
calling_convention (CallingConvention) – Calling convention to use.
symbol (str | None) – Symbol name to use for the exported kernel. Set to None to automatically generate it from the Python function name and this signature, using a name mangling algorithm defined by the selected calling convention.

with_mangled_symbol(function_name)#

Returns a copy of self with the symbol attribute replaced with a mangled name.

Parameters: function_name (str) – Function name to use as the base of the mangled symbol.
Returns: KernelSignature
Return type: KernelSignature

with_symbol(symbol)#

Returns a copy of self with the symbol attribute replaced with the given value.

Parameters: symbol (str | None) – The new symbol name.
Returns: KernelSignature
Return type: KernelSignature

static from_kernel_args( kernel, kernel_args, calling_convention, *, symbol=None, )#

Returns the signature that would be used if the kernel were compiled just in time for the given arguments.

Warning

It is recommended to limit the use of this function to testing or prototyping. Deriving a kernel signature from example arguments may create unexpected assumptions on kernel parameters.

For example, if the base address of an example array argument happens to be divisible by 16, an assumption may be made that it will always be so. Launching the exported kernel with an array that doesn’t satisfy this assumption would result in undefined behavior.

Parameters:

kernel (cuda.tile.kernel) – A kernel function decorated with @ct.kernel.
kernel_args (Sequence[Any]) – Tuple of kernel arguments, as they would be passed to ct.launch().
calling_convention (CallingConvention) – Calling convention to use.
symbol (str | None) – Specifies the symbol attribute of the returned signature. If set to None, the returned symbol will be automatically filled using a name mangling algorithm defined by the selected calling convention.

Returns:

KernelSignature

Return type:

KernelSignature

The ParameterConstraint type alias is used as a type hint for a kernel parameter constraint:

cuda.tile.compilation.ParameterConstraint#: alias of ScalarConstraint | ArrayConstraint | ListConstraint | ConstantConstraint

class cuda.tile.compilation.ScalarConstraint(dtype)#

Describes a scalar kernel parameter and associated compile-time assumptions.

Parameters: dtype (DType) – Data type of the scalar.

class cuda.tile.compilation.ArrayConstraint( dtype, ndim, *, index_dtype, stride_lower_bound_incl, alias_groups, may_alias_internally, stride_constant=None, stride_divisible_by=1, shape_divisible_by=1, base_addr_divisible_by=1, )#

Describes an array kernel parameter and associated compile-time assumptions.

Parameters:

dtype (DType) – Data type of the array.
ndim (int) – Number of dimensions of the array, also known as rank.
index_dtype (DType) – Data type used to represent array’s shape, strides and indices. Supported values are ct.int32 and ct.int64. Using int64 enables support for arrays whose shape or stride values exceed the range of a 32-bit integer.
stride_lower_bound_incl (Sequence[int | None] | int | None) – For each dimension of the array, an optional inclusive lower bound for its stride. If all dimensions have the same lower bound, a single number can be passed instead of a sequence. For example, passing 0 specifies that all strides are non-negative.
alias_groups (Sequence[str]) – When set to an empty sequence, specifies that this array may not alias any other parameter. Otherwise, it must be a sequence of arbitrary strings, referred to as “alias groups”. Two parameters are allowed to alias each other if and only if they have an alias group in common.
may_alias_internally (bool) – Indicates whether two distinct in-bounds indices are allowed to point to the same memory location. For example, this can happen if the array has a zero stride. For most arrays produced by major tensor libraries, this can be set to False. Setting this to True may disable certain optimizations of loads and stores to/from this array.
stride_constant (Sequence[int | None] | None) – For each dimension of the array, an optional constant value of its stride. For example, if the array is known to have a C-contiguous layout, the stride of the last dimension can be set to 1, which may enable certain optimizations of loads and stores from/to this array. Can be set to None if none of the dimensions have known strides (this is the default).
stride_divisible_by (Sequence[int] | int) – For each dimension of the array, a factor by which its stride is assumed to be divisible. The value is given in array elements, not bytes. For example, a value of 8 for a float16 array indicates divisibility by 16 bytes, since each element of the array is 2 bytes wide. Value of 1 indicates that no assumption is made regarding the stride divisibility (this is the default).
shape_divisible_by (Sequence[int] | int) – For each dimension of the array, a factor by which its length is assumed to be divisible. The value is given in array elements, not bytes. For example, a value of 8 for a float16 array indicates divisibility by 16 bytes, since each element of the array is 2 bytes wide. Value of 1 indicates that no assumption is made regarding the shape divisibility (this is the default).
base_addr_divisible_by (int) – Factor by which the array’s base address is assumed to be divisible. Value of 1 indicates that no assumption is made regarding the base address divisibility (this is the default).
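To make the meaning of these assumptions concrete, the following sketch checks whether a particular array layout satisfies a set of them. It is an illustration only and not part of cuda.tile: the function satisfies_assumptions is hypothetical, shapes and strides are counted in elements as described above, and the base address is assumed to be an integer byte address.

```python
def satisfies_assumptions(shape, strides, base_addr, *,
                          stride_lower_bound_incl=None,
                          stride_constant=None,
                          stride_divisible_by=1,
                          shape_divisible_by=1,
                          base_addr_divisible_by=1):
    """Hypothetical checker: does a concrete array meet the assumptions?

    `shape` and `strides` are per-dimension tuples counted in elements;
    `base_addr` is assumed to be an integer address in bytes.
    """
    ndim = len(shape)

    def per_dim(value):
        # A single value (or None) applies to every dimension.
        if value is None or isinstance(value, int):
            return [value] * ndim
        return list(value)

    lower = per_dim(stride_lower_bound_incl)
    const = per_dim(stride_constant)
    stride_div = per_dim(stride_divisible_by)
    shape_div = per_dim(shape_divisible_by)

    for d in range(ndim):
        if lower[d] is not None and strides[d] < lower[d]:
            return False  # stride is below its inclusive lower bound
        if const[d] is not None and strides[d] != const[d]:
            return False  # stride differs from the assumed constant
        if strides[d] % stride_div[d] != 0:
            return False  # stride divisibility violated
        if shape[d] % shape_div[d] != 0:
            return False  # shape divisibility violated
    return base_addr % base_addr_divisible_by == 0
```

For instance, a row-major 64 x 128 float16 matrix has strides (128, 1) in elements, so it satisfies stride_divisible_by=(8, 1). A view of the same buffer that starts one element (2 bytes) later no longer satisfies base_addr_divisible_by=16, which is exactly the situation in which launching an exported kernel would be undefined behavior.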

class cuda.tile.compilation.ListConstraint(element, *, alias_groups, elements_may_alias)#

Describes a list kernel parameter and associated compile-time assumptions.

Parameters:

element (ArrayConstraint) – Describes the element of this list. Currently, this must be an ArrayConstraint, since only lists of arrays are supported as kernel arguments.
alias_groups (Sequence[str]) – Describes which other parameters the storage of this list is allowed to alias. Note that this is different from element.alias_groups, which sets aliasing assumptions on the list elements. When set to an empty sequence, specifies that this list may not alias any other parameter. Otherwise, it must be a sequence of arbitrary strings, referred to as “alias groups”. Two parameters are allowed to alias each other if and only if they have an alias group in common.
elements_may_alias (bool) – Specifies whether two distinct elements of this list are allowed to alias each other.
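The alias group rule used by both ArrayConstraint and ListConstraint amounts to a set intersection test. The sketch below spells it out; the function may_alias and the group names in the usage note are made up for illustration and are not part of the library.

```python
def may_alias(alias_groups_a, alias_groups_b):
    """Return True if two parameters are allowed to alias each other.

    Two parameters may alias if and only if they share at least one alias
    group, so an empty sequence means "aliases no other parameter".
    """
    return not set(alias_groups_a).isdisjoint(alias_groups_b)
```

With this rule, a parameter in groups ("io",) may alias one in ("io", "scratch"), while a parameter with an empty alias_groups sequence may not alias anything, including another parameter whose sequence is also empty.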

class cuda.tile.compilation.ConstantConstraint(value)#

Specifies the constant value of a kernel parameter marked with ct.Constant.

Parameters: value (bool | int | float) – The value of the compile-time constant.
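Putting the constraint classes together, an explicit signature might be assembled as in the sketch below. The kernel it targets is hypothetical: it is assumed to be declared as axpy(x, y, alpha, TILE), where x and y are one-dimensional float64 arrays, alpha is a float64 scalar and TILE is annotated with ct.Constant. The symbol name and all constraint values are likewise chosen for illustration. Imports are deferred so that defining the function does not require cuda.tile.

```python
def build_axpy_signature():
    """Hypothetical example: signature for an assumed `axpy(x, y, alpha, TILE)`."""
    # Deferred imports: cuda.tile is only needed when the function is called.
    import cuda.tile as ct
    from cuda.tile.compilation import (
        ArrayConstraint,
        CallingConvention,
        KernelSignature,
        ScalarConstraint,
    )

    def vector():
        # One constraint object per array parameter.
        return ArrayConstraint(
            ct.float64,
            1,                            # ndim
            index_dtype=ct.int32,
            stride_lower_bound_incl=0,    # strides are non-negative
            alias_groups=(),              # may not alias any other parameter
            may_alias_internally=False,
            stride_constant=(1,),         # unit stride in the only dimension
        )

    return KernelSignature(
        # One entry per kernel parameter, in declaration order. The bare 128
        # is shorthand for ConstantConstraint(128) and sets TILE.
        [vector(), vector(), ScalarConstraint(ct.float64), 128],
        CallingConvention.cutile_python_v1(),
        symbol="axpy_f64",                # illustrative symbol name
    )
```

Because every assumption is written out, nothing is inferred from example arguments; for example, no base address divisibility is assumed here since base_addr_divisible_by keeps its default of 1.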

Calling Conventions#

A calling convention defines three aspects of the binary interface provided by an exported kernel:

The binary format and the order of kernel arguments, e.g. as passed to the cuLaunchKernel() CUDA Driver API function.
The set of supported parameter constraints.
The name mangling algorithm used to automatically derive a symbol name from the kernel’s function name and a kernel signature.

The only currently implemented calling convention is cutile_python_v1. According to this convention, the binary kernel arguments are passed in the same order as the kernel parameters are declared in the Python kernel function, except that parameters annotated with ct.Constant are omitted. The following table lists the supported parameter constraints, as well as the corresponding binary format of the kernel arguments:

Constraint Class

Binary Format of Arguments

ScalarConstraint

Passed as a single argument of the corresponding type. For example, if the constraint’s dtype is ct.int32, the corresponding C type is int32_t; ct.float64 corresponds to C’s double and so on.

ArrayConstraint

Passed as 1 + 2n arguments, where n is the number of dimensions (ndim) of the array. The first argument is the device pointer to the base of the array’s data. It is followed by n arguments representing the shape of the array. Finally, the last n arguments represent the strides of the array. The type of shape and stride arguments is determined by the index_dtype field of the constraint:

Constraint’s `index_dtype`	C Type of Shape & Strides
`ct.int32`	`int32_t`
`ct.uint32`	`uint32_t`
`ct.int64`	`int64_t`

ListConstraint with an ArrayConstraint element

Passed as two arguments: a device pointer to the base of the list data and an int32_t denoting the length of the list. The base pointer must point to an 8-byte aligned contiguous buffer in the global GPU memory, consisting of (1 + 2n) * L 64-bit words, where L is the length of the list and n is the ndim of the element array constraint. Each element array of the list is represented by (1 + 2n) words in this buffer. The first word stores a device pointer to the base of the array; the next n signed integers store the shape of the array; the final n signed integers store the strides of the array. Even though 64-bit integers are always used for the shape and the strides, they are truncated to the index_dtype of the element constraint.

ConstantConstraint

Omitted from the launch arguments.
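The argument ordering in the table above can be sketched as a small flattening routine. This is plain Python for illustration, not library code: the function flatten_launch_args and the tagged-tuple input format are hypothetical, and list parameters are left out for brevity.

```python
def flatten_launch_args(params):
    """Hypothetical helper: flatten per-parameter values into launch arguments.

    `params` holds one tuple per kernel parameter, in declaration order:
      ("scalar", value)
      ("array", device_ptr, shape, strides)
      ("constant", value)
    """
    flat = []
    for kind, *payload in params:
        if kind == "scalar":
            flat.append(payload[0])          # a single argument
        elif kind == "array":
            ptr, shape, strides = payload
            if len(shape) != len(strides):
                raise ValueError("shape and strides must have the same length")
            # 1 + 2n arguments: base pointer, n shape values, n stride values.
            flat.extend([ptr, *shape, *strides])
        elif kind == "constant":
            continue                         # omitted from the launch arguments
        else:
            raise ValueError(f"unsupported parameter kind: {kind!r}")
    return flat
```

For a kernel taking a two-dimensional array, a scalar and a constant, this yields six launch arguments: the pointer, two shape values, two strides and the scalar. When passing them to a launch API, each value still has to be converted to the C type given in the table.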

Calling conventions are represented by the CallingConvention class:

class cuda.tile.compilation.CallingConvention#

static cutile_python_v1()#: Returns the cutile_python_v1 calling convention.
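The buffer layout given for ListConstraint in the calling convention table can be illustrated on the host side with the struct module. The sketch below builds the bytes for such a buffer; the function pack_array_list is hypothetical, little-endian byte order is an assumption about the host, and copying the bytes to 8-byte aligned global GPU memory is not shown.

```python
import struct


def pack_array_list(arrays):
    """Hypothetical helper: pack a list of arrays into (1 + 2n) * L words.

    `arrays` is a list of (device_ptr, shape, strides) tuples that all share
    the same number of dimensions n. Little-endian order is assumed.
    """
    out = bytearray()
    ndim = None
    for ptr, shape, strides in arrays:
        if ndim is None:
            ndim = len(shape)
        if len(shape) != ndim or len(strides) != ndim:
            raise ValueError("all arrays must have the same ndim")
        out += struct.pack("<Q", ptr)               # word 0: device pointer
        out += struct.pack(f"<{ndim}q", *shape)     # next n words: shape
        out += struct.pack(f"<{ndim}q", *strides)   # final n words: strides
    return bytes(out)
```

The kernel would then receive a device pointer to a copy of this buffer together with the list length as an int32_t.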