Compilation and Export
When a kernel function marked with @ct.kernel is launched
using ct.launch(), it is specialized and compiled just in time (JIT)
for the concrete launch arguments. It is also possible to compile a kernel ahead of time (AOT)
and export it as a CUDA binary (cubin) file, or as a TileIR bytecode file.
While just-in-time compilation provides the convenience of automatic kernel specialization, ahead-of-time compilation requires the user to precisely describe the arguments for which the kernel is being compiled, including their types and additional constraints (assumptions) imposed on their values.
The main API entry point for ahead-of-time compilation
is cuda.tile.compilation.export_kernel():
- cuda.tile.compilation.export_kernel(kernel, signatures, output_file, *, gpu_code, output_format, bytecode_version=None)
Compile and export a kernel.
- Parameters:
kernel (cuda.tile.kernel) – A kernel function to export.
signatures (Sequence[cuda.tile.compilation.KernelSignature]) – A non-empty list of signatures for which to compile the kernel.
output_file (IO | str | bytes | os.PathLike) – Either a filename or a binary file-like object to write the output to. To save the result in memory, you can pass an instance of the io.BytesIO standard library class.
gpu_code (str) – Name of the target GPU for which to compile the kernel (e.g., “sm_100”).
output_format (str) – Set to “cubin” to export a CUDA binary file, or “tileir_bytecode” to export a TileIR bytecode file.
bytecode_version (str | None) – Set to None to automatically detect the latest TileIR bytecode version supported by the compiler (default). Otherwise, it must be a string of the form “major.minor” that specifies the version of the TileIR bytecode to use (e.g., “13.1”).
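As an illustration of the output_file parameter, the following is a minimal sketch of exporting into memory with io.BytesIO. The helper name export_to_memory and its export_fn parameter are hypothetical and not part of cuda.tile. The export_fn hook exists only so that the sketch can be exercised without a GPU; by default it falls back to export_kernel(), which is imported lazily.

```python
import io


def export_to_memory(kernel, signatures, *, gpu_code, output_format="cubin",
                     export_fn=None):
    """Export a kernel and return the result as bytes (hypothetical helper)."""
    if export_fn is None:
        # Imported lazily so this sketch can be loaded without cuTile installed.
        from cuda.tile.compilation import export_kernel as export_fn
    buffer = io.BytesIO()
    # gpu_code and output_format are keyword-only, as in the signature above.
    export_fn(kernel, signatures, buffer,
              gpu_code=gpu_code, output_format=output_format)
    return buffer.getvalue()
```

Assuming a kernel and a non-empty list of signatures have been prepared as described in this section, calling export_to_memory(kernel, signatures, gpu_code="sm_100") would return the exported binary as a bytes object instead of writing a file.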
Kernel Signatures
There are two ways to construct a KernelSignature object for use with
export_kernel(). The recommended way is to do it explicitly, by instantiating
a KernelSignature object and providing a list of manually constructed
ParameterConstraint objects.
Alternatively, one may use KernelSignature.from_kernel_args() to obtain the signature
that would be used if the kernel were compiled just in time for the given example arguments.
While convenient, this approach may create undesired assumptions on kernel parameters. For example,
if the base address of an example array argument happens to be divisible by 16, an assumption may
be made that it will always be so. Launching the exported kernel with an array that doesn’t
satisfy this assumption would then result in undefined behavior. It is therefore recommended to limit
the use of this approach to testing or prototyping.
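The hazard described above can be made concrete with plain address arithmetic. The sketch below is not part of cuda.tile: the function name and the example addresses are hypothetical, and it only shows how a view into an array can break a divisibility assumption that held for the example argument.

```python
def meets_base_addr_assumption(base_addr: int, divisible_by: int) -> bool:
    """Return True if base_addr satisfies an assumed divisibility (in bytes)."""
    if divisible_by < 1:
        raise ValueError("divisible_by must be a positive integer")
    return base_addr % divisible_by == 0


# Hypothetical device address of the example array used to derive a signature.
example_addr = 0x7F0000000000
# A view that skips the first 4-byte element starts 4 bytes later.
view_addr = example_addr + 1 * 4
```

Here the example address is divisible by 16, but the view's address is only divisible by 4. If a signature derived from the example assumed divisibility by 16, launching the exported kernel on the view would violate that assumption.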
- class cuda.tile.compilation.KernelSignature(parameters, calling_convention, symbol=None)
Signature of a compiled kernel.
- Parameters:
parameters (Sequence[ParameterConstraint | bool | int | float]) – For each parameter of the kernel’s Python function, a corresponding ParameterConstraint instance. If a parameter is marked with ct.Constant, the corresponding constraint must be a ConstantConstraint or a bool, int or float value that specifies the compile-time constant. Otherwise, it must be either a ScalarConstraint, ArrayConstraint or ListConstraint. Passing a bool, int or float value as a constraint is convenience notation that is equivalent to passing an instance of ConstantConstraint that wraps said value.
calling_convention (CallingConvention) – Calling convention to use.
symbol (str | None) – Symbol name to use for the exported kernel. Set to None to automatically generate it from the Python function name and this signature, using a name mangling algorithm defined by the selected calling convention.
- with_mangled_symbol(function_name)
Returns a copy of self with the symbol attribute replaced with a mangled name.
- Parameters:
function_name (str) – Function name to use as the base of the mangled symbol.
- Return type:
KernelSignature
- with_symbol(symbol)
Returns a copy of self with the symbol attribute replaced with the given value.
- Parameters:
symbol (str | None) – The new symbol name.
- Return type:
KernelSignature
- static from_kernel_args(kernel, kernel_args, calling_convention, *, symbol=None)
Returns the signature that would be used if the kernel were compiled just in time for the given arguments.
Warning
It is recommended to limit the use of this function to testing or prototyping. Deriving a kernel signature from example arguments may create unexpected assumptions on kernel parameters.
For example, if the base address of an example array argument happens to be divisible by 16, an assumption may be made that it will always be so. Launching the exported kernel with an array that doesn’t satisfy this assumption would result in undefined behavior.
- Parameters:
kernel (cuda.tile.kernel) – A kernel function decorated with @ct.kernel.
kernel_args (Sequence[Any]) – Tuple of kernel arguments, as if they were passed to ct.launch().
calling_convention (CallingConvention) – Calling convention to use.
symbol (str | None) – Specifies the symbol attribute of the returned signature. If set to None, the returned symbol will be automatically filled using a name mangling algorithm defined by the selected calling convention.
- Return type:
KernelSignature
The ParameterConstraint type alias is used as a type hint for a kernel parameter
constraint:
- cuda.tile.compilation.ParameterConstraint
alias of ScalarConstraint | ArrayConstraint | ListConstraint | ConstantConstraint
- class cuda.tile.compilation.ScalarConstraint(dtype)
Describes a scalar kernel parameter and associated compile-time assumptions.
- Parameters:
dtype (DType) – Data type of the scalar.
- class cuda.tile.compilation.ArrayConstraint(dtype, ndim, *, index_dtype, stride_lower_bound_incl, alias_groups, may_alias_internally, stride_constant=None, stride_divisible_by=1, shape_divisible_by=1, base_addr_divisible_by=1)
Describes an array kernel parameter and associated compile-time assumptions.
- Parameters:
dtype (DType) – Data type of the array.
ndim (int) – Number of dimensions of the array, also known as rank.
index_dtype (DType) – Data type used to represent array’s shape, strides and indices. Currently, only ct.int32 is supported.
stride_lower_bound_incl (Sequence[int | None] | int | None) – For each dimension of the array, an optional inclusive lower bound for its stride. If all dimensions have the same lower bound, a single number can be passed instead of a sequence. For example, passing 0 specifies that all strides are non-negative.
alias_groups (Sequence[str]) – When set to an empty sequence, specifies that this array may not alias any other parameter. Otherwise, it must be a sequence of arbitrary strings, referred to as “alias groups”. Two parameters are allowed to alias each other if and only if they have an alias group in common.
may_alias_internally (bool) – Indicates whether two distinct in-bounds indices are allowed to point to the same memory location. For example, this can happen if the array has a zero stride. For most arrays produced by major tensor libraries, this can be assumed to be false. Setting this to True may disable certain optimizations of loads and stores to/from this array.
stride_constant (Sequence[int | None] | None) – For each dimension of the array, an optional constant value of its stride. For example, if the array is known to have a C-contiguous layout, the stride of the last dimension can be set to 1, which may enable certain optimizations of loads and stores from/to this array. Can be set to None if none of the dimensions have known strides (this is the default).
stride_divisible_by (Sequence[int] | int) – For each dimension of the array, a factor by which its stride is assumed to be divisible. The value is given in array elements, not bytes. For example, a value of 8 for a float16 array indicates divisibility by 16 bytes, since each element of the array is 2 bytes wide. Value of 1 indicates that no assumption is made regarding the stride divisibility (this is the default).
shape_divisible_by (Sequence[int] | int) – For each dimension of the array, a factor by which its length is assumed to be divisible. The value is given in array elements, not bytes. For example, a value of 8 for a float16 array indicates divisibility by 16 bytes, since each element of the array is 2 bytes wide. Value of 1 indicates that no assumption is made regarding the shape divisibility (this is the default).
base_addr_divisible_by (int) – Factor by which the array’s base address is assumed to be divisible. Value of 1 indicates that no assumption is made regarding the base address divisibility (this is the default).
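To make the stride-related parameters more tangible, here is a small pure-Python sketch that works only with shapes and strides expressed in array elements. None of these helpers are part of cuda.tile; their names are hypothetical. The sketch shows the strides of a C-contiguous layout (where the last stride is 1), the conversion of an element-based divisibility factor to bytes, and a brute-force check of the property that may_alias_internally describes.

```python
from itertools import product


def c_contiguous_strides(shape):
    """Strides, in elements, of a C-contiguous array with the given shape."""
    strides = []
    step = 1
    for length in reversed(shape):
        strides.append(step)
        step *= length
    return tuple(reversed(strides))


def divisibility_in_bytes(divisible_by_elements, itemsize):
    """Convert a divisibility factor given in elements to bytes."""
    return divisible_by_elements * itemsize


def has_internal_aliasing(shape, strides):
    """True if two distinct in-bounds indices map to the same element offset.

    Brute force over all indices, so only suitable for small shapes.
    """
    seen = set()
    for index in product(*(range(n) for n in shape)):
        offset = sum(i * s for i, s in zip(index, strides))
        if offset in seen:
            return True
        seen.add(offset)
    return False
```

For a shape of (4, 8), the C-contiguous strides are (8, 1), which has no internal aliasing. Replacing the first stride with 0, as happens when a row is broadcast, makes distinct indices overlap. A factor of 8 elements for a 2-byte element type corresponds to 16 bytes.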
- class cuda.tile.compilation.ListConstraint(element, *, alias_groups, elements_may_alias)
Describes a list kernel parameter and associated compile-time assumptions.
- Parameters:
element (ArrayConstraint) – Describes the element of this list. Currently, this must be an ArrayConstraint, since only lists of arrays are supported as kernel arguments.
alias_groups (Sequence[str]) – Describes which other parameters the storage of this list is allowed to alias. Note that this is different from element.alias_groups, which sets aliasing assumptions on the list elements. When set to an empty sequence, specifies that this list may not alias any other parameter. Otherwise, it must be a sequence of arbitrary strings, referred to as “alias groups”. Two parameters are allowed to alias each other if and only if they have an alias group in common.
elements_may_alias (bool) – Specifies whether two distinct elements of this list are allowed to alias each other.
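The alias group rule stated for both ArrayConstraint and ListConstraint amounts to a set intersection test. The following sketch is an illustration only; the function name is hypothetical and the group names are arbitrary strings chosen for the example.

```python
def may_alias(groups_a, groups_b):
    """Two parameters may alias iff they share at least one alias group."""
    return not set(groups_a).isdisjoint(groups_b)
```

With this rule, a parameter with alias groups ["io"] may alias one with ["io", "scratch"], while a parameter with an empty sequence may not alias anything.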
- class cuda.tile.compilation.ConstantConstraint(value)
Specifies the constant value of a kernel parameter marked with ct.Constant.
- Parameters:
value (bool | int | float) – The value of the compile-time constant.
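The convenience notation mentioned for KernelSignature, where a bare bool, int or float stands for a ConstantConstraint wrapping that value, can be pictured as a normalization step. The sketch below uses a stand-in class so that it runs without cuTile installed. Both ConstantStandIn and normalize_constraint are hypothetical names, and how the library performs this step internally is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstantStandIn:
    """Stand-in for ConstantConstraint (hypothetical, for illustration)."""
    value: object


def normalize_constraint(constraint):
    """Wrap bare bool/int/float values; pass other constraints through."""
    # bool is listed explicitly for clarity, although it is a subclass of int.
    if isinstance(constraint, (bool, int, float)):
        return ConstantStandIn(constraint)
    return constraint
```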
Calling Conventions
A calling convention defines three aspects of the binary interface provided by an exported kernel:
- The binary format and the order of kernel arguments, e.g. as passed to the cuLaunchKernel() CUDA Driver API function.
- The set of supported parameter constraints.
- The name mangling algorithm used to automatically derive a symbol name from the kernel’s function name and a kernel signature.
The only currently implemented calling convention is cutile_python_v1.
According to this convention, the binary kernel arguments are passed in the same order
as the kernel parameters are declared in the Python kernel function, except that parameters
annotated with ct.Constant are omitted. The following table
lists the supported parameter constraints, as well as the corresponding binary format
of the kernel arguments:
| Constraint Class | Binary Format of Arguments |
|---|---|
| ScalarConstraint | Passed as a single argument of the corresponding type. For example, if the constraint’s dtype is … |
| ArrayConstraint | Passed as 1 + 2n arguments, where n is the number of dimensions (ndim) of the array. The first argument is the device pointer to the base of the array’s data. It is followed by n arguments representing the shape of the array. Finally, the last n arguments represent the strides of the array. The type of shape and stride arguments is determined by the … |
| ListConstraint | Passed as two arguments: a device pointer to the base of the list data and an … |
| ConstantConstraint | Omitted from the launch arguments. |
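The argument ordering described above can be sketched in plain Python. This models only the order and count of arguments, not their binary encoding, and it leaves out list parameters. The ArrayArg and ConstantArg classes and the flatten_launch_args function are hypothetical stand-ins introduced for this illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ArrayArg:
    """Hypothetical stand-in for an array argument."""
    base_ptr: int
    shape: Sequence[int]
    strides: Sequence[int]


@dataclass
class ConstantArg:
    """Hypothetical stand-in for a parameter annotated with ct.Constant."""
    value: object


def flatten_launch_args(args):
    """Flatten arguments in declaration order, following the rules above."""
    flat = []
    for arg in args:
        if isinstance(arg, ConstantArg):
            continue  # constants are omitted from the launch arguments
        if isinstance(arg, ArrayArg):
            if len(arg.shape) != len(arg.strides):
                raise ValueError("shape and strides must have the same length")
            flat.append(arg.base_ptr)  # 1 pointer
            flat.extend(arg.shape)     # n shape values
            flat.extend(arg.strides)   # n stride values
        else:
            flat.append(arg)           # scalar: a single argument
    return flat
```

For a two-dimensional array followed by a constant and a scalar, the flattened list has 1 + 2 * 2 entries for the array and one for the scalar, and the constant does not appear.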
Calling conventions are represented by the CallingConvention class: