Introduction#
Overview#
CuTe DSL is a Python-based domain-specific language (DSL) designed for dynamic compilation of high-performance GPU kernels. It evolved from the C++ CUTLASS library and is now available as a decorator-based DSL.
Its primary goals are:
- Zero-cost abstraction, achieved through a hybrid DSL approach.
- Consistency with CuTe C++, allowing users to express GPU kernels with full control of the hardware.
- JIT compilation for both host and GPU execution.
- DLPack integration, enabling seamless interop with frameworks (e.g., PyTorch, JAX).
- JIT caching, so that repeated calls to the same function benefit from cached IR modules.
- Native types and type inference to reduce boilerplate and improve performance.
- Optional lower-level control, offering direct access to GPU backends or specialized IR dialects.
Decorators#
CuTe DSL provides two main Python decorators for generating optimized code via dynamic compilation:
- `@jit` — Host-side JIT-compiled functions
- `@kernel` — GPU kernel functions
Both decorators can optionally use a preprocessor that automatically expands Python control flow (loops, conditionals) into operations consumable by the underlying IR.
@jit#
Declares JIT-compiled functions that can be invoked from Python or from other CuTe DSL functions.
Decorator Parameters:
- `preprocessor`:
  - `True` (default) — Automatically translates Python flow control (e.g., loops, if-statements) into IR operations.
  - `False` — No automatic expansion; Python flow control must be handled manually or avoided.
Call-site Parameters:
- `no_cache`:
  - `True` — Disables JIT caching, forcing a fresh compilation on each call.
  - `False` (default) — Enables caching for faster subsequent calls.
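As a sketch of how these pieces fit together (assuming the `cutlass.cute` import layout used by the CUTLASS Python examples; a CUDA-capable GPU and the `cutlass` package are required to actually run it):

```python
import cutlass
import cutlass.cute as cute

# With preprocessor=True (the default), the decorator expands the
# Python `for` loop below into IR operations automatically.
@cute.jit
def accumulate(n: cutlass.Constexpr):
    total = cutlass.Int32(0)
    for i in range(n):
        total = total + i
    cute.printf("sum = {}", total)

accumulate(10)                  # first call compiles and caches the IR module
accumulate(10, no_cache=True)   # call-site no_cache forces a fresh compilation
```

The second call bypasses the JIT cache, which can be useful when benchmarking compilation time itself.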
@kernel#
Defines GPU kernel functions, compiled as specialized GPU symbols through dynamic compilation.
Decorator Parameters:
- `preprocessor`:
  - `True` (default) — Automatically expands Python loops/ifs into GPU-compatible IR operations.
  - `False` — Expects manual or simplified kernel implementations.
Kernel Launch Parameters:
- `grid` — Specifies the grid size as a list of integers.
- `block` — Specifies the block size as a list of integers.
- `cluster` — Specifies the cluster size as a list of integers.
- `smem` — Specifies the size of shared memory in bytes.
  - `None` (default) — Automatically calculates the kernel’s shared memory usage via `utils.SmemAllocator`. Recommended unless manual control is required.
  - `int` — Manually specifies the size of shared memory in bytes.
Additional Kernel Launch Parameters:
- `fallback_cluster` — Specifies the minimum-guaranteed cluster size. When set, `cluster` becomes the preferred size, enabling graceful degradation when hardware cannot satisfy the preferred dimensions.
  - `None` (default) — No fallback; `cluster` is used directly.
  - `list[int]` — Three-element list `[x, y, z]`.
- `max_number_threads` — Specifies the maximum thread count per block (`maxntid`).
  - `[0, 0, 0]` (default) — Auto-generate `reqntid` from `block`.
  - `list[int]` — Three-element list `[x, y, z]`.
- `min_blocks_per_mp` — Specifies the minimum blocks per multiprocessor (`minctasm`).
  - `0` (default) — No minimum occupancy hint.
  - `int` — Minimum number of blocks per multiprocessor.
- `use_pdl` — Enables Programmatic Dependent Launch (PDL) to overlap dependent kernel launches in the same stream.
  - `False` (default) — PDL disabled.
  - `True` — PDL enabled.
- `cooperative` — Enables cooperative kernel launch; all thread blocks launch cooperatively with grid-wide synchronization support.
  - `False` (default) — Standard kernel launch.
  - `True` — Cooperative kernel launch.
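A minimal launch sketch, assuming the `cutlass.cute` module layout and the `cute.arch` helpers used in the CUTLASS Python examples (a CUDA-capable GPU is required to run it):

```python
import cutlass
import cutlass.cute as cute

@cute.kernel
def hello_kernel():
    # thread_idx() is assumed to return an (x, y, z) tuple of thread indices
    tidx, _, _ = cute.arch.thread_idx()
    if tidx == 0:
        cute.printf("hello from a thread block")

@cute.jit
def launch():
    # grid/block are lists of integers; smem is left at its default (None),
    # so shared memory is sized automatically via utils.SmemAllocator.
    hello_kernel().launch(
        grid=[2, 1, 1],
        block=[128, 1, 1],
    )

cutlass.cuda.initialize_cuda_context()  # assumed helper to set up a CUDA context
launch()
```

The optional parameters above (`fallback_cluster`, `use_pdl`, `cooperative`, etc.) would be passed as additional keyword arguments to `launch(...)`.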
Calling Conventions#
| Caller | Callee | Allowed | Compilation/Runtime |
|---|---|---|---|
| Python function | `@jit` function | ✅ | DSL runtime |
| Python function | `@kernel` function | ❌ | N/A (error raised) |
| `@jit` function | `@jit` function | ✅ | Compile-time call, inlined |
| `@jit` function | Python function | ✅ | Compile-time call, inlined |
| `@jit` function | `@kernel` function | ✅ | Dynamic call via GPU driver or runtime |
| `@kernel` function | `@jit` function | ✅ | Compile-time call, inlined |
| `@kernel` function | Python function | ✅ | Compile-time call, inlined |
| `@kernel` function | `@kernel` function | ❌ | N/A (error raised) |
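Several of these conventions can be seen in one short sketch (assuming the `cutlass.cute` imports from the CUTLASS Python examples; names here are illustrative):

```python
import cutlass
import cutlass.cute as cute

def double(x):       # plain Python function
    return x * 2

@cute.kernel
def device_kernel():
    pass             # device code would go here

@cute.jit
def host_entry():
    n = double(64)                 # @jit -> Python function: compile-time call, inlined
    device_kernel().launch(        # @jit -> @kernel: dynamic call via the GPU driver
        grid=[1, 1, 1],
        block=[n, 1, 1],
    )

host_entry()       # Python -> @jit: executed through the DSL runtime
# device_kernel()  # Python -> @kernel directly is not allowed (error raised)
```

Note the two disallowed rows: a kernel must always be reached through a `@jit` (or another compiled) context, and kernels cannot launch other kernels.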