Limitations#

Overview#

CuTe DSL is an embedded domain-specific language within Python. It utilizes a subset of Python’s syntax to provide a streamlined programming experience. It is important to understand that CuTe DSL does NOT implement the complete Python language semantics in its JIT compilation process.

This section documents the current limitations of the CuTe DSL. While some of these limitations may be addressed in future releases, developers should be aware of them when building applications with the DSL.

Notable unsupported features#

  • GeForce RTX 50 Series support

  • RS WGMMA (The input matrix A comes from register and the input matrix B comes from shared memory)

  • Programmatic Dependent Launch (PDL)

  • narrow-precision data type support, including related tensor core instructions

  • convolutions

  • full support for ahead of time compilation

  • preferred clusters

  • CLC-based tile schedulers

  • EVT support

  • Windows support

Programming Model#

Python Native Data Types

CuTe DSL supports Python data structures when used for “meta-programming,” but these structures cannot be treated as dynamic values modifiable at runtime. For instance, lists and dictionaries can be used to configure kernel parameters during compilation or serve as containers for dynamic values, but their structure and organization cannot be altered during kernel execution.

  • Static Values:
    • Evaluated during JIT compilation phase

    • Immutable after compilation completes

    • Most Python native types (lists, tuples, dictionaries) are processed as static values

    • Primarily utilized for “meta-programming” and configuration purposes

    • Example: Lists can contain dynamic values but their structure cannot be modified during kernel execution

  • Dynamic Values:
    • Evaluated during runtime execution

    • Modifiable during execution of JIT-compiled functions

    • Only a specific subset of Python types is supported as dynamic values

    • Primitive types are automatically converted when passed as function arguments:
      • int → Int32 (may be updated to Int64 in future releases)

      • bool → Bool

      • float → Float32 (may be updated to Float64 in future releases)

The JIT compiler processes Python native types analogously to C++ template parameters. The compiled code cannot manipulate dynamic values of composite types such as lists, tuples, or dictionaries.
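
The template-parameter analogy can be made concrete with a toy tracer written in plain Python. This is a simplified model under stated assumptions, not how CuTe DSL is implemented: `Sym`, `trace_if`, and `trace_loop` are hypothetical names, and the only behavior modeled is that the body of a dynamic `if` or loop is executed once while the function is being traced.

```python
class Sym:
    """Hypothetical stand-in for a dynamic value: no concrete value at trace time."""

    def __init__(self, expr):
        self.expr = expr

    def __repr__(self):
        return f"Sym({self.expr})"


def trace_if(cond, body):
    """Model a dynamic `if`: the body is traced once, whatever `cond` is at run time."""
    body()


def trace_loop(trip_count, body):
    """Model a dynamic loop: the body is traced once, not `trip_count` times."""
    body()


def trace_foo():
    i = Sym("i")
    xs = [Sym("a"), Sym("b")]  # static list holding two dynamic values

    # Runs during tracing, so the list grows regardless of the run-time value of `i`.
    trace_if(Sym(f"{i.expr} == 0"), lambda: xs.append(Sym("3.0")))

    # Runs once during tracing, so the list grows by one element, not ten.
    trace_loop(10, lambda: xs.append(Sym("1.0")))
    return xs


traced = trace_foo()
print(len(traced))  # 4: two initial elements plus one append from each body
```

In this model the list is ordinary Python state that exists only while tracing, which is why its length is fixed by the time the traced code runs.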

For example, the following code does not behave like an ordinary Python program when it runs inside a JIT function:

@cute.jit
def foo(a: Float32, b: Float32, i: Int32, res: cute.Tensor):
    xs = [a, b]
    # indexing list with dynamic index is not supported in CuTe DSL:
    res[0] = xs[i]

    if i == 0:
        # This always appends Float32(3.0) to the list, regardless
        # of the runtime value of `i`
        xs.append(Float32(3.0))

    for i in range_dynamic(10):
        # This appends only one element to the list at compile time,
        # because the loop is not unrolled at compile time
        xs.append(Float32(1.0))

Python Function

The DSL currently does not implement support for return values from Python functions, although this capability is planned for future releases.

Example:

@cute.jit
def foo():
    return 1  # Currently unsupported in CuTe DSL

Expression or Statement with Dependent Type

CuTe DSL implements static typing and does not support dependent types. The type of each expression must be determinable during compile time, in contrast to standard Python which implements dynamic typing.

Example illustrating functionality in Python that is not supported in the DSL:

# Valid in standard Python, but unsupported in CuTe DSL
max(int(1), float(2.0))  # => 2.0 : float
max(int(3), float(2.0))  # => 3   : int

In CuTe DSL, types are promoted. For example:

@cute.jit
def foo(a: Int32, b: Float32, res: cute.Tensor):
    res[0] = max(a, b)  # Type is automatically promoted to Float32
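
The difference is easy to observe in plain Python. The sketch below first shows the value-dependent result type of the built-in `max`, then a hypothetical helper `promoted_max` that converts both operands to `float` before comparing, which mirrors the promotion described above. It runs in standard Python only and does not use any CuTe DSL API.

```python
def promoted_max(a, b):
    """Hypothetical helper: convert both operands to float so the result type is fixed."""
    return max(float(a), float(b))


# Built-in max returns one of its arguments unchanged, so the result type follows the values.
print(type(max(1, 2.0)).__name__)  # float
print(type(max(3, 2.0)).__name__)  # int

# With explicit conversion the result type is float for every input.
print(type(promoted_max(1, 2.0)).__name__)  # float
print(type(promoted_max(3, 2.0)).__name__)  # float
```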

The following code, which uses an inline if-else expression with dependent types, is not supported in CuTe DSL:

@cute.jit
def foo(cond: Boolean, a: Int32, b: Float32, res: cute.Tensor):
    res[0] = a if cond else b

Control Flow

During Abstract Syntax Tree (AST) processing, the DSL transforms Python control flow statements (if, for, while) into structured control flow in MLIR, which is subject to the same constraints as dependent types. For instance, changing the type of a variable in a loop body is not allowed.

  • Variables must be defined prior to the control flow statement

  • Type consistency must be maintained throughout the control flow statement

  • Early exit or return from if-else statements is not supported
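
The type-consistency rule can be modeled in plain Python. The sketch below is a toy check, not the DSL's implementation: `run_typed_loop` is a hypothetical helper that rejects a loop body whose result type differs from the type the loop-carried variable had on entry.

```python
def run_typed_loop(init, body, trip_count):
    """Run `body` `trip_count` times, requiring the carried value to keep its initial type."""
    value = init
    for i in range(trip_count):
        new_value = body(i, value)
        if type(new_value) is not type(init):
            raise TypeError(
                f"loop-carried value changed type from {type(init).__name__} "
                f"to {type(new_value).__name__}"
            )
        value = new_value
    return value


# Accepted: the variable is defined before the loop and stays a float.
print(run_typed_loop(1.0, lambda i, a: a + 2.0, 3))  # 7.0

# Rejected: the variable starts as an int and the body produces a float.
try:
    run_typed_loop(1, lambda i, a: 2.0, 10)
except TypeError as err:
    print(err)
```

Under this model, the usual fix is to give the variable its final type before the loop begins.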

Example illustrating functionality in Python that is not supported in the DSL:

@cute.jit
def foo():
    a = Int32(1)
    for i in range_dynamic(10):
        a = Float32(2)  # Changing type inside loop-body is not allowed in the DSL

Built-in Operators

The DSL transforms built-in operators such as and, or, max, and min into MLIR operations. They are subject to the same constraints as dependent types. For instance, a and b requires a and b to be of the same type.

Comparisons such as == on a sequence of dynamic values are known to not produce the expected result at runtime.
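
One way such a comparison can go wrong is visible in a plain-Python toy model. This is an illustration under assumptions, not a description of the DSL's internals: `Sym` is a hypothetical dynamic value whose `==` returns another symbolic value instead of a Python `bool`. Python's built-in list comparison then takes the truthiness of that symbolic result while the code is being traced, so the outcome does not depend on run-time values.

```python
class Sym:
    """Hypothetical dynamic value; comparing two of them yields a new symbolic value."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        return Sym(f"({self.expr} == {other.expr})")


a, b, c, d = Sym("a"), Sym("b"), Sym("c"), Sym("d")

# list.__eq__ asks whether each element comparison is truthy. A Sym object is
# truthy by default, so the lists compare equal at trace time.
print([a, b] == [c, d])  # True

# Comparing element by element keeps each result symbolic.
elementwise = [x == y for x, y in zip([a, b], [c, d])]
print([e.expr for e in elementwise])  # ['(a == c)', '(b == d)']
```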

Object Oriented Programming

The DSL is implemented on top of Python and supports Python’s object-oriented programming (OOP) features for meta-programming at compile-time.

However, as with other composite data types, the DSL provides limited support for OOP when objects contain dynamic values. It is strongly recommended to avoid passing dynamic values between member methods through class state in your code.

The following example illustrates functionality in Python that is not supported in the DSL without implementing the DynamicExpression protocol:

class Foo:
    def __init__(self, a: Int32):
        self.a = a

    def set_a(self, i: Int32):
        self.a = i

    def get_a(self):
        return self.a

@cute.jit
def foo(a: Int32, res: cute.Tensor):
    foo = Foo(a)
    for i in cutlass.range_dynamic(10):
        foo.set_a(i)

    # This fails to compile because `a` is assigned a local value defined within the for-loop body
    # and is not visible outside of the loop body
    res[0] = foo.get_a()

The example above fails to compile because Foo.a is assigned a local value defined within the for-loop body, which is not visible outside the loop body.

The CuTe DSL implements an internal mechanism that provides limited support for OOP patterns via a protocol. As the DSL continues to evolve to support additional features, this mechanism is subject to change, so using it directly in user code is not recommended where portability matters.

CuTe Layout algebra in native Python

All CuTe Layout algebra operations and APIs require JIT compilation. They are available only within JIT-compiled functions and cannot be used in standard Python execution environments.

Additionally, only a restricted set of data types can be passed as arguments to JIT-compiled functions, which further constrains their usage in native Python contexts. Only the following CuTe algebra types are supported as JIT function arguments: Tensor, Pointer, Shape, Stride, Coord, and IntTuple. For Stride, ScaledBasis is not supported from a native Python context. Passing a Layout from a native Python context is not supported in the first release.

Suggestions#

For reliable and predictable results:

  • Avoid dependent types in your code

  • Implement explicit type conversion for dynamic values

  • Clearly distinguish between static (compile-time) and dynamic (runtime) values

  • Use type annotations wherever possible so that the JIT compiler can identify types without ambiguity

# Example demonstrating explicit typing
alpha = 1.0  # Explicitly a float: write `1.0` or `float(1)`
             # instead of `1`
beta = 2.0   # Explicitly defined as float
result = max(alpha, beta)  # Will correctly perform float comparison

Debugging Capabilities

Debugging tools and facilities for the Python DSL are currently more limited than those for the C++ API. For instance, single-stepping through JIT-compiled code is not supported. The lack of exception handling in JIT-compiled code also makes debugging difficult in some cases.

Integration with Frameworks

Integration with certain deep learning frameworks is in early development stages and may have limitations. For instance, converting a framework tensor to cute.Tensor is known to incur an overhead of 2 to 3 microseconds per tensor, because the conversion goes through the general DLPack protocol, which offers compatibility with all frameworks.

Hashing DSL APIs and Objects

DSL APIs and objects are sensitive to the MLIR context, region, and other contextual information that has no meaning across different contexts. Any stateful design that relies on __hash__ is likely to misbehave with unexpected results. An example is functools.lru_cache: combined with @cute.jit, it may cache an MLIR object from one context and use it in another.
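
The caching hazard can be reproduced in plain Python without any DSL objects. In the sketch below, `Context`, `Op`, and `make_constant` are hypothetical stand-ins: `Context` plays the role of an MLIR context and `Op` the role of an object that is only meaningful inside the context that created it. Only `functools.lru_cache` is a real API.

```python
import functools


class Context:
    """Hypothetical stand-in for an MLIR context."""

    def __init__(self, name):
        self.name = name


class Op:
    """Hypothetical object that is only meaningful inside the context that created it."""

    def __init__(self, context, value):
        self.context = context
        self.value = value


current_context = None


@functools.lru_cache(maxsize=None)
def make_constant(value):
    # The cache key is only `value`; the active context is not part of it.
    return Op(current_context, value)


current_context = Context("first compilation")
first = make_constant(1)

current_context = Context("second compilation")
second = make_constant(1)  # cache hit: the object built in the first context is returned

print(second is first)                    # True
print(second.context is current_context)  # False
```

The second call never reaches the function body, so nothing has a chance to notice that the context has changed.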

Future Improvements#

The CuTe DSL development team is actively addressing these limitations. Upcoming releases will aim to:

  • Implement support for return values from JIT compiled functions

  • Improve support for built-in operators to handle more cases without dependent types

  • Enhance debugging capabilities and tools

  • Improve error messages with precise diagnostic information

  • Extend support for additional numeric data types

  • Improve the performance of converting framework tensors to cute.Tensor with native support for different frameworks

  • Offer a more user-friendly benchmarking methodology

Design Limitations Likely to Remain#

The primary objective of CuTe DSL is to provide a domain-specific language for expressing complex CUDA kernels with optimal GPU performance, not to execute arbitrary Python code on GPU hardware.

The following limitations will likely remain by design:

  • Complex Data Structures as Dynamic Values: Lists, tuples, and dictionaries will continue to function as static containers. While they can store dynamic values, their structure (adding/removing elements) cannot be modified during execution of JIT-compiled functions.

  • Dependent Types: Supporting dependent types would introduce substantial complexity and adversely affect the performance characteristics of generated code.

  • CuTe Layout Algebra: There is no plan to extend support for CuTe Layout Algebra to native Python contexts. The plan is instead to extend support for data types and allow JIT functions to interoperate with native Python code.