End-to-End Code Generation#

1. Techniques for Turning Python into intermediate representation (IR)#

1.1 AST rewrite#

The function’s abstract-syntax tree is analysed before execution. Python control-flow (for/while, if/else) and built-ins are converted to structured intermediate representation (IR) constructs. Computation inside each region is left untouched at this stage.

Advantages

  • Sees the entire program, so every branch and loop is preserved.

  • Keeps loop structure intact for optimization such as tiling, vectorisation or GPU thread mapping.

Disadvantages

  • Requires a well-defined Python subset that the rewriter understands.

1.2 Tracing#

The decorated function is executed once with proxy arguments; overloaded operators record every tensor operation that actually runs and produce a flat trace that is lowered to intermediate representation (IR).

Advantages

  • Near-zero compile latency, ideal for straight-line arithmetic.

  • No need to parse Python source, so it supports many dynamic Python features, and Python has many features.

Disadvantages

  • Untaken branches vanish, so the generated kernel may be wrong for other inputs.

  • Loops are flattened to the iteration count observed during tracing.

  • Data-dependent control-flow freezes to a single execution path.

2. CuTe DSL Code-Generation Modes#

CuTe’s Python front-end combines the techniques above into two mutually exclusive modes, selectable with the preprocessor flag of the @jit decorator:

1. Tracing mode @jit(preprocess=False) – tracing only. This results in the fastest compilation path and is recommended only for kernels that are guaranteed to be straight-line arithmetic. It suffers from all tracing limitations listed in the previous section.

2. Preprocessor mode (default) @jit(preprocess=True)AST rewrite + tracing. The AST pass captures every loop and branch, eliminating the correctness and optimisation problems of pure tracing; tracing then fills in the arithmetic. This hybrid “preprocessor” pipeline is unique to CuTe DSL and was designed specifically to overcome the disadvantages identified above.

../../../../_images/dsl_modes.png

Figure 1 Left: tracing mode records only the path that executed. Right: preprocessor mode emits structured intermediate representation (IR) for every branch and loop before tracing the arithmetic.#

Why Tracing-Only Is Insufficient for Control-Flow#

  • Branch loss – The untaken side of an if/else is never lowered.

  • Loop unrolling – Loops are flattened to the iteration count observed, destroying structure needed for parallel mapping and tiling.

  • Data-dependent paths – Control-flow that depends on tensor values freezes to a single execution path at trace time.

The preprocessor mode fixes all of these by lowering control-flow first and delegating only the arithmetic to the tracer.