11. Optimization Guide

Tile IR programs run as portable bytecode that the compiler maps onto a specific GPU target. Most performance decisions — instruction selection, pipelining, register/shared-memory budget, warp specialization, and so on — are made by the compiler. This section describes the user-controlled knobs that influence those decisions and the situations in which adjusting them is worthwhile.

The current set of knobs is exposed as optimization hints attached to ops. Hints are advisory: they tell the compiler what to prefer when it has a choice, but the compiler may override a hint if honoring it would produce an invalid or unprofitable program.

11.1. Optimization Hints

Optimization hints are carried on the optimization_hints attribute (see the OptimizationHints attribute reference). The attribute is a dictionary keyed by architecture (e.g. sm_100, sm_120), and the value for each architecture is itself a dictionary of per-op hints. A default key supplies hints that apply when no target-specific entry is present, provided the default value is valid for the current compilation target.

Example:

optimization_hints=<
  sm_100 = {num_cta_in_cga = 8, num_worker_warps_per_cta = 8},
  sm_120 = {num_cta_in_cga = 16, num_worker_warps_per_cta = 4},
  default = {latency = 4}
>
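
The lookup rule described above can be sketched in Python. This is a minimal illustration, not the compiler's implementation: the function name `resolve_hints`, the plain-dictionary representation, the `is_valid_for_target` callback, and the placeholder target name `sm_xyz` are all hypothetical. The sketch also assumes that the fallback applies to the whole per-architecture entry rather than to individual hints.

```python
from typing import Callable, Dict, Optional

Hints = Dict[str, int]


def resolve_hints(
    optimization_hints: Dict[str, Hints],
    target: str,
    is_valid_for_target: Callable[[Hints, str], bool] = lambda hints, target: True,
) -> Optional[Hints]:
    """Pick the hint dictionary for `target` (hypothetical helper).

    A target-specific entry wins. Otherwise the `default` entry is used,
    but only if it is valid for the target. Returns None when nothing applies.
    """
    if target in optimization_hints:
        return optimization_hints[target]
    default = optimization_hints.get("default")
    if default is not None and is_valid_for_target(default, target):
        return default
    return None


# Same data as the example above, written as a plain Python dictionary.
hints = {
    "sm_100": {"num_cta_in_cga": 8, "num_worker_warps_per_cta": 8},
    "sm_120": {"num_cta_in_cga": 16, "num_worker_warps_per_cta": 4},
    "default": {"latency": 4},
}

print(resolve_hints(hints, "sm_100"))  # target-specific entry
print(resolve_hints(hints, "sm_xyz"))  # no entry for this target: default entry
```

Under these assumptions, a target with its own entry never sees the default hints, so any hint that should apply everywhere has to be repeated in each target-specific entry.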

The available hints are enumerated below. Each entry lists the hint name, where it applies, accepted values, and how it is used.

11.1.1. num_cta_in_cga

  • Applies to: cuda_tile.entry

  • Type: integer

  • Accepted values: powers of 2 in the closed range [1, 16].

  • Default: 1 (single-CTA grouping).

Suggests the number of CTAs that the compiler should group together when launching the kernel. Larger groupings allow more inter-CTA cooperation — including dual-CTA MMA instructions, multicast TMA, distributed shared memory, and cluster-wide barriers — but place additional constraints on resource usage and on the launch grid. When the value is 1, none of these multi-CTA hardware features are available.
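
The accepted-value rule for num_cta_in_cga can be checked before the attribute is emitted. The following is a minimal Python sketch; the helper name `is_valid_num_cta_in_cga` is hypothetical and not part of any Tile IR tooling.

```python
def is_valid_num_cta_in_cga(value: int) -> bool:
    """Return True for powers of 2 in the closed range [1, 16]."""
    # A positive integer is a power of 2 exactly when it has a single set bit,
    # in which case clearing the lowest set bit leaves zero.
    return 1 <= value <= 16 and (value & (value - 1)) == 0


accepted = [v for v in range(0, 20) if is_valid_num_cta_in_cga(v)]
print(accepted)  # [1, 2, 4, 8, 16]
```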

11.1.2. num_worker_warps_per_cta

Suggests the number of warps in each CTA that should act as compute workers. Larger values increase the amount of work in flight per CTA at the cost of higher register pressure per warp. When the compiler synthesizes a producer/consumer (warp-specialized) schedule, reducing the number of worker warps frees warps to act as data movers, which can hide memory latency on bandwidth-bound kernels.

11.1.3. allow_tma

Suggests whether the compiler should use the Tensor Memory Accelerator (TMA) to implement view-based loads and stores when the target supports it. The default is to use TMA when possible. Set this to false to force a path that does not use TMA — useful for debugging and for kernels where TMA setup overhead is not amortized.

11.1.4. latency

A unitless score that controls the compiler’s memory-prefetch heuristic for the load or store it is attached to. Larger values bias the compiler toward a deeper prefetch pipeline, which tends to perform better when L2 traffic is heavy and the load latency is high. Smaller values bias toward shallower pipelines, which reduce register pressure and shared-memory occupancy. The default, -1, lets the compiler choose a value in the range 1..10 based on its own latency estimate; supplying an explicit value overrides that estimate.
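
The interaction between the default and an explicit value can be pictured as follows. This Python sketch models only the selection rule stated above. The names `effective_latency` and `compiler_estimate` are hypothetical, and clamping the compiler's own estimate into 1..10 is an assumption about how that range is enforced.

```python
DEFAULT_LATENCY = -1  # sentinel from the text: let the compiler choose


def effective_latency(hint: int, compiler_estimate: int) -> int:
    """Model which latency score drives the prefetch heuristic (hypothetical)."""
    if hint == DEFAULT_LATENCY:
        # Assumption: the compiler's estimate is clamped into 1..10.
        return max(1, min(10, compiler_estimate))
    # An explicit value overrides the compiler's estimate.
    return hint


print(effective_latency(-1, compiler_estimate=7))   # 7
print(effective_latency(-1, compiler_estimate=25))  # 10
print(effective_latency(4, compiler_estimate=7))    # 4
```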

Note

Each hint above is advisory. If a hint conflicts with hardware constraints, with other hints, or with what the compiler determines to be a valid and efficient schedule, it may be ignored. The compiler does not emit a diagnostic when it declines to apply a hint.