11. Optimization Guide
Tile IR programs run as portable bytecode that the compiler maps onto a specific GPU target. Most performance decisions — instruction selection, pipelining, register/shared-memory budget, warp specialization, and so on — are made by the compiler. This section describes the user-controlled knobs that influence those decisions and the situations in which adjusting them is worthwhile.
The current set of knobs is exposed as optimization hints attached to ops. Hints are advisory: they tell the compiler what to prefer when it has a choice, but the compiler may override a hint if it would produce an invalid or unprofitable program.
11.1. Optimization Hints
Optimization hints are carried on the optimization_hints attribute (see
the OptimizationHints attribute reference). The attribute is
a dictionary keyed by architecture (e.g. sm_100, sm_120), and the
value for each architecture is itself a dictionary of per-op hints. A
default key may be used to provide a value that applies whenever no
target-specific entry is present and the default value is valid for the
current compilation target.
Example:
optimization_hints=<
  sm_100 = {num_cta_in_cga = 8, num_worker_warps_per_cta = 8},
  sm_120 = {num_cta_in_cga = 16, num_worker_warps_per_cta = 4},
  default = {latency = 4}
>
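The lookup rule described above can be modeled in a few lines of Python. This is a minimal sketch, not the compiler's implementation: `resolve_hints` is a hypothetical helper, the attribute is represented as plain nested dictionaries, and `sm_90` is used only as an example of a target with no entry of its own. It assumes that `default` is consulted only when the target has no entry at all (the text does not say whether fallback happens per target or per hint), and it does not model the check that the default value is valid for the target.

```python
def resolve_hints(optimization_hints, target):
    """Return the hint dictionary that applies to `target`.

    Assumption: fallback is per architecture entry, so `default` is used
    only when `target` has no entry of its own.
    """
    if target in optimization_hints:
        return dict(optimization_hints[target])
    # No target-specific entry: fall back to `default`, if one was given.
    return dict(optimization_hints.get("default", {}))


# The example attribute above, transcribed as nested dictionaries.
hints = {
    "sm_100": {"num_cta_in_cga": 8, "num_worker_warps_per_cta": 8},
    "sm_120": {"num_cta_in_cga": 16, "num_worker_warps_per_cta": 4},
    "default": {"latency": 4},
}

print(resolve_hints(hints, "sm_100"))  # the sm_100 entry
print(resolve_hints(hints, "sm_90"))   # no sm_90 entry, so the default entry
```

Under this reading, the `latency = 4` default in the example would apply only to targets other than sm_100 and sm_120.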
The available hints are enumerated below. Each entry lists the hint name, where it applies, accepted values, and how it is used.
11.1.1. num_cta_in_cga
Applies to: cuda_tile.entry
Type: integer
Accepted values: powers of 2 in the closed range [1, 16].
Default: 1 (single-CTA grouping).
Suggests the number of CTAs that the compiler should group together when
launching the kernel. Larger groupings allow more inter-CTA cooperation —
including dual-CTA MMA instructions, multicast TMA, distributed shared
memory, and cluster-wide barriers — but place additional constraints on
resource usage and on the launch grid. When the value is 1, none of
these multi-CTA hardware features are available.
11.1.2. num_worker_warps_per_cta
Applies to: cuda_tile.entry
Type: integer
Accepted values: 4 or 8.
Suggests the number of warps in each CTA that should act as compute workers. Larger values increase the amount of work in flight per CTA at the cost of higher register pressure per warp. When the compiler synthesizes a producer/consumer (warp-specialized) schedule, reducing the number of worker warps frees warps to act as data movers, which can hide memory latency on bandwidth-bound kernels.
11.1.3. allow_tma
Applies to: cuda_tile.load_view_tko, cuda_tile.store_view_tko
Type: boolean
Suggests whether the compiler should use the Tensor Memory Accelerator (TMA)
to implement view-based loads and stores when the target supports it. The
default is to use TMA when possible. Set this to false to request a path
that does not use TMA, which is useful for debugging and for kernels where
TMA setup overhead is not amortized.
11.1.4. latency
Applies to: cuda_tile.load_view_tko, cuda_tile.store_view_tko
Type: integer
Accepted values: integers 1 through 10, or -1 to let the compiler estimate.
A unitless score that controls the compiler’s memory-prefetch heuristic for
this load or store. Larger values bias the compiler toward a deeper prefetch
pipeline, which tends to perform better when L2 traffic is heavy and the
load latency is high. Smaller values bias toward shallower pipelines, which
reduce register pressure and shared-memory occupancy. The default -1
lets the compiler choose a value in 1..10 based on its own latency
estimate; supplying an explicit value overrides that estimate.
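The way an explicit value and the -1 default interact can be pictured with a small Python model. This is only an illustrative sketch: `effective_latency` is a hypothetical function, and `compiler_estimate` stands in for the compiler's internal estimate, which is not exposed to users; the placeholder value 5 is arbitrary.

```python
def effective_latency(hint=-1, compiler_estimate=5):
    """Model how a `latency` hint resolves to a score in 1..10.

    `compiler_estimate` is a stand-in for the compiler's own latency
    estimate; it is an assumption of this sketch, not a real parameter.
    """
    if hint == -1:
        # Default: defer to the compiler's estimate.
        return compiler_estimate
    if not 1 <= hint <= 10:
        raise ValueError(f"latency must be -1 or in 1..10, got {hint}")
    # An explicit value overrides the estimate.
    return hint


print(effective_latency())        # uses the placeholder estimate
print(effective_latency(8, 3))    # explicit hint wins over the estimate
```

Because the hint is advisory, the real compiler may still settle on a different pipeline depth than the score suggests.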
Note
Each hint above is advisory. If a hint conflicts with hardware constraints, with other hints, or with what the compiler determines to be a valid and efficient schedule, it may be ignored. The compiler will not diagnose a hint that it chose not to apply.
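Since an unapplied hint produces no diagnostic, it can be useful to check hint values against the documented ranges before compiling. The following Python sketch shows one way to do that outside the compiler. The names `HINT_RULES` and `check_hints` are hypothetical and not part of any Tile IR tooling; the rules are transcribed from the hint descriptions above.

```python
def _is_int(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return type(v) is int


ENTRY = ("cuda_tile.entry",)
VIEW_OPS = ("cuda_tile.load_view_tko", "cuda_tile.store_view_tko")

# hint name -> (ops the hint applies to, predicate for accepted values)
HINT_RULES = {
    # Powers of 2 in the closed range [1, 16].
    "num_cta_in_cga": (ENTRY, lambda v: _is_int(v) and 1 <= v <= 16 and v & (v - 1) == 0),
    "num_worker_warps_per_cta": (ENTRY, lambda v: _is_int(v) and v in (4, 8)),
    "allow_tma": (VIEW_OPS, lambda v: isinstance(v, bool)),
    # 1 through 10, or -1 to let the compiler estimate.
    "latency": (VIEW_OPS, lambda v: _is_int(v) and (v == -1 or 1 <= v <= 10)),
}


def check_hints(op_name, hints):
    """Return a list of problems found in `hints` for the op `op_name`."""
    problems = []
    for name, value in hints.items():
        if name not in HINT_RULES:
            problems.append(f"unknown hint {name!r}")
            continue
        ops, accepts = HINT_RULES[name]
        if op_name not in ops:
            problems.append(f"{name} does not apply to {op_name}")
        elif not accepts(value):
            problems.append(f"{name} = {value!r} is outside the accepted values")
    return problems


print(check_hints("cuda_tile.entry", {"num_cta_in_cga": 12}))
print(check_hints("cuda_tile.load_view_tko", {"latency": -1, "allow_tma": False}))
```

An empty list means every hint is within its documented range and attached to an op it applies to. It does not mean the compiler will honor the hint.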