Performance Tuning#
Several performance tuning techniques are available in cuTile:
architecture-specific configuration values, using
ByTarget; and load/store hints such as
latency and allow_tma.
Architecture-specific configuration#
- class cuda.tile.ByTarget(*, default=UNSPECIFIED, **value_by_target)#
Type used to specify a value that depends on the target GPU architecture.
- Parameters:
  - default – The fallback value to use when the target GPU architecture is not explicitly listed in value_by_target.
  - value_by_target – Mapping from GPU architecture name to value. Keys must be strings of the form "sm_<major><minor>", such as "sm_100" or "sm_120".
Examples

Use one num_ctas value for all architectures:

from cuda.tile import kernel

@kernel(num_ctas=8)
def kernel_fn(x):
    ...

Use different num_ctas values for specific architectures, and a fallback value for all others:

from cuda.tile import kernel, ByTarget

@kernel(num_ctas=ByTarget(sm_100=8, sm_120=4, default=2))
def kernel_fn(x):
    ...
See Tile Kernels for the full description of kernel configuration
parameters such as num_ctas, occupancy and opt_level. Any of
these options may be given as a ByTarget value to specialize them
for different GPU architectures.
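To make the selection semantics concrete, the following is a minimal pure-Python sketch of how a ByTarget-style value resolves for a given architecture. The resolve_by_target helper is hypothetical and not part of the cuTile API; the actual lookup happens inside the compiler when it specializes the kernel for the target GPU.

```python
def resolve_by_target(value_by_target, target, default=None):
    # Hypothetical model of ByTarget resolution (not the cuTile
    # implementation): return the value registered for the target
    # architecture, falling back to the default otherwise.
    return value_by_target.get(target, default)

num_ctas = {"sm_100": 8, "sm_120": 4}
resolve_by_target(num_ctas, "sm_100", default=2)  # -> 8 (explicit entry)
resolve_by_target(num_ctas, "sm_90", default=2)   # -> 2 (fallback)
```

In this model, an architecture not listed in value_by_target silently receives the default, which mirrors the documented fallback behavior of ByTarget.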
Load/store performance hints#
The load() and store() operations accept optional keyword
arguments that can influence how memory traffic is scheduled and lowered:
- latency (int or None) – A hint indicating how heavy the DRAM traffic will be for this operation. It shall be an integer between 1 (low) and 10 (high). If it is None or not provided, the compiler will infer the latency.
- allow_tma (bool or None) – If True, the load or store may be lowered to use TMA (Tensor Memory Accelerator) when the target architecture supports it. If False, TMA will not be used for this operation. By default, TMA is allowed.
These hints are optional: kernels will compile and run without specifying them, but providing them can help the compiler make better code-generation decisions for a particular memory-access pattern.
Example#
import cuda.tile as ct

TILE_SIZE = 16

@ct.kernel
def load_store_with_hints_kernel(x, y):
    bid = ct.bid(0)
    tx = ct.load(
        x,
        index=(bid,),
        shape=(TILE_SIZE,),
        latency=8,  # high-latency DRAM load
    )
    ct.store(
        y,
        index=(bid,),
        tile=tx,
        latency=2,  # cheaper write
        allow_tma=False,  # disallow TMA
    )