Optimization Hints

A tile optimization hint is metadata attached to a C++ source construct, such as an entity, statement, or expression, that guides compiler optimizations related to that construct. Hints do not affect the semantics of the program and may be ignored by the compiler.

Example

The first hint below tells the compiler to prefer \(4\) thread blocks for each thread block cluster when launching the kernel. The second hint tells the compiler to avoid using TMA when loading data from x.

Both hints might be ignored by the compiler.

[[ cutile::hint(0, num_cta_in_cga=4) ]]
__tile_global__ void kernel(int* x) {
  namespace ct = ::cuda::tiles;
  using namespace ct::literals;
  auto S = ct::tensor_span{x, ct::shape{512_ic, 128}};
  auto P = ct::partition_view{S, ct::shape{32_ic, 64_ic}};

  ct::tile<int, ct::shape<32, 64>> y;

  [[ cutile::hint(0, allow_tma=false) ]]
  y = P.load(0, 1);
}

Hint Specification

A tile optimization hint contains the following information:

A numeric encoding of the target architecture for which the hint will be applicable. The architecture’s numeric encoding is the same as that used by the __CUDA_ARCH__ macro. When compiling for a specific target, only hints matching that target will be considered. The special value of 0 indicates a hint that applies to all architectures. Such hints are called “architecture agnostic”.
A hint kind which indicates the type of information that the hint specifies. The supported hint kinds are specified below.
A hint value whose interpretation depends on the hint kind.
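
As a rough illustration of the architecture field, the following standalone C++17 sketch models how a hint's architecture could be compared with the compilation target. The helper names arch_encoding and hint_matches_target are hypothetical and are not part of cuTile; the formula major * 100 + minor * 10 is the convention used by the __CUDA_ARCH__ macro.

```cpp
// Hypothetical helpers, not part of cuTile: a model of the architecture field.

// Numeric encoding of compute capability major.minor, following the
// __CUDA_ARCH__ convention (for example, sm_90 -> 900 and sm_100 -> 1000).
constexpr int arch_encoding(int major, int minor) {
  return major * 100 + minor * 10;
}

// A hint is considered for a target if it is architecture agnostic (0)
// or if its architecture equals the encoding of the target.
constexpr bool hint_matches_target(int hint_arch, int target_arch) {
  return hint_arch == 0 || hint_arch == target_arch;
}
```

Under this model, a hint written with architecture 1000 is skipped when compiling for sm_90, while a hint written with architecture 0 is considered for every target.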

A tile optimization hint may appertain to one or more C++ constructs. When a hint appertains to a construct, the implementation may use the hint to make optimization decisions related to that construct.

Tile optimization hints are specified using the C++ attribute cutile::hint in the cutile attribute namespace. The attribute-argument-clause [1] of a cutile::hint attribute shall satisfy the following grammar production rules in addition to those production rules required by the C++ standard [2]:

attribute-argument-clause ::=  "(" constant-expression "," tile-hint-kv-pair-list ")"
tile-hint-kv-pair-list    ::=  tile-hint-kv-pair
                            |  tile-hint-kv-pair "," tile-hint-kv-pair-list
tile-hint-kv-pair         ::=  tile-hint-kind "=" constant-expression
tile-hint-kind            ::=  identifier | keyword

Note

The grammar is unambiguous because constant-expression cannot produce a top-level comma expression.

// Syntax error, '3' is not a tile-hint-kv-pair
[[ cutile::hint(0, latency=4, 3) ]]

// OK: '(4, 3)' is an integral constant expression yielding '3'.
[[ cutile::hint(0, latency=(4, 3)) ]]

The first argument of the attribute-argument-clause shall be an integral constant expression whose value is the attribute’s target architecture. For each tile-hint-kv-pair the constant-expression shall be an integral constant expression designating the key-value pair’s value.

The target architecture and the value of each key-value pair may be type-dependent or value-dependent. A cutile::hint attribute may participate in a pack expansion, in which case a separate instance of the attribute is created for each element of the pack.

Example

Zero or more cutile::hint attributes may be generated for the function entity yielded by a given template instantiation of kernel:

template<size_t...> struct ints { };

template<size_t... Archs, size_t... Xs>
[[ cutile::hint(Archs, occupancy = Xs) ... ]]
__tile_global__ void kernel(ints<Archs...>, ints<Xs...>) { }
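
To make the pairing performed by the pack expansion more concrete, here is a minimal sketch in plain C++17, without the attribute or the kernel qualifier. The function expanded_hints is a hypothetical helper, not a cuTile API; it only models which (architecture, value) pairs one instantiation would produce, one per pack element.

```cpp
#include <array>
#include <cstddef>
#include <utility>

template <std::size_t...> struct ints {};

// Hypothetical model: returns one (architecture, occupancy) pair per pack
// element, mirroring how the attribute is instantiated once per element.
template <std::size_t... Archs, std::size_t... Xs>
constexpr auto expanded_hints(ints<Archs...>, ints<Xs...>) {
  static_assert(sizeof...(Archs) == sizeof...(Xs),
                "both packs must have the same length to expand together");
  using pair_t = std::pair<std::size_t, std::size_t>;
  return std::array<pair_t, sizeof...(Archs)>{pair_t{Archs, Xs}...};
}
```

In this model, calling expanded_hints(ints<900, 1000>{}, ints<4, 8>{}) yields the pairs (900, 4) and (1000, 8), and empty packs yield no pairs at all.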

Each tile-hint-kv-pair of a cutile::hint attribute generates a tile optimization hint if all of the following hold:

The target architecture of the enclosing attribute is recognized by the implementation. This becomes the generated hint’s architecture.
The tile-hint-kind of the key-value pair is one of the known list of hint kinds specified below. This becomes the generated hint’s kind.
The key-value pair’s value is valid for the associated hint kind and architecture as specified below. This becomes the generated hint’s value.

Note

If a tile-hint-kv-pair does not generate a tile optimization hint, it is ignored.
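
The three conditions above can be sketched as a single predicate. The following is a standalone C++17 model, not the cuTile implementation: the names kv_pair, arch_recognized, value_valid, and generates_hint are hypothetical, the set of recognized architectures is an illustrative assumption, and the value ranges are the ones listed for each hint kind later in this section.

```cpp
#include <string_view>

// Hypothetical model of one tile-hint-kv-pair.
struct kv_pair {
  std::string_view kind;
  long long value;
};

// Assumption: this set of recognized architectures is illustrative only.
constexpr bool arch_recognized(int arch) {
  return arch == 0 || arch == 800 || arch == 900 || arch == 1000;
}

// Value ranges taken from the hint kind descriptions in this section.
constexpr bool value_valid(std::string_view kind, long long v, int arch) {
  if (kind == "num_cta_in_cga") {
    if (arch == 800) return v == 1;  // sm_80 accepts only the value 1
    return v == 1 || v == 2 || v == 4 || v == 8 || v == 16;
  }
  if (kind == "occupancy") return v >= 1 && v <= 32;
  if (kind == "latency") return v >= 1 && v <= 10;
  if (kind == "allow_tma") return v == 0 || v == 1;  // false or true
  return false;  // unknown hint kind
}

// A key-value pair generates a hint only if every condition holds.
constexpr bool generates_hint(int arch, kv_pair kv) {
  return arch_recognized(arch) && value_valid(kv.kind, kv.value, arch);
}
```

In this model a pair such as latency=11 or an unknown kind simply generates nothing, which corresponds to the pair being ignored rather than diagnosed.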

A tile optimization hint appertains to a construct if:

For function entities, there exists a cutile::hint attribute which appertains [1] to the function and generates the optimization hint.
For expressions, the expression inhabits an expression-statement, and there exists a cutile::hint attribute which appertains [1] to the enclosing statement and generates the optimization hint.
In either case, the tile optimization hint is relevant to the construct according to the rules specific to that hint kind defined below.

Note

If a generated hint does not appertain to any construct, it is ignored.

namespace ct = ::cuda::tiles;

// Ignored, not an expression-statement
[[ cutile::hint(0, latency=4) ]]
auto x = ct::load(y);

// Ignored, not an expression-statement
[[ cutile::hint(0, latency=4) ]]
if (auto x = ct::load(y)) { }

// Ignored, not an expression-statement
[[ cutile::hint(0, latency=4) ]]
return ct::load(y);

If two or more tile hints with the same kind and architecture appertain to the same construct, only the one generated by the tile-hint-kv-pair and attribute that are last in lexical order is considered. All others are ignored.

Example

In each of the three examples below, only the latency=5 hint is applied.

namespace ct = ::cuda::tiles;

// Example 1
[[ cutile::hint(0, latency=4, latency=5) ]]
x = ct::load(y);

// Example 2
[[ cutile::hint(0, latency=4) ]]
[[ cutile::hint(0, latency=5) ]]
x = ct::load(y);

// Example 3
[[ using cutile :
  hint(0, latency=4),
  hint(0, latency=5)
]]
x = ct::load(y);
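
The "last in lexical order" rule can be modeled as a simple scan in which later entries overwrite earlier ones. This is a minimal C++17 sketch under the assumption that the generated hints for one construct are available as a list in lexical order; the hint struct and effective_value function are hypothetical names, not cuTile APIs.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical record of one generated hint.
struct hint {
  int arch;
  std::string_view kind;
  int value;
};

// Scans the hints in lexical order and returns the value of the last one
// matching the given architecture and kind, or fallback if none matches.
template <std::size_t N>
constexpr int effective_value(const std::array<hint, N>& hints, int arch,
                              std::string_view kind, int fallback) {
  int result = fallback;
  for (const hint& h : hints) {
    if (h.arch == arch && h.kind == kind) {
      result = h.value;  // a later hint replaces an earlier one
    }
  }
  return result;
}
```

Applied to the hints latency=4 followed by latency=5 for architecture 0, the model returns 5 regardless of whether the two pairs came from one attribute or from two.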

If an architecture agnostic hint does not have a supported kind or value for the targeted architecture, the hint is ignored for that architecture.

Example

The hint below is ignored when targeting sm_80 or below, but is active for other architectures.

[[ cutile::hint(0, num_cta_in_cga=2) ]]
__tile_global__ void kernel0() { }

When both an architecture agnostic and an architecture specific hint with the same kind appertain to the same entity, the architecture specific hint takes precedence when the matching architecture is targeted.

Example

In the code below, the latency is specified to be \(4\) for all architectures except sm_100 where it is specified as \(5\).

namespace ct = ::cuda::tiles;

[[ using cutile : hint(0, latency=4), hint(1000, latency=5) ]]
ct::load(x);
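
The precedence rule can be sketched as follows. This is a standalone C++17 model with hypothetical names (latency_hint, resolve_latency), assuming all hints in the list have the same kind and appertain to the same construct; it is not how the compiler is necessarily implemented.

```cpp
#include <array>
#include <cstddef>

// Hypothetical record of a latency hint with its architecture encoding.
struct latency_hint {
  int arch;  // 0 means architecture agnostic
  int value;
};

// Picks the hint value for target_arch: an architecture specific hint wins
// over an architecture agnostic one; fallback is returned if neither exists.
template <std::size_t N>
constexpr int resolve_latency(const std::array<latency_hint, N>& hints,
                              int target_arch, int fallback) {
  int agnostic = fallback;
  int specific = fallback;
  bool has_specific = false;
  for (const latency_hint& h : hints) {
    if (h.arch == target_arch) {
      specific = h.value;
      has_specific = true;
    } else if (h.arch == 0) {
      agnostic = h.value;
    }
  }
  return has_specific ? specific : agnostic;
}
```

With the hints (0, 4) and (1000, 5) from the example above, the model selects 5 when the target is 1000 and 4 for any other target.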

Hint Kinds

num_cta_in_cga

The num_cta_in_cga hint kind indicates the recommended number of cooperative thread arrays (CTAs, that is, thread blocks) that should be allocated per cooperative grid array (CGA, that is, thread block cluster) when launching a kernel. The supported values for this hint are \(1\), \(2\), \(4\), \(8\), and \(16\). This hint may appertain only to tile kernel functions.

For a num_cta_in_cga hint with architecture sm_80, the only applicable value is \(1\).

Example

In the following example, the compiler is instructed to prefer \(4\) thread blocks per cluster for sm_90 targets and \(8\) thread blocks per cluster for sm_100 targets when launching the kernel.

[[ using cutile :
  hint(900, num_cta_in_cga=4),
  hint(1000, num_cta_in_cga=8) ]]
__tile_global__ void kernel1() { }

occupancy

The occupancy hint kind indicates the recommended occupancy when launching a kernel. The supported values for this hint are the inclusive integer range \([1, 32]\). This hint may appertain only to tile kernel functions.

Example

In the following example, the compiler is instructed to prefer an occupancy of \(15\) for all architectures except for sm_90, where it will prefer \(22\).

[[ using cutile :
  hint(0, occupancy=15),
  hint(900, occupancy=22) ]]
__tile_global__ void kernel2() { }

latency

The latency hint kind indicates the relative memory access latency of a memory operation. The supported values for this hint are the inclusive integer range \([1, 10]\). This hint may appertain only to direct call expressions to:

Example

In the following example, a latency hint of \(7\) is specified on all the call expressions.

namespace ct = ::cuda::tiles;

// Example 1
[[ cutile::hint(0, latency=7) ]]
z = ct::load(x) + ct::load(y);

// Example 2
[[ cutile::hint(0, latency=7) ]]
ct::store(ptr, z);

using namespace ct::literals;
auto t = ct::tensor_span{ptr, ct::shape{128_ic, 4_ic}};
auto p = ct::partition_view{t, ct::shape{4_ic, 4_ic}};

ct::tile<int, ct::shape<4, 4>> value;

[[ cutile::hint(0, latency=7) ]]
value = p.load(0, 0);

[[ cutile::hint(0, latency=7) ]]
p.store(value, 0, 0);

allow_tma

The allow_tma hint kind indicates whether the compiler may use the Tensor Memory Accelerator (TMA) when loading from or storing to view memory. A value of false instructs the compiler to prefer traditional memory operations over TMA. The supported values for this hint are true and false. This hint may appertain only to direct call expressions to:

Example

In the following example, the hints instruct the compiler to prefer not to use TMA for the partition view load and store.

namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto t = ct::tensor_span{ptr, ct::shape{128_ic, 4_ic}};
auto p = ct::partition_view{t, ct::shape{4_ic, 4_ic}};

ct::tile<int, ct::shape<4, 4>> value;

[[ cutile::hint(0, allow_tma=false) ]]
value = p.load(0, 0);

[[ cutile::hint(0, allow_tma=false) ]]
p.store(value, 0, 0);

Footnotes

[1] See § 9.12.1 [dcl.attr.grammar] of ISO/IEC 14882:2024.
[2] See § 5.10 [lex.name], § 5.11 [lex.key], and § 7.7 [expr.const] of ISO/IEC 14882:2024 for definitions of the grammar symbols identifier, keyword, and constant-expression.