Optimization Hints
A tile optimization hint is metadata attached to a C++ source construct such as an entity, statement, or expression which is used to guide compiler optimizations related to that construct. Hints do not affect the semantics of the program and may be ignored by the compiler.
Example
The first hint below tells the compiler to prefer \(4\) thread blocks for
each thread block cluster when launching the kernel. The second hint tells
the compiler to avoid using TMA when loading data from x.
Both hints might be ignored by the compiler.
[[ cutile::hint(0, num_cta_in_cga=4) ]]
__tile_global__ void kernel(int* x) {
namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto S = ct::tensor_span{x, ct::shape{512_ic, 128}};
auto P = ct::partition_view{S, ct::shape{32_ic, 64_ic}};
ct::tile<int, ct::shape<32, 64>> y;
[[ cutile::hint(0, allow_tma=false) ]]
y = P.load(0, 1);
}
Hint Specification
A tile optimization hint contains the following information:
A numeric encoding of the target architecture for which the hint will be applicable. The architecture’s numeric encoding is the same as that used by the __CUDA_ARCH__ macro. When compiling for a specific target, only hints matching that target will be considered. The special value of 0 indicates a hint that applies to all architectures. Such hints are called “architecture agnostic”.
A hint kind which indicates the type of information that the hint specifies. The supported hint kinds are specified below.
A hint value whose interpretation depends on the hint kind.
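The architecture encoding and the matching rule described above can be sketched in plain C++. This is a minimal illustration, not part of the cuTile API: the helper names arch_code, is_architecture_agnostic, and hint_is_considered are hypothetical, and the formula assumes the usual __CUDA_ARCH__ convention of major * 100 + minor * 10.

```cpp
// Hypothetical helpers for illustration; not part of the cuTile API.

// Numeric encoding of compute capability major.minor, following the
// __CUDA_ARCH__ convention (sm_90 -> 900, sm_100 -> 1000).
constexpr int arch_code(int major, int minor) {
    return major * 100 + minor * 10;
}

// The special architecture value 0 marks an architecture agnostic hint.
constexpr bool is_architecture_agnostic(int hint_arch) {
    return hint_arch == 0;
}

// A hint is considered when it is architecture agnostic or when its
// architecture matches the compilation target exactly.
constexpr bool hint_is_considered(int hint_arch, int target_arch) {
    return is_architecture_agnostic(hint_arch) || hint_arch == target_arch;
}

static_assert(arch_code(9, 0) == 900, "sm_90 is encoded as 900");
static_assert(arch_code(10, 0) == 1000, "sm_100 is encoded as 1000");
static_assert(hint_is_considered(0, arch_code(8, 0)), "agnostic hints always match");
static_assert(!hint_is_considered(900, arch_code(10, 0)), "sm_90 hint is skipped for sm_100");
```

The sketch only models which hints are considered for a target. Whether a considered hint is actually honored remains up to the compiler.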
A tile optimization hint may appertain to one or more C++ constructs. When a hint appertains to a construct, the implementation may use the hint to make optimization decisions related to that construct.
Tile optimization hints are specified using the C++ attribute
cutile::hint in the cutile attribute namespace.
The attribute-argument-clause 1 of a cutile::hint
attribute shall satisfy the following grammar production rules in addition to
those production rules required by the C++ standard 2:
attribute-argument-clause ::= "(" constant-expression "," tile-hint-kv-pair-list ")"
tile-hint-kv-pair-list ::= tile-hint-kv-pair
                         | tile-hint-kv-pair "," tile-hint-kv-pair-list
tile-hint-kv-pair ::= tile-hint-kind "=" constant-expression
tile-hint-kind ::= identifier | keyword
Note
The grammar is unambiguous because constant-expression cannot produce a top-level comma expression.
// Syntax error, '3' is not a tile-hint-kv-pair
[[ cutile::hint(0, latency=4, 3) ]]
// OK '(4, 3)' is an integral constant expression yielding '3'.
[[ cutile::hint(0, latency=(4, 3) )]]
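The behavior of the parenthesized comma expression can be checked in standard C++ without any cuTile headers. The sketch below is only an analogy: second_of is a hypothetical function standing in for a construct whose arguments are separated by top-level commas.

```cpp
// Plain C++ illustration; no cuTile headers are involved.

// Hypothetical stand-in for a construct that takes two comma-separated arguments.
constexpr int second_of(int /*first*/, int second) { return second; }

// Inside parentheses, '4, 3' is a single comma expression whose value is
// its right operand.
static_assert((4, 3) == 3, "comma expression yields its right operand");

// The top-level comma separates the two arguments; the parenthesized comma
// belongs to the second argument's expression.
static_assert(second_of(0, (4, 3)) == 3, "arguments are 0 and (4, 3)");
```

Compilers may warn that the left operand of the comma operator has no effect, which is expected here.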
The first argument of the attribute-argument-clause shall be an integral constant expression whose value is the attribute’s target architecture. For each tile-hint-kv-pair the constant-expression shall be an integral constant expression designating the key-value pair’s value.
The target architecture and the value of each key-value pair may be type or
value dependent. A cutile::hint attribute may participate in a pack
expansion in which case a separate instance of the attribute is created for each
element of the pack.
Example
Zero or more cutile::hint attributes may be generated for the function
entity yielded by a given template instantiation of kernel:
template<size_t> struct ints { };
template<size_t... Archs, size_t... Xs>
[[ cutile::hint(Archs, occupancy = Xs) ... ]]
__tile_global__ void kernel(ints<Archs...>, ints<Xs...>) { }
Each tile-hint-kv-pair of a cutile::hint attribute generates a tile
optimization hint if all of the following hold:
The target architecture of the enclosing attribute is recognized by the implementation. This becomes the generated hint’s architecture.
The tile-hint-kind of the key-value pair is one of the known list of hint kinds specified below. This becomes the generated hint’s kind.
The key-value pair’s value is valid for the associated hint kind and architecture as specified below. This becomes the generated hint’s value.
Note
If a tile-hint-kv-pair does not generate a tile optimization hint, it is ignored.
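The three conditions can be modeled as a small predicate. This is a sketch under stated assumptions, not the compiler's implementation: the function names are hypothetical, the list of recognized architectures is illustrative only, and the value ranges are the ones given in the hint kind descriptions of this document (with true and false modeled as 1 and 0).

```cpp
#include <string_view>

// Hypothetical model of the three conditions; not the compiler's implementation.

// Assumption: this set of recognized architecture encodings is illustrative only.
inline bool is_recognized_arch(int arch) {
    return arch == 0 || arch == 800 || arch == 900 || arch == 1000;
}

// Value ranges taken from the hint kind descriptions in this document.
inline bool is_valid_value(std::string_view kind, int arch, long long value) {
    if (kind == "num_cta_in_cga") {
        if (arch == 800) return value == 1;  // only 1 is applicable for sm_80
        return value == 1 || value == 2 || value == 4 || value == 8 || value == 16;
    }
    if (kind == "occupancy") return value >= 1 && value <= 32;
    if (kind == "latency")   return value >= 1 && value <= 10;
    if (kind == "allow_tma") return value == 0 || value == 1;  // false or true
    return false;  // unknown hint kind
}

// True when the key-value pair generates a hint; otherwise the pair is ignored.
inline bool generates_hint(int arch, std::string_view kind, long long value) {
    return is_recognized_arch(arch) && is_valid_value(kind, arch, value);
}
```

For instance, generates_hint(0, "latency", 4) holds in this model, while a pair with an unknown kind or an out-of-range value such as generates_hint(0, "latency", 11) does not and would simply be ignored.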
A tile optimization hint appertains to a construct if:
For function entities, there exists a cutile::hint attribute which appertains 1 to the function and generates the optimization hint.
For expressions, the expression inhabits an expression-statement, and there exists a cutile::hint attribute which appertains 1 to the enclosing statement and generates the optimization hint.
The tile optimization hint is relevant to this construct according to the rules specific to that hint kind defined below.
Note
If a generated hint does not appertain to any entities, it is ignored.
namespace ct = ::cuda::tiles;
// Ignored, not an expression-statement
[[ cutile::hint(0, latency=4) ]]
auto x = ct::load(y);
// Ignored, not an expression-statement
[[ cutile::hint(0, latency=4) ]]
if (auto x = ct::load(y)) { }
// Ignored, not an expression-statement
[[ cutile::hint(0, latency=4) ]]
return ct::load(y);
If two or more tile hints with the same kind and architecture appertain to the same construct, only the one generated by the tile-hint-kv-pair and attribute that are last in lexical order is considered. All others are ignored.
Example
In each of the three examples below, only the latency=5 hint is applied.
namespace ct = ::cuda::tiles;
// Example 1
[[ cutile::hint(0, latency=4, latency=5) ]]
x = ct::load(y);
// Example 2
[[ cutile::hint(0, latency=4) ]]
[[ cutile::hint(0, latency=5) ]]
x = ct::load(y);
// Example 3
[[ using cutile :
hint(0, latency=4),
hint(0, latency=5)
]]
x = ct::load(y);
If an architecture agnostic hint does not have a supported kind or value for the targeted architecture, the hint is ignored for that architecture.
Example
The hint below is ignored when targeting sm_80 or below, but is active for other architectures.
[[ cutile::hint(0, num_cta_in_cga=2) ]]
__tile_global__ void kernel0() { }
When both an architecture agnostic and an architecture specific hint with the same kind appertain to the same entity, the architecture specific hint takes precedence when the matching architecture is targeted.
Example
In the code below, the latency is specified to be \(4\) for all architectures except sm_100 where it is specified as \(5\).
namespace ct = ::cuda::tiles;
[[ using cutile : hint(0, latency=4), hint(1000, latency=5) ]]
ct::load(x);
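The two resolution rules (the last hint in lexical order wins among hints of the same kind and architecture, and an architecture specific hint takes precedence over an architecture agnostic one) can be sketched as follows. The Hint struct and effective_value function are hypothetical names used only for this illustration and do not correspond to any cuTile type or compiler interface.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of a generated hint; not a cuTile type.
struct Hint {
    int arch;          // 0 means architecture agnostic
    std::string kind;  // e.g. "latency"
    int value;
};

// Returns the effective value of `kind` for `target_arch`, given the hints
// that appertain to one construct in lexical order, or an empty optional
// when no hint of that kind applies.
inline std::optional<int> effective_value(const std::vector<Hint>& hints,
                                          const std::string& kind,
                                          int target_arch) {
    std::optional<int> agnostic;
    std::optional<int> specific;
    for (const Hint& h : hints) {
        if (h.kind != kind) continue;
        // Later entries overwrite earlier ones, so the last one wins.
        if (h.arch == 0) agnostic = h.value;
        else if (h.arch == target_arch) specific = h.value;
    }
    // An architecture specific hint takes precedence over an agnostic one.
    return specific ? specific : agnostic;
}
```

With the hints {0, "latency", 4} and {1000, "latency", 5}, this model yields 5 when targeting 1000 and 4 for any other target, matching the example above.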
Hint Kinds
num_cta_in_cga
The num_cta_in_cga hint kind indicates the recommended number of
cooperative thread arrays (CTAs, i.e. thread blocks) that should be allocated per cooperative grid array (CGA, i.e. thread block cluster)
when launching a kernel. The supported values for this hint are \(1\),
\(2\), \(4\), \(8\), and \(16\). This hint may appertain only to
tile kernel functions.
For a num_cta_in_cga hint with architecture sm_80, the only applicable value is \(1\).
Example
In the following example, the compiler is instructed to prefer \(4\) thread blocks per cluster for sm_90 targets and \(8\) thread blocks per cluster for sm_100 targets when launching the kernel.
[[ using cutile :
hint(900, num_cta_in_cga=4),
hint(1000, num_cta_in_cga=8) ]]
__tile_global__ void kernel1() { }
occupancy
The occupancy hint kind indicates the recommended occupancy when launching a
kernel. The supported values for this hint are the inclusive integer range
\([1, 32]\). This hint may appertain only to tile kernel functions.
Example
In the following example, the compiler is instructed to prefer an occupancy of \(15\) for all architectures except for sm_90, where it will prefer \(22\).
[[ using cutile :
hint(0, occupancy=15),
hint(900, occupancy=22) ]]
__tile_global__ void kernel2() { }
latency
The latency hint kind indicates the relative memory access latency of a
memory operation. The supported values for this hint are the inclusive integer
range \([1, 10]\). This hint may appertain only to direct call expressions
to:
Example
In the following example, a latency hint of \(7\) is specified on all the call expressions.
namespace ct = ::cuda::tiles;
// Example 1
[[ cutile::hint(0, latency=7) ]]
z = ct::load(x) + ct::load(y);
// Example 2
[[ cutile::hint(0, latency=7) ]]
ct::store(ptr, z);
using namespace ct::literals;
auto t = ct::tensor_span{ptr, ct::shape{128_ic, 4_ic}};
auto p = ct::partition_view{t, ct::shape{4_ic, 4_ic}};
ct::tile<int, ct::shape<4, 4>> value;
[[ cutile::hint(0, latency=7) ]]
value = p.load(0, 0);
[[ cutile::hint(0, latency=7) ]]
p.store(value, 0, 0);
allow_tma
The allow_tma hint kind indicates whether the compiler may use the Tensor Memory Accelerator (TMA) when loading from or storing to view memory.
A value of false instructs the compiler to prefer traditional memory operations over TMA.
The supported values for this hint are true and false.
This hint may appertain only to direct call expressions to:
Example
In the following example, the hints instruct the compiler to prefer not to use TMA for the partition view load and store.
namespace ct = ::cuda::tiles;
using namespace ct::literals;
auto t = ct::tensor_span{ptr, ct::shape{128_ic, 4_ic}};
auto p = ct::partition_view{t, ct::shape{4_ic, 4_ic}};
ct::tile<int, ct::shape<4, 4>> value;
[[ cutile::hint(0, allow_tma=false) ]]
value = p.load(0, 0);
[[ cutile::hint(0, allow_tma=false) ]]
p.store(value, 0, 0);