Link-time Optimization
Link-time Optimization
Link-time Optimization
JIT LTO (Just-In-Time Link-Time Optimization) is a CUDA compilation strategy that enables dynamic kernel compilation and linking at runtime. Instead of pre-compiling all possible kernel variants (which would result in an explosion of binary size), JIT LTO compiles kernel fragments separately and links them together on-demand when a specific kernel configuration is needed.
A fragment is a self-contained, compilable unit of CUDA code that can be linked with other fragments to form a complete kernel. In the JIT LTO system:
__global__ kernel function.The key advantage is that device functions can be compiled independently and reused across multiple kernel entrypoints, reducing compilation time and binary size.
Let’s walk through creating a JIT LTO kernel system for a search kernel with templated device functions.
We start with a kernel that has templated device functions that we want to separate into fragments:
search_kernel.cuh:
We’ll create separate header files for each device function variant. Each implements the generic function signature that the kernel expects:
compute_distance_euclidean.cuh:
compute_distance_inner_product.cuh:
filter_none.cuh:
filter_bitset.cuh:
JSON matrix files define all the parameter combinations that need to be compiled. The build system uses these to generate .cu files from .cu.in templates.
How JSON Cross-Product Works:
null, but only strings should be used, even for numbers, for example "1"._ to indicate that it is used only for grouping._._ prefix or not) is not followed.For example, if you have:
This generates 2 × 2 × 2 = 8 combinations:
{data_type: "float", idx_type: "uint32_t", capacity: "1"}{data_type: "float", idx_type: "uint32_t", capacity: "2"}{data_type: "float", idx_type: "int64_t", capacity: "1"}When a group contains nested arrays (like veclen: ["1", "4"]), those are also expanded within that group before the cross-product is computed.
compute_distance_matrix.jsonfilter_matrix.jsonsearch_kernel_matrix.jsonThis example demonstrates conditional combinations: OutT can be float or double when T is float, but only float when T is __half.
This generates 24 combinations (3 data/output type combinations × 2 index types × 4 optimized/veclen combinations):
float + float + uint32_t + optimized + veclen=1float + float + uint32_t + optimized + veclen=4float + float + uint32_t + standard + veclen=8float + float + uint32_t + standard + veclen=16float + double + uint32_t + optimized + veclen=1float + double + uint32_t + optimized + veclen=4float + double + uint32_t + standard + veclen=8float + double + uint32_t + standard + veclen=16__half + float + uint32_t + optimized + veclen=1__half + float + uint32_t + optimized + veclen=4__half + float + uint32_t + standard + veclen=8__half + float + uint32_t + standard + veclen=16int64_t (total: 24 combinations).cu.in Template FilesThe .cu.in files are templates that get instantiated for each combination in the JSON matrix. They contain explicit template instantiations.
compute_distance_kernel.cu.infilter_kernel.cu.insearch_kernel.cuh with Extern DeclarationsThe kernel header needs to declare generic extern device functions so the kernel code can call them. The specific implementations will be linked from fragments at runtime:
search_kernel.cuh:
search_kernel.cu.inThe .cu.in file only contains the explicit template instantiation:
Note: The kernel uses generic function templates (compute_distance<T> and apply_filter<IdxT>) that are resolved at link time. The specific implementations (euclidean vs inner_product, filter_none vs filter_bitset) are provided by the fragments that get linked together.
Fragment tags register the compiled fatbins so they can be loaded at runtime. They are used to help the linker find and include the relevant fatbins at build time. When calling generate_jit_lto_kernels(), we pass a FRAGMENT_TAG_FORMAT argument, which constructs the tag type from the given placeholders, and a FRAGMENT_TAG_HEADER_FILES argument, which specifies one or more header files that the fragment tags come from. The JIT+LTO system will then automatically generate and compile a .cpp file that registers the fragment using the provided tag.
Important: When requesting fragments from the AlgorithmPlanner, we use tags (like tag_f, tag_h) instead of real types (like float, __half) in the add_static_fragment template parameters. This avoids including heavy headers that define the actual types, significantly improving compilation times. The tags are lightweight empty structs that serve only as compile-time identifiers.
registration_tags.hpp
The planner is responsible for:
CRITICAL: The fragment keys constructed in the planner methods must match EXACTLY with the keys used in the corresponding FRAGMENT_TAG_FORMAT argument. Any mismatch will result in runtime linking failures.
search_planner.hpp:
Now we integrate the planner into the actual search function:
search_jit.cuh:
compute_distance and apply_filterWhat you’re building. The same search kernel as Steps 1–7 still calls compute_distance / apply_filter, but for a UDF build those symbols are not taken from prebuilt matrix fatbins: you compile a small NVRTC program per hook at runtime and register it with the planner so LTO links it next to the entry fragment.
How the pieces connect (arrows read left to right):
1. Shared header — forward declarations
The entry TU matches Step 1–7: templates are declared here and defined elsewhere at link time.
2. NVRTC source — macros and string factories
Use function-like macros so you edit only the { ... } body; the preprocessor still emits a real __device__ template for NVCC, and NAME_udf() / NAME_filter_udf() build the CUDA text NVRTC compiles. The distance macro also emits compute_distance_udf_impl calling NAME_distance; host-side instantiate_compute_distance_udf only appends the forwarding compute_distance plus its explicit instantiation (same idea as instantiate_apply_filter_udf for apply_filter<IdxT>).
Macro definitions (shared header; include before the invocations):
Invocations (file scope — each call expands the macro once):
Avoid raw " inside #BODY unless you splice that part with raw-string concatenation around #BODY.
3. Host — type_name, glue, compile, register
Each full NVRTC program is one string: *_udf() plus instantiate_* output. type_name<U>() must return the exact token the entry TU uses (e.g. float, uint32_t).
4. Planner — extend Step 7 for UDF vs static
Step 7 used only static fragments. Add #include <string>, extend enums/tags/get_*_tag as below, keep UDF glue in the same TU as search_jit, then swap the two unconditional add_compute_distance_device_function / add_filter_device_function calls for this block:
Use DistanceType::MetricUdf / FilterType::FilterUdf only when you want the NVRTC branches; otherwise keep Euclidean / None for the original static path.
Pitfalls and constraints
- Do not register the same hook through both UDF APIs (
add_metric_udf_fragment/add_filter_udf_fragment) and the Step 6 static helpers (add_compute_distance_function/add_filter_function): they pull different fatbins and you will duplicate device definitions.- The NVRTC program must define every template the entry calls and emit matching
template __device__ ...explicit instantiations for each concrete specialization (e.g.compute_distance<float>,apply_filter<uint32_t>). Prefer small host helpers (instantiate_*+type_name) for type spellings instead of hard-coding index types inside macro strings.- One NVRTC compile per logical TU; do not concatenate unrelated UDFs into one program string.
Fragment tags uniquely identify fragments. They’re simple lightweight types that are passed as the
sole template parameter to StaticFatbinFragmentEntry:
Fragment tags may themselves take template parameters in order to uniquely identify them. Typically, one fragment tag template will correspond to a single function, and a fragment tag template specialization will correspond to a function specialization.
When a fatbin is compiled and embedded in C++ code, a translation unit specializes StaticFatbinFragmentEntry
to specify its data and length static fields:
Then, an AlgorithmPlanner can call add_static_fragment() with the fragment tag (NOT the StaticFatbinFragmentEntry
specialization) as the sole template parameter:
At build time, the linker takes care of finding and including the static fragments that have been specified by the algorithm planner.
Registration tags are type-safe identifiers used to organize fragments. They’re typically empty structs:
These tags are used in registerAlgorithm<>() to create a hierarchical organization of fragments.
The AlgorithmLauncher is the runtime handle for a linked kernel. It:
cudaKernel_t handle to the linked kernelcall() and call_cooperative() methods to launch the kernelcudaLibrary_t that contains the kernelMinimize Includes: JIT LTO fragments should have minimal includes, especially avoiding host-side headers. Extract device-only code into separate headers.
Fragment Granularity: Balance between too many small fragments (overhead) and too few large fragments (less reuse). Device functions that are reused across multiple kernels are good candidates for separate fragments.
Naming Consistency: Ensure fragment tags match exactly between registration and lookup. Use helper functions to construct tags consistently.
Type Safety: Use registration tags to provide compile-time type safety and avoid runtime string mismatches.
Caching: Each planner type should hold a static LauncherJitCache and pass it to AlgorithmPlanner; get_launcher() then reuses linked kernels for the same fragment key within that cache.
IVF Flat uses JIT LTO with:
Total: 341 fatbins that can be combined into many more kernel variants at runtime.
To integrate JIT LTO kernels into the CMake build system, add calls to generate_jit_lto_kernels() in your main CMakeLists.txt file (typically in cpp/CMakeLists.txt).
The generate_jit_lto_kernels() function (defined in cmake/modules/generate_jit_lto_kernels.cmake) takes:
NAME_FORMAT: Format string for generated kernel names (using @variable@ syntax)MATRIX_JSON_FILE: Path to the JSON matrix fileKERNEL_INPUT_FILE: Path to the .cu.in templateFRAGMENT_TAG_FORMAT: Format string for fragment tag type (using @variable@ syntax)FRAGMENT_TAG_HEADER_FILES: List of header files that provide the fragment tag types (can be enclosed in </> or "/", automatically enclosed in quotes if quotes and brackets are not provided)OUTPUT_DIRECTORY: Where generated files are placedKERNEL_LINK_LIBRARIES: Interface library with compilation settingsCall generate_jit_lto_kernels() once for each fragment type (compute_distance, filter, search_kernel, etc.). The function reads the JSON matrix, computes the cross-product of all combinations, generates .cu and .cpp files from the templates, compiles them into fatbins, and returns a list of generated source files that should be added to your JIT LTO library target.
See the CUVS cpp/CMakeLists.txt file for a complete example of how to set up the interface library, call generate_jit_lto_kernels() for each fragment type, and create the final library target.
JIT LTO enables:
The process involves:
.cu.in templates for explicit instantiationsJIT LTO kernels are split into fragments, which are fatbins containing individual pieces of code that can be strung together rather than instantiating the whole kernel at once. Each fragment only needs to be multiplied out over the dimensions (template parameters) that the fragment itself contains rather than the kernel as a whole. At runtime, these fragments are combined together by nvjitlink into the final program.
In JIT LTO, there are two kinds of code: algorithms and adapters. Algorithms are, roughly speaking, code that actually “does stuff” - searching, sorting, even as simple as initializing variables. Adapters don’t do anything by themselves, but are merely thin wrappers around algorithms that exist only for reducing the number of template parameters that the caller needs to know about. It should generally be assumed that algorithm code is expensive to multiply over a matrix, and thus such multiplication should be minimized, while adapter code is cheap to multiply.
An algorithm function is a function that contains real code for the algorithm, and an adapter function merely calls an algorithm function with more template parameters than the adapter function itself has. An algorithm file contains algorithm code, and an adapter file contains adapter code.
Here is an example of an algorithm file that contains an algorithm function:
Here is an example of an adapter file that contains an adapter function:
This is the most common pattern that you will see in NVIDIA cuVS’s JIT LTO code. Note that any code that calls is_divisible() does not
need to know the value of Divisor, which allows the caller to be multiplied over fewer dimensions, thus reducing the amount of code
generated.
Note that in the above adapter file, @data_type@ and @divisor@ are build-time substitutions performed by CMake. These
substitutions will be filled in with values from the matrix product. Note that they are all grouped together in a single namespace,
making it easy to find all substitutions. This should be preferred to sprinkling the substitutions throughout the code.
Here is an example with two algorithm files:
And here is the accompanying adapter file:
This is another common pattern that you will see in NVIDIA cuVS JIT LTO. Note that the adapter file does not contain any adapter functions, but merely instantiates a different algorithm function based on which algorithm file is included based on the CMake substitution.
When a piece of algorithm code is used in multiple kernels, it should be split into its own shared fragment. At this point, it becomes important to also distinguish algorithm fragments and adapter fragments. An algorithm fragment contains an algorithm function that exposes all of the relevant template parameters, and this fragment is shared between multiple kernels. An adapter fragment is specific to a kernel. If a kernel wishes to invoke the same shared algorithm multiple times in the same run with different template parameters, it can employ multiple adapter fragments to accomplish this. Consider the following header file:
And the following adapter files:
And the following algorithm file:
Note that filter_first_pass and filter_second_pass both invoke one of the filter functions, but which one they invoke is
decided independently for each. Also note that neither of the adapter fragments contains the underlying algorithm code, but rather
links against the corresponding shared algorithm fragments.
The key to minimizing code generation is to minimize the number of dimensions that any given fragment needs to be multiplied out over. If a section of algorithm code uses lots of template parameters, try to separate out sections that use only a subset of these parameters, put them into their own fragment, and remove the corresponding template parameters from the caller. Make judicious use of adapter code to accomplish this. An adapter function should only have the template parameters that appear in its signature, whereas an algorithm function should have all of the template parameters that appear in its signature or its implementation.
Unoptimized algorithm:
Note that the algorithm includes the Comparand template parameter, which means the entire algorithm has to be multiplied out over
all the possible values of this parameter.
Optimized algorithm:
We are now using an adapter function (possibly inside an adapter fragment) called filter_less_than to invoke
filter_less_than_impl (which may be inside a shared algorithm fragment). This allows us to hide the Comparand parameter
from find_first, which means we no longer need to multiply the entire algorithm over all possible values of Comparand, only the
filter_less_than adapter and algorithm.