Augment cuFFTDx with LTO

This section shows how the LTO features introduced in cuFFTDx + cuFFT LTO EA can be integrated into existing cuFFTDx projects.

We will cover two use cases:

  • Offline Kernel Generation: A user is optimizing kernels with knowledge of the required FFTs ahead of compilation and would like to avoid any runtime cost by generating kernels offline using NVCC.

  • Online Kernel Generation: A user does not know which FFTs to compute until execution and needs to generate kernels at runtime using NVRTC and nvJitLink.

Use Case I: Offline Kernel Generation

We use code snippets from 09_introduction_lto_example/00_introduction_lto_example to demonstrate how to integrate LTO features into your codebase for the offline kernel generation use case. This example shows how to modify the existing non-LTO sample 02_simple_fft_block/00_simple_fft_block to add LTO support.

The modifications required to add LTO support to 00_simple_fft_block can be divided into three categories:

  1. LTO database creation.

  2. Source code changes.

  3. Build changes.

Note

Here we show how the modifications are done with direct compilation. For CMake users, please refer to cuFFTDx with LTO in your CMake Project.

LTO Database Creation

Before making any changes to the project source code or build, the first modification required to add LTO support involves constructing an LTO database. The database consists of a C++ header file as well as .fatbin and .ltoir files. These files must be included and linked with the project, and therefore must be created before the project is compiled. To generate the database, one can use the helper tool, LTO Helper, provided by the cuFFTDx + cuFFT LTO EA package. The LTO Helper constructs a database using a CSV file of FFT descriptions. Please refer to LTO Helper for an in-depth overview of the tool.

00_introduction_lto_example builds the database using the following steps:

  1. Create a CSV file listing every FFT description that should be included in the database. Below is an example CSV file, 00_introduction_lto_example.csv, that instructs the LTO Helper to produce an LTO database for computing a C2C forward FFT of size 128 (the FFT configuration used in 00_introduction_lto_example).

size,direction,exec_op
128,fft_direction::forward,Block

Note

For the complete list of traits that can be included in the CSV file, please visit LTO Helper. Note that the target CUDA architecture is provided as an argument to the LTO Helper and not as a trait within the CSV file.

  2. Compile the LTO Helper. The cuFFT library is listed after the source file so that linkers that resolve symbols in command-line order can find it.

g++ -I<cufft_include_path> \
    -I<cuda_include_path> \
    cufftdx_cufft_lto_helper.cpp -o cufftdx_cufft_lto_helper \
    -L<cufft_lib_path> -lcufft

  3. Run the LTO Helper, providing the output directory, the CSV file, and the target CUDA architectures, to write the LTO database fatbin(s), ltoir(s), and C++ header file to disk. The CUDA architectures argument, --CUDA_ARCHITECTURES, is optional; if omitted, the LTO database is built for all supported architectures.

./cufftdx_cufft_lto_helper <output_dir_path> \
                           00_introduction_lto_example.csv \
                           --CUDA_ARCHITECTURES="XX;YY;..."
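
When the list of required FFTs comes from a script or a build step, the CSV file can be generated instead of written by hand. The following is a minimal sketch of that idea. `FftRow` and `make_lto_csv` are hypothetical names that are not part of cuFFTDx, and only the three columns shown in the example CSV above are assumed.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical row type: one FFT description per CSV line.
struct FftRow {
    unsigned    size;       // FFT size, e.g. 128
    std::string direction;  // e.g. "fft_direction::forward"
    std::string exec_op;    // e.g. "Block"
};

// Builds CSV text with the same header and column order as
// 00_introduction_lto_example.csv.
std::string make_lto_csv(const std::vector<FftRow>& rows) {
    std::string csv = "size,direction,exec_op\n";
    for (const FftRow& r : rows) {
        assert(r.size > 0 && "FFT size must be positive");
        csv += std::to_string(r.size) + "," + r.direction + "," + r.exec_op + "\n";
    }
    return csv;
}
```

Writing the returned string to a file and passing that file to the LTO Helper would reproduce step 1 for an arbitrary list of FFTs. Any additional trait columns should be taken from the LTO Helper documentation rather than from this sketch.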

Source Code Changes

The source code changes necessary to add LTO support to an existing project are minimal. Two adjustments are needed:

  1. Add an include statement for the database header file created by the LTO Helper.

#include "/path/to/lto_database.hpp.inc"

  2. Modify the FFT description type to include the LTOIR codetype.

using FFT_without_code_type = decltype(cufftdx::Block() +
                                       cufftdx::Size<128>() +
                                       cufftdx::Type<cufftdx::fft_type::c2c>() +
                                       cufftdx::Direction<cufftdx::fft_direction::forward>() +
                                       cufftdx::Precision<float>() +
                                       cufftdx::ElementsPerThread<8>() +
                                       cufftdx::FFTsPerBlock<2>() +
                                       cufftdx::SM<Arch>());
using FFT = decltype(FFT_without_code_type() + cufftdx::experimental::CodeType<cufftdx::experimental::code_type::ltoir>());

Note

If the FFT traits do not match an entry in the LTO database, or are not supported for the targeted architecture, a non-LTO implementation may be chosen instead. To validate that LTO device code has been selected from the database, the following static assert can be used:

static_assert(FFT::code == cufftdx::experimental::code_type::ltoir, "Selected implementation code type is not LTOIR.");
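
Because `FFT::code` is a compile-time trait, it can also drive a softer check than a hard build failure, for example to log which implementation was selected. The sketch below illustrates the pattern with stand-in types only: the `mock` namespace, its `non_lto` enumerator, and the helper names are assumptions made for this example and are not cuFFTDx definitions. Real code would use `cufftdx::experimental::code_type` and an actual FFT description type.

```cpp
#include <cassert>
#include <string>

// Stand-ins for illustration only (not cuFFTDx types).
namespace mock {
enum class code_type { non_lto, ltoir };  // "non_lto" is a placeholder name

struct FftFromDatabase { static constexpr code_type code = code_type::ltoir; };
struct FftFallback     { static constexpr code_type code = code_type::non_lto; };
}  // namespace mock

// Compile-time query of the same trait the static_assert inspects.
template <class FFT>
constexpr bool uses_ltoir() {
    return FFT::code == mock::code_type::ltoir;
}

// Returns a label a host program could log before launching the kernel.
// Requires C++17 for `if constexpr`.
template <class FFT>
std::string implementation_label() {
    if constexpr (uses_ltoir<FFT>()) {
        return "ltoir";
    } else {
        return "non-LTO fallback";
    }
}
```

The static assert remains the stricter choice when a silent fallback would be unacceptable.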

Build Changes

The following is a list of steps for building and linking user code with the generated LTO database.

  1. Compile the .cu file containing the kernel(s) that use cuFFTDx operations into LTOIR. The -dc flag tells nvcc to compile to an object containing relocatable device code, while the --generate-code arch=compute_75,code=lto_75 flag tells nvcc to generate LTOIR for the desired architecture(s). The -I flag for the database must name the directory that contains lto_database.hpp.inc, not the file itself.

nvcc -std=c++17 \
     -dc \
     --generate-code arch=compute_75,code=lto_75 \
     -I<cufftdx_include_path> \
     -I/path/to \
     00_introduction_lto_example.cu -o 00_introduction_lto_example.o

  2. Device link the .o file from step 1 with the LTO database fatbin(s) and ltoir(s), producing a new object containing the linked device code. Note that the --generate-code flag provided in this command uses code=sm_75, not code=lto_75, as this command specifies the output architecture produced when linking all input LTOIR.

nvcc -dlto \
     -dlink \
     --generate-code arch=compute_75,code=sm_75 \
     /path/to/database_X.fatbin ... \
     /path/to/database_X.ltoir ... \
     00_introduction_lto_example.o -o 00_introduction_lto_example_dlink.o

  3. Host link the object files from step 1 and step 2 to produce the final executable. The CUDA runtime library is listed after the object files so that linkers that resolve symbols in command-line order can find it.

g++ 00_introduction_lto_example_dlink.o 00_introduction_lto_example.o \
    -L<cuda_lib64_path> -lcudart \
    -o 00_introduction_lto_example
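
The only difference between the --generate-code values in steps 1 and 2 is the code= target: lto_XX when compiling, sm_XX when device linking. A build script that supports several architectures could derive both from one number, as in the sketch below. `GencodeFlags` and `gencode_flags` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the pair of --generate-code values used in steps 1 and 2.
struct GencodeFlags {
    std::string compile;  // step 1: emit LTOIR
    std::string dlink;    // step 2: emit device code when linking the LTOIR inputs
};

// cc is the compute capability written as in the commands above, e.g. 75.
GencodeFlags gencode_flags(int cc) {
    assert(cc > 0 && "expected a compute capability such as 75");
    const std::string n = std::to_string(cc);
    return {
        "--generate-code arch=compute_" + n + ",code=lto_" + n,
        "--generate-code arch=compute_" + n + ",code=sm_" + n,
    };
}
```

For example, `gencode_flags(75).compile` yields the flag used in step 1 and `gencode_flags(75).dlink` the flag used in step 2.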

Use Case II: Online Kernel Generation

We use code snippets from 04_nvrtc_fft/03_nvrtc_fft_block_lto to demonstrate how to integrate LTO features into your codebase for the online kernel generation use case. This example shows how to modify the existing non-LTO NVRTC sample 04_nvrtc_fft/02_nvrtc_fft_thread to add LTO support.

The modifications required to add LTO support to 04_nvrtc_fft/02_nvrtc_fft_thread can be divided into two categories:

  1. Source code changes.

  2. Build changes.

Source Code Changes

In contrast to the offline kernel generation use case, most of the changes in online kernel generation happen in the source code:

  1. LTO Database Creation.

    Instead of writing the LTO database (the C++ header and the LTOIRs) to disk, one can use the cufftdx::utils::get_database_and_ltoir function to get pointers to the C++ header and the LTOIRs. For more details about the function, please refer to (online) LTO Database Creation.

    auto [lto_db, ltoirs, block_dim, shared_memory_size] =
       cufftdx::utils::get_database_and_ltoir(
          fft_size,
          CUFFT_DESC_INVERSE,
          CUFFT_DESC_C2C,
          example::nvrtc::get_device_architecture(current_device) * 10,
          CUFFT_DESC_BLOCK,
          CUFFT_DESC_DOUBLE,
          CUFFT_DESC_NORMAL,
          fft_ept
       );
    

    Note

    The function is a wrapper around the cuFFT Device API (see cuFFT Device API Reference). To use it, one needs to:

    1. Define the macro CUFFTDX_ENABLE_CUFFT_DEPENDENCY.

    2. Link against the cuFFT library.

  2. CUDA Source Code Changes.

    The changes here are the same as in the offline kernel generation use case. We need to:

    1. Include the database header (lto_db).

    2. Modify the FFT description type to include the LTOIR codetype.

    const char* test_kernel = R"kernel(
    using namespace cufftdx;
    
    // FFT Operators
    using size_desc = Size<FFT_SIZE>;
    using dir_desc  = Direction<fft_direction::inverse>;
    using type_c2c  = Type<fft_type::c2c>;
    using FFT_without_code_type = decltype(Block() + size_desc() + dir_desc() + type_c2c() + Precision<double>() + SM<FFT_SM>() + ElementsPerThread<FFT_EPT>());
    using FFT = decltype(FFT_without_code_type() + experimental::CodeType<experimental::code_type::ltoir>());
    
    // ...
    )kernel";
    
    int main()
    {
       std::string dx_source_code;
       dx_source_code.append("#include <cufftdx.hpp>\n");
       dx_source_code.append(lto_db); // Include LTO database header
       dx_source_code.append(test_kernel);
       // ...
    }
    
  3. Compile the CUDA Source Code to LTOIR using NVRTC.

    The changes boil down to adding the -dlto and --relocatable-device-code=true flags to the NVRTC compilation options.

  4. Link the LTOIRs, generated online by NVRTC and dumped by cuFFT, using nvJitLink.

    nvJitLinkAddData(handle, NVJITLINK_INPUT_ANY, ltoir.data(), ltoir_size, "nvrtc_ltoir");
    for (unsigned i = 0; i < ltoirs.size(); i++) {
       nvJitLinkAddData(handle, NVJITLINK_INPUT_ANY, ltoirs[i].data(), ltoirs[i].size(), "cufft_generated_ltoir");
    }
    

    Hint

    To add the LTOIR data to nvJitLink, one can simply use NVJITLINK_INPUT_ANY as the input type. There is then no need to differentiate between the LTOIRs and fatbinaries returned by the cuFFT library (see cufftDeviceCodeType).
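
Items 2 and 3 of the list above amount to string handling on the host: concatenating the source and collecting the NVRTC options. The sketch below shows one way to do that. `assemble_source` and `nvrtc_lto_options` are hypothetical helpers. The FFT_SIZE, FFT_EPT, and FFT_SM macro names come from the kernel string above, but passing them with -D, and the --std and --gpu-architecture options, are assumptions of this sketch rather than a description of the sample.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Concatenates the pieces in the order used above: cuFFTDx include,
// database header text, then the kernel source.
std::string assemble_source(const std::string& lto_db, const std::string& kernel) {
    std::string src = "#include <cufftdx.hpp>\n";
    src += lto_db;
    src += kernel;
    return src;
}

// Builds NVRTC option strings for an LTO compile.
// cc is the compute capability as a two-digit number (e.g. 75).
std::vector<std::string> nvrtc_lto_options(int cc, unsigned fft_size, unsigned fft_ept) {
    assert(cc > 0 && fft_size > 0 && fft_ept > 0);
    return {
        "--std=c++17",
        "--gpu-architecture=compute_" + std::to_string(cc),
        "-dlto",                           // emit LTOIR instead of PTX
        "--relocatable-device-code=true",  // required alongside -dlto
        "-DFFT_SIZE=" + std::to_string(fft_size),
        "-DFFT_EPT=" + std::to_string(fft_ept),
        "-DFFT_SM=" + std::to_string(cc * 10),  // same `* 10` scaling as in item 1
    };
}
```

NVRTC takes its options as an array of C strings, so each entry would be passed through `c_str()` when the program is compiled.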

Build Changes

Since CUDA code compilation and linking are done at runtime, the build changes are minimal. Two adjustments are needed:

  • Link with the two extra dependencies, the cuFFT and the nvJitLink libraries, and

  • Define the CUFFTDX_ENABLE_CUFFT_DEPENDENCY macro for using the cufftdx::utils::get_database_and_ltoir function.

nvcc -DCUDA_INCLUDE_DIR=\"<cuda_include_path>\" \
     -DCUFFTDX_INCLUDE_DIRS=\"<cufftdx_include_path>\" \
     -DCOMMONDX_INCLUDE_DIR=\"<cufftdx_include_path>\" \
     -DCUFFTDX_ENABLE_CUFFT_DEPENDENCY \
     -L<cufft_lib_path> -I<cufft_include_path> \
     -I<cufftdx_include_path> \
     -lcufft \
     -lnvJitLink \
     -lnvrtc \
     -lcuda \
     03_nvrtc_fft_block_lto.cu -o 03_nvrtc_fft_block_lto
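
The -D flags in the command above define string macros holding include directories. A plausible use for them, assumed here rather than taken from the sample, is to turn them into NVRTC include-path options at runtime. In the sketch below, `include_path_options` is a hypothetical helper, the fallback values are placeholders so the snippet compiles without the flags, and each macro is treated as a single directory.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Placeholder fallbacks; a real build supplies these with -D as shown above.
#ifndef CUDA_INCLUDE_DIR
#define CUDA_INCLUDE_DIR "/placeholder/cuda/include"
#endif
#ifndef CUFFTDX_INCLUDE_DIRS
#define CUFFTDX_INCLUDE_DIRS "/placeholder/cufftdx/include"
#endif
#ifndef COMMONDX_INCLUDE_DIR
#define COMMONDX_INCLUDE_DIR "/placeholder/cufftdx/include"
#endif

// Converts the macro values into NVRTC --include-path options.
std::vector<std::string> include_path_options() {
    const std::vector<std::string> dirs = {
        CUDA_INCLUDE_DIR, CUFFTDX_INCLUDE_DIRS, COMMONDX_INCLUDE_DIR};
    std::vector<std::string> opts;
    for (const std::string& d : dirs) {
        assert(!d.empty() && "include directory macro must not be empty");
        opts.push_back("--include-path=" + d);
    }
    return opts;
}
```

These options would be appended to the other NVRTC compile options so that the runtime-compiled source can find the CUDA and cuFFTDx headers.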