Other Methods

Frontend to Backend Traits Conversion

cufftdx::utils::frontend_to_backend(...) converts a frontend FFT description into the effective backend traits used to query the cuFFTDx databases.

In combination with the cuFFT Device API, this function can be used to generate the LTO database, which contains both the device function code and the metadata (a C++ header file) for a specified FFT operation. See Custom LTO Helper for an example.

#include "cufftdx/utils.hpp"

namespace cufftdx {
   namespace utils {
      struct backend_impl_traits {
         unsigned int  size;
         fft_type      type;
         fft_direction direction;
         unsigned int  sm;
         unsigned int  elements_per_thread;
         unsigned int  min_elements_per_thread;
      };

      enum class algorithm {
         ct,
         bluestein,
      };

      enum class execution_type {
         thread,
         block,
      };

      // Returns the effective backend traits for the given frontend description.
      backend_impl_traits
         frontend_to_backend(
            algorithm algo,
            execution_type exec_type,
            unsigned int fft_size
                /* size_of<FFT>::value */,
            fft_type type
                /* type_of<FFT>::value */,
            fft_direction direction
                /* direction_of<FFT>::value */,
            unsigned int sm
                /* sm_of<FFT>::value */,
            real_mode real_mode
                /* real_mode_of<FFT>::value */,
            unsigned int elements_per_thread
                /* elements_per_thread_of<FFT>::value or
                 * 0 if not set */,
            unsigned int block_dim_x
                /* block_dim_of<FFT>::x or 0 if not set */,
            experimental::code_type code_type
                /* experimental::code_type_of<FFT>::value */
         );
   } // namespace utils
} // namespace cufftdx

(online) LTO Database Creation

cufftdx::utils::get_database_and_ltoir() returns a tuple containing:

  • Database string (std::string) to be inserted into the cuFFTDx headers.

  • Vector of LTOIRs (std::vector<std::vector<char>>) for building device functions for the specified FFT operation.

  • Required CUDA block dimensions (Dim3) for executing the FFT operation.

  • Required shared memory size (unsigned int, in bytes) for executing the FFT operation.

std::tuple<std::string, std::vector<std::vector<char>>, Dim3,
           unsigned int>
cufftdx::utils::get_database_and_ltoir(
    unsigned int                    fft_size,
    cufftdx::fft_direction          dir,
    cufftdx::fft_type               type,
    unsigned int                    sm,
    cufftdx::detail::execution_type execution,
    cufftdx::precision              prec =
        cufftdx::precision::f32,
    cufftdx::complex_layout         layout =
        cufftdx::complex_layout::natural,
    cufftdx::real_mode              rmode =
        cufftdx::real_mode::normal,
    unsigned int                    fft_ept = 0
        /* use heuristic */,
    unsigned int                    ffts_per_block = 1
        /* 0: use suggested ffts_per_block */);

Example

// Assuming the following FFT operator is defined in the NVRTC-compiled code:
// using FFT = decltype(cufftdx::Block() +
//                      cufftdx::Size<128>() +
//                      cufftdx::Type<cufftdx::fft_type::c2c>() +
//                      cufftdx::Direction<cufftdx::fft_direction::forward>() +
//                      cufftdx::Precision<float>() +
//                      cufftdx::ElementsPerThread<8>() +
//                      cufftdx::FFTsPerBlock<2>() +
//                      cufftdx::SM<700>());

// You can get the database string and LTOIRs for the FFT operation by calling:
auto [lto_db, ltoirs, block_dim, sm_size] =
   cufftdx::utils::get_database_and_ltoir(128,
                                          cufftdx::fft_direction::forward,
                                          cufftdx::fft_type::c2c,
                                          700,
                                          cufftdx::detail::execution_type::block,
                                          cufftdx::precision::f32,
                                          cufftdx::complex_layout::natural,
                                          cufftdx::real_mode::normal,
                                          8,
                                          2);

After obtaining the database string and LTOIRs, you can insert the database string into the cuFFTDx header file and link the LTOIRs to the user code as shown in Use Case II: Online Kernel Generation.
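The returned tuple can be unpacked with structured bindings before its parts are handed on. The following is a minimal sketch of that step only, and it does not call cuFFT: `Dim3` here is a local stand-in type, and `make_dummy_lto_result`, `lto_summary`, and `summarize` are hypothetical helpers that return or inspect dummy data of the same shape. In real code the tuple would come from cufftdx::utils::get_database_and_ltoir().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the Dim3 type named in the return tuple.
struct Dim3 {
    unsigned int x, y, z;
};

// Same shape as the tuple returned by get_database_and_ltoir().
using lto_result = std::tuple<std::string,
                              std::vector<std::vector<char>>,
                              Dim3,
                              unsigned int>;

// Hypothetical stub returning dummy data so the sketch runs without cuFFT;
// real code would call cufftdx::utils::get_database_and_ltoir() instead.
inline lto_result make_dummy_lto_result() {
    std::vector<std::vector<char>> ltoirs = {std::vector<char>(16, 'a'),
                                             std::vector<char>(32, 'b')};
    return lto_result{std::string("/* dummy database entries */"),
                      ltoirs,
                      Dim3{16, 2, 1},
                      2048u};
}

// Launch-relevant facts extracted from the tuple (hypothetical helper type).
struct lto_summary {
    std::size_t  blob_count;
    std::size_t  total_ltoir_bytes;
    unsigned int threads_per_block;
    unsigned int shared_memory_bytes;
};

// Unpacks the tuple and collects what a caller needs for linking and launch.
inline lto_summary summarize(const lto_result& result) {
    const auto& [database, ltoirs, block_dim, shared_memory_bytes] = result;
    assert(!database.empty()); // an empty database string cannot be inserted

    std::size_t total = 0;
    for (const auto& blob : ltoirs) {
        total += blob.size(); // each blob is one LTOIR to pass to the linker
    }
    return {ltoirs.size(),
            total,
            block_dim.x * block_dim.y * block_dim.z,
            shared_memory_bytes};
}
```

Calling `summarize(make_dummy_lto_result())` reports two blobs and the launch parameters carried by the dummy tuple. With a real result, the same fields give the block dimensions and dynamic shared memory size to use at kernel launch.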

Note

The cufftdx::utils::get_database_and_ltoir() function is a wrapper around cuFFT Device APIs (see cuFFT Device API Reference). To use this function:

  1. Define CUFFTDX_ENABLE_CUFFT_DEPENDENCY.

  2. Link against the cuFFT library.

Shared Memory Compute For Dynamic Batching

cufftdx::experimental::utils::get_shared_memory_size_for_dynamic_batching<FFT>(const unsigned int ffts_per_block) computes the total shared memory size, in bytes, required to compute multiple FFTs per block with the DynamicBatching operator.

template<class FFT>
constexpr unsigned int
get_shared_memory_size_for_dynamic_batching(
    const unsigned int ffts_per_block);

This function is useful when you need to determine shared memory requirements at runtime, particularly when the number of FFTs per block varies dynamically.

Parameters:

  • FFT (template parameter) - FFT description that includes the DynamicBatching operator; FFT::shared_memory_size and FFT::implicit_type_batching are taken from it

  • ffts_per_block - Number of FFTs to compute per block (user-defined, can be a runtime value)

cufftdx::experimental::utils::get_shared_memory_size_for_dynamic_batching(const unsigned int shared_memory_size_per_fft, const unsigned int ffts_per_block, const unsigned int implicit_type_batching) provides the same functionality at runtime, without requiring an FFT description.

constexpr unsigned int
get_shared_memory_size_for_dynamic_batching(
    const unsigned int shared_memory_size_per_fft,
    const unsigned int ffts_per_block,
    const unsigned int implicit_type_batching);

Parameters:

  • shared_memory_size_per_fft - Shared memory size required for a single FFT (obtained from FFT::shared_memory_size)

  • ffts_per_block - Number of FFTs to compute per block (user-defined, can be a runtime value)

  • implicit_type_batching - Number of FFTs batched per value type (obtained from FFT::implicit_type_batching)

Example

using FFT = decltype(Size<128>() + Precision<float>() +
               Type<fft_type::c2c>() +
               Direction<fft_direction::forward>() +
               ElementsPerThread<8>() + DynamicBatching());

// Query compile-time traits
constexpr auto shared_memory_size_per_fft =
    FFT::shared_memory_size;
constexpr auto implicit_type_batching =
    FFT::implicit_type_batching;

// Runtime value chosen by user
unsigned int ffts_per_block = 8;

// Compute total shared memory requirement
auto total_shared_memory_size_bytes =
    cufftdx::experimental::utils::
        get_shared_memory_size_for_dynamic_batching(
            shared_memory_size_per_fft,
            ffts_per_block,
            implicit_type_batching);

auto total_shared_memory_size_bytes_with_desc =
    cufftdx::experimental::utils::
        get_shared_memory_size_for_dynamic_batching<FFT>(
            ffts_per_block);
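Because ffts_per_block can be a runtime value, one possible use of these functions is to pick the largest batch whose shared memory requirement still fits a given budget. The sketch below is an illustration under stated assumptions, not part of cuFFTDx: `largest_ffts_per_block_within` is a hypothetical helper, it assumes the required size never decreases as ffts_per_block grows, and `placeholder_size_bytes` is a made-up size model used only so the code runs without the library. In real code, the callable would wrap get_shared_memory_size_for_dynamic_batching.

```cpp
#include <cassert>

// Placeholder size model for illustration only; NOT the library's formula.
// Real code would instead pass a lambda that calls
// cufftdx::experimental::utils::get_shared_memory_size_for_dynamic_batching(
//     shared_memory_size_per_fft, n, implicit_type_batching).
inline unsigned int placeholder_size_bytes(unsigned int ffts_per_block) {
    return ffts_per_block * 1024u;
}

// Hypothetical helper: returns the largest ffts_per_block in
// [1, max_ffts_per_block] whose size fits within budget_bytes, or 0 if even
// a single FFT does not fit. Assumes size_for(n) is non-decreasing in n.
template<class SizeFn>
unsigned int largest_ffts_per_block_within(SizeFn       size_for,
                                           unsigned int max_ffts_per_block,
                                           unsigned int budget_bytes) {
    assert(max_ffts_per_block > 0);
    unsigned int best = 0;
    for (unsigned int n = 1; n <= max_ffts_per_block; ++n) {
        if (size_for(n) > budget_bytes) {
            break; // larger batches cannot fit under the monotonicity assumption
        }
        best = n;
    }
    return best;
}
```

With the placeholder model and a budget of 48 * 1024 bytes, `largest_ffts_per_block_within(placeholder_size_bytes, 64u, 48u * 1024u)` selects 48. The budget itself is an input to choose for the target device and kernel configuration.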

Note

These functions return a valid result only when the traits come from a description that includes the DynamicBatching operator. To obtain the shared memory size, in bytes, required by other descriptions, see Shared Memory Size Trait.