LTO Examples

The cuFFTDx + cuFFT LTO EA package provides multiple samples which demonstrate how to use the features enabled by LTO.

Examples

Group | Subgroup | Example | Description
--- | --- | --- | ---
Introduction Examples | 09_introduction_lto_example | 00_introduction_lto_example | (offline) cuFFTDx LTO introduction
Introduction Examples | 04_nvrtc_fft | 03_nvrtc_fft_block_lto | (online) cuFFTDx LTO introduction
Introduction Examples | 10_cufft_device_api_example | 00_cufft_device_api_lto_example | (offline) cuFFT Device API introduction
Simple FFT Examples | 01_simple_fft_thread | 02_simple_fft_thread_lto | (offline) Complex-to-complex (C2C) thread FFT using LTO
Simple FFT Examples | 02_simple_fft_block_lto | 10_simple_fft_block_c2r_lto | (offline) Complex-to-real block FFT using LTO
NVRTC Examples (additional) | | 02_nvrtc_fft_thread_lto | (online) Complex-to-complex thread FFT using LTO
FFT Performance | | 02_block_fft_lto_ptx_performance | (offline) Benchmark for C2C block FFT (LTO vs PTX)

For more information about the “online” and “offline” annotations in the table above, see Augment cuFFTDx with LTO.

Introduction Examples

  • 00_introduction_lto_example

  • 03_nvrtc_fft_block_lto

  • 00_cufft_device_api_lto_example

These examples are used in the documentation to demonstrate how to adopt LTO features in existing cuFFTDx projects. 00_introduction_lto_example and 03_nvrtc_fft_block_lto appear in the Augment cuFFTDx with LTO section, where they showcase the steps for offline and online kernel generation, respectively. 00_cufft_device_api_lto_example, used in the Custom LTO Helper section, demonstrates how to write a custom LTO database helper tool with the cuFFT Device API, as an alternative to the LTO Helper.

Simple FFT Examples

  • 02_simple_fft_thread_lto

  • 10_simple_fft_block_c2r_lto

These are the LTO versions of the 02_simple_fft_thread and 10_simple_fft_block_c2r samples, respectively. See Simple FFT Examples in the official cuFFTDx documentation for more details.

NVRTC Examples

  • 02_nvrtc_fft_thread_lto

  • 03_nvrtc_fft_block_lto

These are the LTO versions of the existing NVRTC examples. They show how to use cuFFTDx at the thread and block level with NVRTC and nvJitLink. See NVRTC Examples in the official cuFFTDx documentation for more details.

FFT Performance

  • 02_block_fft_lto_ptx_performance

This example compares the performance of cuFFTDx device functions that compute an FFT using the inlined-PTX implementation against those using the LTOIR implementation.

Fig. 1 Performance comparison between LTOIR and PTX implementations of a single-precision complex-to-complex forward FFT. Tests were performed on H100 80GB with maximum clocks set.
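The source does not state which metric the benchmark reports, so the following is only a minimal sketch of how such a PTX vs. LTOIR comparison could be summarized from measured kernel times. It assumes the widely used convention of counting 5 * N * log2(N) floating-point operations per complex-to-complex FFT of size N. The helper names `fft_gflops` and `lto_speedup` are hypothetical and are not part of cuFFTDx or of the example itself.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not a cuFFTDx API): estimated throughput in GFLOP/s
// for `batches` complex-to-complex FFTs of length `fft_size` that together
// took `elapsed_ms` milliseconds. Assumes the conventional flop count of
// 5 * N * log2(N) per C2C FFT.
double fft_gflops(std::size_t fft_size, std::size_t batches, double elapsed_ms) {
    const double n             = static_cast<double>(fft_size);
    const double flops_per_fft = 5.0 * n * std::log2(n);
    const double total_flops   = flops_per_fft * static_cast<double>(batches);
    const double seconds       = elapsed_ms * 1.0e-3;
    return total_flops / seconds / 1.0e9;
}

// Hypothetical helper: ratio of the PTX time to the LTOIR time for the same
// workload. A value above 1 means the LTOIR variant was faster.
double lto_speedup(double ptx_ms, double ltoir_ms) {
    return ptx_ms / ltoir_ms;
}
```

For instance, 1000 FFTs of size 1024 completing in 1 ms correspond to 51.2 GFLOP/s under this convention, and `lto_speedup(2.0, 1.0)` reports that the LTOIR run took half the time of the PTX run.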