Memory requirements

cuFFTMp requires scratch memory whose size depends on the individual dimension sizes and the type of plan.

Assume we are using built-in slabs (i.e., CUFFT_XT_FORMAT_INPLACE and/or CUFFT_XT_FORMAT_INPLACE_SHUFFLED) and computing a transform with a total of S=X*Y*Z elements over n GPUs. Beyond the input/output descriptor (of approximately S/n elements per GPU), cuFFTMp requires between S/n and 2S/n elements of scratch per GPU.

When using custom data distributions (cufftXtSetDistribution with CUFFT_XT_FORMAT_DISTRIBUTED_INPUT and/or CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT), the amount of scratch can be larger. For a transform of size S=X*Y*Z over n GPUs, the amount of scratch per GPU is approximately

  • S/n to 2S/n when all GPUs are connected through peer-to-peer;

  • 3S/n to 4S/n otherwise.
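As a rough illustration of the ranges above, the following sketch turns them into a small estimator. It is not part of cuFFTMp: the `Layout` enum, the `ScratchRange` struct and the `estimate_scratch` function are hypothetical names introduced here, and the result is only the approximate range quoted in this section, counted in elements rather than bytes.

```cpp
#include <cstdint>

// Approximate per-GPU scratch range, in elements (not bytes).
struct ScratchRange {
    std::uint64_t min_elements;
    std::uint64_t max_elements;
};

// Hypothetical enum for this sketch only; it is not a cuFFTMp type.
enum class Layout {
    BuiltInSlabs,        // CUFFT_XT_FORMAT_INPLACE / CUFFT_XT_FORMAT_INPLACE_SHUFFLED
    CustomPeerToPeer,    // custom distribution, all GPUs connected through peer-to-peer
    CustomNoPeerToPeer   // custom distribution, otherwise
};

// Ceiling division, so that an uneven split is rounded up.
inline std::uint64_t ceil_div(std::uint64_t a, std::uint64_t b) {
    return (a + b - 1) / b;
}

// Applies the approximate ranges from the text:
//   S/n to 2S/n   for built-in slabs and for custom distributions with peer-to-peer,
//   3S/n to 4S/n  for custom distributions without peer-to-peer.
inline ScratchRange estimate_scratch(std::uint64_t x, std::uint64_t y, std::uint64_t z,
                                     std::uint64_t n_gpus, Layout layout) {
    const std::uint64_t per_gpu = ceil_div(x * y * z, n_gpus);  // about S/n
    if (layout == Layout::CustomNoPeerToPeer) {
        return {3 * per_gpu, 4 * per_gpu};
    }
    return {per_gpu, 2 * per_gpu};
}
```

For example, a 512 x 512 x 512 transform over 4 GPUs gives S/n = 33,554,432 elements, so this estimator yields roughly 33.5 to 67.1 million elements of scratch per GPU with built-in slabs. Multiply by the element size (for instance 8 bytes for single-precision complex data) to get an approximate byte count.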

Generally, minimum scratch usage can be obtained for pure powers of 2, 3, 5 and 7 and some composite sizes.

When using multiple plans (such as a C2R and an R2C plan), scratch can be shared between the plans using cufftSetWorkArea and cufftGetSize.
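The sizing logic behind scratch sharing can be sketched without any GPU code. The helpers below are hypothetical (they are not cuFFT or cuFFTMp functions) and assume that the per-plan scratch sizes have already been queried, for instance with cufftGetSize. One shared buffer then needs to be at least as large as the biggest requirement, instead of the sum of all of them.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Size of a single buffer able to serve every plan: the largest requirement.
// Assumption: the plans sharing the buffer never execute at the same time.
inline std::size_t shared_scratch_bytes(const std::vector<std::size_t>& plan_sizes) {
    if (plan_sizes.empty()) {
        return 0;
    }
    return *std::max_element(plan_sizes.begin(), plan_sizes.end());
}

// Total memory used if each plan keeps its own scratch buffer instead.
inline std::size_t separate_scratch_bytes(const std::vector<std::size_t>& plan_sizes) {
    return std::accumulate(plan_sizes.begin(), plan_sizes.end(), std::size_t{0});
}
```

With two plans that report, say, 100 and 80 bytes of scratch, a shared buffer needs 100 bytes rather than 180. In real code, the buffer of that size would then be passed to each plan with cufftSetWorkArea.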


The C++ sample located in r2c_c2r_shared_scratch shows an example of scratch sharing between plans.