Memory requirements
cuFFTMp requires a certain amount of scratch memory, which depends on the individual dimension sizes and the type of plan.
Assume we are using built-in slabs
(i.e., CUFFT_XT_FORMAT_INPLACE and/or CUFFT_XT_FORMAT_INPLACE_SHUFFLED)
and computing a transform with a total of S=X*Y*Z
elements over n GPUs. Beyond the input/output descriptor
(of approximately S/n elements per GPU), cuFFTMp requires between
S/n and 2S/n elements of scratch per GPU.
When using custom data distributions (cufftXtSetDistribution with CUFFT_XT_FORMAT_DISTRIBUTED_INPUT
and/or CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT), the amount of scratch can be larger.
For a transform of size S=X*Y*Z, the amount of scratch per GPU is approximately
S/n to 2S/n when all GPUs are connected through peer-to-peer, and 3S/n to 4S/n otherwise.
Generally, minimum scratch usage can be obtained for pure powers of 2, 3, 5 and 7 and some composite sizes.
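As a rough illustration, the ranges above can be turned into a small estimator. This is only a sketch of the arithmetic: `Layout`, `ScratchRange` and `estimate_scratch` are hypothetical names that are not part of cuFFT or cuFFTMp, and the actual scratch size chosen by the library may differ.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration; not part of cuFFT or cuFFTMp.
enum class Layout { BuiltInSlabs, CustomDistribution };

struct ScratchRange {
    std::uint64_t min_elements;  // lower estimate, in elements per GPU
    std::uint64_t max_elements;  // upper estimate, in elements per GPU
};

// Estimates the per-GPU scratch range for a transform of S = X*Y*Z elements
// distributed over n_gpus GPUs, using the approximate ranges quoted above.
inline ScratchRange estimate_scratch(std::uint64_t x, std::uint64_t y, std::uint64_t z,
                                     std::uint64_t n_gpus, Layout layout,
                                     bool all_peer_to_peer) {
    const std::uint64_t s = x * y * z;
    const std::uint64_t per_gpu = (s + n_gpus - 1) / n_gpus;  // ceil(S/n)

    // Only custom distributions without full peer-to-peer connectivity
    // fall into the 3S/n to 4S/n range; all other cases use S/n to 2S/n.
    const bool needs_more =
        (layout == Layout::CustomDistribution) && !all_peer_to_peer;

    return needs_more ? ScratchRange{3 * per_gpu, 4 * per_gpu}
                      : ScratchRange{per_gpu, 2 * per_gpu};
}
```

For example, a 512 x 512 x 512 transform over 4 GPUs with built-in slabs gives S/n = 33,554,432 elements, so the estimate is 33,554,432 to 67,108,864 elements of scratch per GPU. For a single-precision complex-to-complex transform (8 bytes per element), that corresponds to roughly 256 MiB to 512 MiB.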
When one has multiple plans
(such as a C2R and an R2C plan), it is possible to share scratch
between plans using cufftSetWorkArea and cufftGetSize.
Note
The C++ sample located in
r2c_c2r_shared_scratch shows an example of
scratch sharing between plans.
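The sizing side of scratch sharing can be sketched as follows. This is a minimal illustration that does not call cuFFT itself: `shared_work_size` is a hypothetical helper, and the sizes passed to it are assumed to be the work-area sizes reported for each plan (for instance by cufftGetSize). It also assumes the plans never execute at the same time, since they would otherwise overwrite each other's scratch.

```cpp
#include <algorithm>
#include <cstddef>
#include <initializer_list>

// Hypothetical helper, not part of cuFFT or cuFFTMp.
// Returns the size of a single buffer large enough to serve as the work
// area of every plan in the list, i.e. the largest individual requirement.
inline std::size_t shared_work_size(std::initializer_list<std::size_t> plan_work_sizes) {
    std::size_t largest = 0;
    for (std::size_t size : plan_work_sizes) {
        largest = std::max(largest, size);
    }
    return largest;
}

// Intended use (sketch only, the cuFFT calls are not made here):
//   1. query the work-area size of each plan, e.g. with cufftGetSize;
//   2. allocate one buffer of shared_work_size({size_r2c, size_c2r}) bytes;
//   3. attach that buffer to each plan with cufftSetWorkArea.
```

With two plans that each need a comparable amount of scratch, one shared buffer of the larger size replaces two separate allocations, which is where the memory saving comes from.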