Notable differences with the single-process, multi-GPU API
The following table summarizes notable differences between the single-process, multi-GPU cuFFT API and cuFFTMp in terms of requirements and API usage; illustrative usage sketches follow the table.
|  | Single-process, Multi-GPU (cuFFT) | Multi-process (cuFFTMp) |
|---|---|---|
| cufftXtSetGPUs | Required | Not allowed |
| cufftXtMemcpy(..., CUFFT_COPY_HOST_TO_DEVICE) | Copies the entire array from the CPU to multiple GPUs (in natural or permuted order) | Copies the local, distributed array from the CPU to the GPU without redistributing |
| cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_HOST) | Copies the entire array from multiple GPUs to the CPU (always in natural order) | Copies the local, distributed array from the GPU to the CPU without redistributing |
| cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_DEVICE) | Redistributes data between GPUs, to/from natural order and to/from permuted order | Redistributes data between GPUs, to/from natural order and to/from permuted order (not allowed with cufftMpMakePlanDecomposition) |
| Single-node interconnect | No restrictions | Peer-to-peer required |
| Descriptor memory | CUDA-visible | NVSHMEM-allocated |
| cufftMpMakePlanDecomposition | Not allowed | Optional |
| With desc a pointer to a cudaLibXtDesc and nGPUs GPUs | desc->descriptor->nGPUs == nGPUs | desc->descriptor->nGPUs == 1 |
| Minimum size (with nGPUs GPUs) | 32 in every dimension | nGPUs in the first two dimensions, 2 in the last dimension in 3D |
| Maximum number of GPUs | 16 | No limit |
| Batched transforms | Supported (but individual batches are not distributed across GPUs) | Not supported |
| cufftMpAttachComm | Not allowed | Deprecated in cuFFTMp 11.4.0 |
| cufftXtSetDistribution | Not allowed | Deprecated in cuFFTMp 11.4.0 |