Notable differences with the single-process, multi-GPU API
The following are a few notable differences between the single-process, multi-GPU cuFFT and cuFFTMp in terms of requirements and API usage.
| | Single-process, multi-GPU (cuFFT) | Multi-process (cuFFTMp) |
|---|---|---|
| `cufftXtSetGPUs` | Required | Not allowed |
| `cufftXtMemcpy(..., CUFFT_COPY_HOST_TO_DEVICE)` | Copies the entire array from the CPU to multiple GPUs (in natural or permuted order) | Copies the local, distributed array from the CPU to the GPU without redistributing |
| `cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_HOST)` | Copies the entire array from multiple GPUs to the CPU (always in natural order) | Copies the local, distributed array from the GPU to the CPU without redistributing |
| `cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_DEVICE)` | Redistributes data between GPUs to/from natural order to/from permuted order | Redistributes data between GPUs to/from natural order to/from permuted order (not allowed with `cufftMpMakePlanDecomposition`) |
| Single-node interconnect | No restrictions | Peer-to-peer required |
| Descriptor memory | CUDA-visible | NVSHMEM-allocated |
| `cufftMpMakePlanDecomposition` | Not allowed | Optional |
| With `desc` a pointer to a `cudaLibXtDesc` and `nGPUs` GPUs | `desc->descriptor->nGPUs == nGPUs` | `desc->descriptor->nGPUs == 1` |
| Minimum size (with `nGPUs` GPUs) | 32 in every dimension | `nGPUs` in the first two dimensions, 2 in the last dimension in 3D |
| Maximum number of GPUs | 16 | No limit |
| Batched transforms | Supported (but individual batches are not distributed across GPUs) | Not supported |
| `cufftMpAttachComm` | Not allowed | Deprecated in cuFFTMp 11.4.0 |
| `cufftXtSetDistribution` | Not allowed | Deprecated in cuFFTMp 11.4.0 |
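To make the cuFFTMp column concrete, the following is a minimal sketch of the communicator-attachment flow referenced in the table (the `cufftMpAttachComm` path, noted above as deprecated in cuFFTMp 11.4.0, rather than the newer `cufftMpMakePlanDecomposition` path). The grid size, the assumption that `NX` is divisible by the number of ranks, and the `host`/`desc` names are illustrative choices, not library requirements; error checking, data initialization, the inverse transform, and cleanup of a second descriptor are omitted.

```c
#include <mpi.h>
#include <stdlib.h>
#include <cufftXt.h>
#include <cufftMp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative 3D C2C size; per the table, cuFFTMp needs at least `size`
       elements in the first two dimensions and 2 in the last. */
    const int NX = 64, NY = 64, NZ = 64;

    cufftHandle plan;
    cufftCreate(&plan);

    /* Instead of cufftXtSetGPUs, the plan is bound to an MPI communicator:
       each process drives a single GPU. */
    MPI_Comm comm = MPI_COMM_WORLD;
    cufftMpAttachComm(plan, CUFFT_COMM_MPI, &comm);

    size_t workspace;
    cufftMakePlan3d(plan, NX, NY, NZ, CUFFT_C2C, &workspace);

    /* Descriptor memory is NVSHMEM-allocated, and desc->descriptor->nGPUs == 1. */
    cudaLibXtDesc *desc;
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);

    /* With the default slab decomposition each rank owns NX/size X-planes
       (assuming NX is divisible by the number of ranks). */
    size_t local_elems = (size_t)(NX / size) * NY * NZ;
    cufftComplex *host = (cufftComplex *)malloc(local_elems * sizeof(cufftComplex));
    /* ... fill `host` with this rank's portion of the input ... */

    /* H2D: copies only the local, distributed slab; no redistribution. */
    cufftXtMemcpy(plan, desc, host, CUFFT_COPY_HOST_TO_DEVICE);

    /* In-place forward transform; the output is left in permuted order. */
    cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);

    /* A CUFFT_COPY_DEVICE_TO_DEVICE cufftXtMemcpy between two descriptors
       would redistribute between permuted and natural order here. */

    /* D2H: copies this rank's portion (still in permuted order) back to the
       host, again without redistribution. */
    cufftXtMemcpy(plan, host, desc, CUFFT_COPY_DEVICE_TO_HOST);

    cufftXtFree(desc);
    cufftDestroy(plan);
    free(host);
    MPI_Finalize();
    return 0;
}
```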