Notable differences with the single-process, multi-GPU API
The following are a few notable differences between the single-process, multi-GPU cuFFT and cuFFTMp in terms of requirements and API usage.
| | Single-process, Multi-GPU | Multi-processes (cuFFTMp) |
|---|---|---|
| `cufftXtSetGPUs` | Required | Not allowed |
| `cufftMpAttachComm` | Not allowed | Required |
| `cufftXtMemcpy(..., CUFFT_COPY_HOST_TO_DEVICE)` | Copies the entire array from the CPU to multiple GPUs (in natural or permuted order) | Copies the local, distributed array from the CPU to the GPU without redistributing. |
| `cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_HOST)` | Copies the entire array from multiple GPUs to the CPU (always in natural order) | Copies the local, distributed array from the GPU to the CPU without redistributing. |
| `cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_DEVICE)` | Redistribute data between GPUs to/from natural order to/from permuted order | Redistribute data between GPUs to/from natural order to/from permuted order (not allowed with `cufftXtSetDistribution`) |
| Single-node interconnect | No restrictions | Peer-to-peer required |
| Descriptor memory | CUDA-visible | NVSHMEM-allocated |
| `cufftXtSetDistribution` | Not allowed | Optional |
| With `desc` a pointer to a `cudaLibXtDesc` and `nGPUs` GPUs | `desc->descriptor->nGPUs == nGPUs` | `desc->descriptor->nGPUs == 1` |
| Minimum size (with `nGPUs` GPUs) | 32 in every dimension | `nGPUs` in the first two dimensions, 2 in the last dimension in 3D. |
| Maximum number of GPUs | 16 | No limit |
| Batched transforms | Supported (but individual batches are not distributed across GPUs) | Not supported |
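
The size and GPU-count limits listed above can be checked before any plan is created. The following is a minimal sketch in plain C++, not part of cuFFT or cuFFTMp: the names `Api`, `Dims`, and `satisfiesTableLimits` are hypothetical, and restricting the check to 2D and 3D transforms is an assumption made for this illustration.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper types for this sketch; they do not exist in cuFFT.
enum class Api { SingleProcessMultiGpu, CufftMp };
using Dims = std::array<long long, 3>;

// Returns true if a transform with the given shape on nGPUs GPUs satisfies
// the minimum-size and maximum-GPU-count rows listed above.
// Assumption: rank is 2 or 3, and only the first `rank` entries of dims are read.
inline bool satisfiesTableLimits(Api api, int rank, const Dims& dims, int nGPUs) {
    assert(rank == 2 || rank == 3);
    assert(nGPUs >= 1);

    if (api == Api::SingleProcessMultiGpu) {
        // At most 16 GPUs, and at least 32 elements in every dimension.
        if (nGPUs > 16) return false;
        for (int i = 0; i < rank; ++i) {
            if (dims[i] < 32) return false;
        }
        return true;
    }

    // cuFFTMp: no GPU-count limit; at least nGPUs elements in the first
    // two dimensions, and at least 2 in the last dimension in 3D.
    if (dims[0] < nGPUs || dims[1] < nGPUs) return false;
    if (rank == 3 && dims[2] < 2) return false;
    return true;
}
```

For instance, a 64 x 64 x 2 transform on 32 GPUs passes the cuFFTMp check but fails the single-process check twice over: 32 GPUs exceeds the limit of 16, and the last dimension is smaller than 32.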