Notable differences with the single-process, multi-GPU API
The following table summarizes the notable differences between the single-process, multi-GPU cuFFT API and cuFFTMp in terms of requirements and API usage.
|  | Single-process, Multi-GPU | Multi-process (cuFFTMp) |
| --- | --- | --- |
| cufftXtSetGPUs | Required | Not allowed |
| cufftMpAttachComm | Not allowed | Required |
| cufftXtMemcpy(..., CUFFT_COPY_HOST_TO_DEVICE) | Copies the entire array from the CPU to multiple GPUs (in natural or permuted order) | Copies the local, distributed array from the CPU to the GPU, without redistributing |
| cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_HOST) | Copies the entire array from multiple GPUs to the CPU (always in natural order) | Copies the local, distributed array from the GPU to the CPU, without redistributing |
| cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_DEVICE) | Redistributes data between GPUs, between natural and permuted order | Redistributes data between GPUs, between natural and permuted order (not allowed with cufftXtSetDistribution) |
| Single-node interconnect | No restrictions | Peer-to-peer required |
| Descriptor memory | CUDA-visible | NVSHMEM-allocated |
| cufftXtSetDistribution | Not allowed | Optional |
| With desc a pointer to a cudaLibXtDesc and nGPUs GPUs | desc->descriptor->nGPUs == nGPUs | desc->descriptor->nGPUs == 1 |
| Minimum size (with nGPUs GPUs) | 32 in every dimension | nGPUs in the first two dimensions, 2 in the last dimension in 3D |
| Maximum number of GPUs | 16 | No limit |
| Batched transforms | Supported (but individual batches are not distributed across GPUs) | Not supported |
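
As an illustration of the first two rows, the sketch below contrasts plan setup in the two APIs. It is a minimal outline, not a complete program: error checking is omitted, the transform sizes and GPU indices are placeholders, and it assumes an MPI-enabled cuFFTMp build whose entry points (such as cufftMpAttachComm) are exposed through cufftMp.h.

```cpp
// Illustrative sketch only: error checking omitted, sizes and GPU indices
// are placeholders.
#include <mpi.h>
#include <cufftXt.h>   // single-process, multi-GPU API
#include <cufftMp.h>   // cuFFTMp API (assumes an MPI-enabled build)

// Single-process, multi-GPU: the plan is bound to local GPUs with
// cufftXtSetGPUs before cufftMakePlan*.
void plan_setup_single_process(int nx, int ny)
{
    cufftHandle plan;
    cufftCreate(&plan);

    int gpus[2] = {0, 1};
    cufftXtSetGPUs(plan, 2, gpus);   // required here, not allowed in cuFFTMp

    size_t workspace;
    cufftMakePlan2d(plan, nx, ny, CUFFT_C2C, &workspace);
    // ... cufftXtMalloc / cufftXtExecDescriptor / cufftXtFree ...
    cufftDestroy(plan);
}

// cuFFTMp: one GPU per process; an MPI communicator is attached with
// cufftMpAttachComm before cufftMakePlan*.
void plan_setup_cufftmp(int nx, int ny)
{
    cufftHandle plan;
    cufftCreate(&plan);

    MPI_Comm comm = MPI_COMM_WORLD;
    cufftMpAttachComm(plan, CUFFT_COMM_MPI, &comm);   // required here

    size_t workspace;
    cufftMakePlan2d(plan, nx, ny, CUFFT_C2C, &workspace);
    // ... cufftXtMalloc / cufftXtExecDescriptor / cufftXtFree ...
    cufftDestroy(plan);
}
```

In both cases the plan is created and planned the same way; only the device-selection step differs.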
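The next sketch outlines the descriptor workflow implied by the cufftXtMemcpy and descriptor rows above: in cuFFTMp the descriptor returned by cufftXtMalloc is NVSHMEM-allocated and has desc->descriptor->nGPUs == 1, and each copy moves only the calling process's local portion of the distributed array. The element count, the in-place sub-format choice, and the helper name are placeholders under those assumptions, not part of the library API.

```cpp
// Sketch only: assumes 'plan' was created, cufftMpAttachComm was attached,
// and cufftMakePlan* was called as in the previous example.
#include <complex>
#include <vector>
#include <cufftXt.h>
#include <cufftMp.h>

void run_distributed_fft(cufftHandle plan, size_t local_elements)
{
    // Local portion of the globally distributed array, one slab per process.
    std::vector<std::complex<float>> host_local(local_elements);

    // NVSHMEM-backed descriptor; with cuFFTMp, desc->descriptor->nGPUs == 1,
    // since each process owns exactly one GPU.
    cudaLibXtDesc *desc;
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);

    // Copies only the local, distributed piece; no redistribution happens here.
    cufftXtMemcpy(plan, desc, host_local.data(), CUFFT_COPY_HOST_TO_DEVICE);

    // Execute the distributed transform on the descriptor.
    cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);

    // Again, only the local piece comes back; the data stays distributed.
    cufftXtMemcpy(plan, host_local.data(), desc, CUFFT_COPY_DEVICE_TO_HOST);

    cufftXtFree(desc);
}
```

Per the table, moving between natural and permuted order is still done with cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_DEVICE), but only with the default slab decomposition; it is not allowed once cufftXtSetDistribution has been used.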