Notable differences with the single-process, multi-GPU API

The following are a few notable differences between the single-process, multi-GPU cuFFT API and the multi-process cuFFTMp API, in terms of requirements and API usage.

| Feature or requirement | Single-process, Multi-GPU | Multi-process (cuFFTMp) |
|---|---|---|
| cufftXtSetGPUs | Required | Not allowed |
| cufftMpAttachComm | Not allowed | Required |
| cufftXtMemcpy (..., CUFFT_COPY_HOST_TO_DEVICE) | Copies the entire array from the CPU to multiple GPUs (in natural or permuted order) | Copies the local, distributed array from the CPU to the GPU without redistributing |
| cufftXtMemcpy (..., CUFFT_COPY_DEVICE_TO_HOST) | Copies the entire array from multiple GPUs to the CPU (always in natural order) | Copies the local, distributed array from the GPU to the CPU without redistributing |
| cufftXtMemcpy (..., CUFFT_COPY_DEVICE_TO_DEVICE) | Redistributes data between GPUs to/from natural order to/from permuted order | Redistributes data between GPUs to/from natural order to/from permuted order (not allowed with cufftXtSetDistribution) |
| Single-node interconnect | No restrictions | Peer-to-peer required |
| Descriptor memory | CUDA-visible | NVSHMEM-allocated |
| cufftXtSetDistribution | Not allowed | Optional |
| With desc a pointer to a cudaLibXtDesc and nGPUs GPUs | desc->descriptor->nGPUs == nGPUs | desc->descriptor->nGPUs == 1 |
| Minimum size (with nGPUs GPUs) | 32 in every dimension | nGPUs in the first two dimensions, 2 in the last dimension in 3D |
| Maximum number of GPUs | 16 | No limit |
| Batched transforms | Supported (but individual batches are not distributed across GPUs) | Not supported |
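
The call requirements and size limits listed above can be written down as a small pre-flight check. The following is a minimal sketch in plain C++, not part of cuFFT or cuFFTMp: the names Api, SetupCalls, Problem, setup_calls_ok and problem_ok are hypothetical and exist only for illustration. The sketch models 2D and 3D transforms only and rejects other ranks, which is an assumption of the helper and says nothing about what either library supports.

```cpp
#include <array>

// Hypothetical helper types: none of these names exist in cuFFT or cuFFTMp.
enum class Api { SingleProcessMultiGpu, CufftMp };

// Which setup calls the application has made on the plan.
struct SetupCalls {
    bool set_gpus;          // cufftXtSetGPUs was called
    bool attach_comm;       // cufftMpAttachComm was called
    bool set_distribution;  // cufftXtSetDistribution was called
};

// Encodes the "Required / Not allowed / Optional" entries listed above.
bool setup_calls_ok(Api api, SetupCalls c) {
    if (api == Api::SingleProcessMultiGpu) {
        // cufftXtSetGPUs is required; the two cuFFTMp calls are not allowed.
        return c.set_gpus && !c.attach_comm && !c.set_distribution;
    }
    // cuFFTMp: cufftMpAttachComm is required, cufftXtSetGPUs is not allowed,
    // and cufftXtSetDistribution is optional.
    return !c.set_gpus && c.attach_comm;
}

struct Problem {
    int rank;                       // 2 or 3 (the only ranks this sketch models)
    std::array<long long, 3> dims;  // dims[0 .. rank-1] are used
    int n_gpus;                     // number of GPUs taking part in the transform
    int batch;                      // 1 means a single, non-batched transform
};

// Encodes the minimum-size, maximum-GPU-count and batching entries listed above.
bool problem_ok(Api api, const Problem& p) {
    if (p.rank != 2 && p.rank != 3) return false;  // outside this sketch's scope
    if (p.n_gpus < 1 || p.batch < 1) return false;

    if (api == Api::SingleProcessMultiGpu) {
        if (p.n_gpus > 16) return false;           // at most 16 GPUs
        for (int i = 0; i < p.rank; ++i) {
            if (p.dims[i] < 32) return false;      // at least 32 in every dimension
        }
        return true;                               // batched transforms are supported
    }

    // cuFFTMp
    if (p.batch != 1) return false;                // batched transforms not supported
    if (p.dims[0] < p.n_gpus || p.dims[1] < p.n_gpus) return false;
    if (p.rank == 3 && p.dims[2] < 2) return false;
    return true;                                   // no limit on the GPU count
}
```

For instance, an 8 x 8 x 2 transform on 8 GPUs passes the cuFFTMp check in this sketch (each of the first two dimensions is at least nGPUs and the last is at least 2), while the same problem fails the single-process check because its dimensions are below 32. A real application would still need to rely on the error codes returned by the library itself.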