Notable differences with the single-process, multi-GPU API

The following table summarizes the notable differences between the single-process, multi-GPU cuFFT API and cuFFTMp in terms of requirements and API usage.

|                                                    | Single-process, multi-GPU (cuFFT)                                                   | Multi-process (cuFFTMp)                                                                                        |
|----------------------------------------------------|-------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| cufftXtSetGPUs                                     | Required                                                                            | Not allowed                                                                                                    |
| cufftXtMemcpy(..., CUFFT_COPY_HOST_TO_DEVICE)      | Copies the entire array from the CPU to multiple GPUs (in natural or permuted order) | Copies the local, distributed array from the CPU to the GPU without redistributing                            |
| cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_HOST)      | Copies the entire array from multiple GPUs to the CPU (always in natural order)     | Copies the local, distributed array from the GPU to the CPU without redistributing                            |
| cufftXtMemcpy(..., CUFFT_COPY_DEVICE_TO_DEVICE)    | Redistributes data between GPUs to/from natural order to/from permuted order        | Redistributes data between GPUs to/from natural order to/from permuted order (not allowed with cufftMpMakePlanDecomposition) |
| Single-node interconnect                           | No restrictions                                                                     | Peer-to-peer required                                                                                          |
| Descriptor memory                                  | CUDA-visible                                                                        | NVSHMEM-allocated                                                                                              |
| cufftMpMakePlanDecomposition                       | Not allowed                                                                         | Optional                                                                                                       |
| With desc a pointer to a cudaLibXtDesc and nGPUs GPUs | desc->descriptor->nGPUs == nGPUs                                                 | desc->descriptor->nGPUs == 1                                                                                   |
| Minimum size (with nGPUs GPUs)                     | 32 in every dimension                                                               | nGPUs in the first two dimensions, 2 in the last dimension for 3D transforms                                   |
| Maximum number of GPUs                             | 16                                                                                  | No limit                                                                                                       |
| Batched transforms                                 | Supported (but individual batches are not distributed across GPUs)                  | Not supported                                                                                                  |
| cufftMpAttachComm                                  | Not allowed                                                                         | Deprecated in cuFFTMp 11.4.0                                                                                   |
| cufftXtSetDistribution                             | Not allowed                                                                         | Deprecated in cuFFTMp 11.4.0                                                                                   |
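
As a concrete illustration of the cuFFTMp column, the sketch below shows a minimal one-GPU-per-rank workflow: there is no cufftXtSetGPUs call, the plan is instead attached to an MPI communicator, and cufftXtMemcpy moves only the local portion of the distributed array. It uses the cufftMpAttachComm path (deprecated in cuFFTMp 11.4.0 but still the simplest illustration); error checking, per-node GPU selection, and the grid size are simplifying assumptions, not part of the table above.

```c
// Minimal cuFFTMp sketch: one MPI rank per GPU, distributed in-place 3D C2C FFT.
// Assumptions: nx is divisible by the number of ranks, one visible GPU per rank,
// and error checking is omitted for brevity.
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include <cufftMp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cudaSetDevice(0);  // assumption: each rank sees exactly one GPU

    const int nx = 64, ny = 64, nz = 64;  // satisfies the minimum-size rules above

    cufftHandle plan;
    cufftCreate(&plan);

    // No cufftXtSetGPUs here: the plan is bound to an MPI communicator instead.
    // (cufftMpAttachComm is deprecated as of cuFFTMp 11.4.0.)
    MPI_Comm comm = MPI_COMM_WORLD;
    cufftMpAttachComm(plan, CUFFT_COMM_MPI, &comm);

    size_t workspace;
    cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &workspace);

    // Descriptor memory is NVSHMEM-allocated; desc->descriptor->nGPUs == 1.
    cudaLibXtDesc *desc;
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);

    // Host buffer holding only this rank's local slab (natural order).
    size_t local_elems = (size_t)(nx / size) * ny * nz;
    cufftComplex *cpu_data = (cufftComplex *)calloc(local_elems, sizeof(cufftComplex));

    // Copies only the local, distributed portion; no redistribution happens here.
    cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE);

    cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);

    // Likewise copies only the local portion (now in permuted order) back to the host.
    cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST);

    free(cpu_data);
    cufftXtFree(desc);
    cufftDestroy(plan);
    MPI_Finalize();
    return 0;
}
```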