API usage

Note

The CUDALibrarySamples repository on Github contains several examples showing how to use cuFFTMp in various situations. The reader is encouraged to explore those while reading this document.

Like in the single-process, multi-GPU API, input and output arrays in cuFFTMp are distributed across multiple GPUs. This is done using descriptors, which hold both a pointer to the local data and an enum describing how the data is distributed across multiple GPUs. Note that in cuFFTMp every descriptor only holds a pointer to the data local to its GPU, and not to other GPUs.

cuFFTMp supports two kinds of data distributions.

Built-in slabs

This is done using CUFFT_XT_FORMAT_INPLACE and CUFFT_XT_FORMAT_INPLACE_SHUFFLED. This is the default, optimized data distribution and is identical to the single process API. Consider an array of size XxYxZ distributed over n GPUs.

  • With CUFFT_XT_FORMAT_INPLACE, the first X%n GPUs each own (X/n+1)*Y*Z elements and the remaining GPUs each own (X/n)*Y*Z elements.

  • With CUFFT_XT_FORMAT_INPLACE_SHUFFLED, the first Y%n GPUs each own X*(Y/n+1)*Z elements and the remaining GPUs each own X*(Y/n)*Z elements.

Note that this data distribution does not allow strides between elements and assumes in-place transforms with an in-place data layout. This means a real dimension of length Z need to be padded to 2(Z/2+1) elements.

Flexible slabs and pencils

Using CUFFT_XT_FORMAT_DISTRIBUTED_INPUT and CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT. This lets the user use any Slab or Pencil data distribution (with arbitrary strides between elements). While efficient, this distribution less optimized than the previous one.

Usage with built-in slab data decompositions

When using the default and optimized data distribution, cuFFTMp’s API usage is similar to the single-process, multi-GPU cuFFT API usage. The following outlines the sequence of required API calls.

// Initialize MPI, create some distribute data
MPI_Init(...)
MPI_Comm comm = MPI_COMM_WORLD;

// Initialize a cuFFT handle
cufftHandle plan = 0;
cufftCreate(&plan);

// Attach an MPI communicator
cufftMpAttachComm(plan, CUFFT_COMM_MPI, &comm);

// Create a plan
size_t workspace;
cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &workspace);

// Allocate a multi-GPU descriptor
cudaLibXtDesc *desc;
cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);

// Copy data from the CPU to the GPU.
// The CPU data is distributed according to CUFFT_XT_FORMAT_INPLACE
cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE);

// Compute a forward FFT
cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);

// Copy memory back to the CPU. Data is now distributed according to CUFFT_XT_FORMAT_INPLACE_SHUFFLED
cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST);

// Free the multi-GPU descriptor and the plan and finalize MPI
cufftXtFree(desc);
cufftDestroy(plan);
MPI_Finalize();

This code snippet does the following

  1. Initialize MPI using MPI_Init, create a distributed 2D or 3D array (in natural or permuted order) on CPU.

  2. (Optional) On multi-GPU systems, select a GPU using cudaSetDevice.

  3. Create an empty plan cufftCreate(&plan)

  4. Attach the MPI communicator comm to the plan, indicating to cuFFT to enable the multi-process functionalities

    cufftMpAttachComm(plan, CUFFT_COMM_MPI, comm)
    
  5. (Optional) Attach a stream cufftSetStream(plan, stream)

  6. Make the plan. This creates a plan for a 3D complex-to-complex transform of size nx*ny*nz

    cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, workspace)
    

    and this creates a plan for a 2D complex-to-complex transform of size nx*ny

    cufftMakePlan2d(plan, nx, ny, CUFFT_C2C, workspace)
    

    Since cuFFTMp only supports in-place transforms without strides, cufftMakePlanMany should not be used. Note that this is a collective call that should be called by all processes within the MPI communicator.

  7. Allocate GPU memory to hold the data using cufftXtMalloc(plan, desc, CUFFT_XT_FORMAT_INPLACE). CUFFT_XT_FORMAT_INPLACE indicates that the data is distributed according to the natural order. CUFFT_XT_FORMAT_INPLACE_SHUFFLED can be used to allocate data in permuted order. The memory is allocated in desc->descriptor->data[0].

  8. Copy data from the CPU to the GPU using cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE). Note that the data can also be initialized on GPU from the beginning, and using cufftXtMemcpy is merely a convenience function. GPU data can be directly accessed through the descriptor by using the device memory pointer desc->descriptor->data[0].

  1. Run the transform

cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD)
  1. At this point, data is distributed in the permuted order. If required, element-wise operations, filtering, a backward transform, etc. can be performed.

  2. Copy data back from the GPU to the CPU using cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST) This does not change the data distribution. If the GPU data is distributed according to CUFFT_XT_FORMAT_INPLACE (or CUFFT_XT_FORMAT_INPLACE_SHUFFLED), the same holds for the CPU data.

  3. Free the descriptor cufftXtFree(desc)

  4. Destroy the plan cufftDestroy(plan)

  5. Finalize MPI using MPI_Finalize

Warning

cufftMakePlanMany and cufftXtMakePlanMany should not be used with cuFFTMp. If they are used, any stride information will be ignored

Finally, you can build and run as follows:

  • Include the cufftMp.h header

  • Compile the application and link to both cuFFTMp (libcufftMp.so) and the CUDA driver (libcuda.so). Note that statically linking to NVSHMEM (libnvshmem.a) is not allowed. Also note that libcufftMp.so contains both the single and multi-process cuFFT functionalities and it is not necessary to link to libcufft.so as well.

    mpicxx app.cpp -I/path/to/cufft/include -L/path/to/cufft/lib -L/path/to/cuda/lib -lcufftMp -lcuda -o app
    
  • At runtime, ensure that the NVSHMEM bootstrap (nvshmem_bootstrap_mpi.so) is present in LD_LIBRARY_PATH

    export LD_LIBRARY_PATH=/path/to/nvshmem/lib:$LD_LIBRARY_PATH
    
  • If necessary, increase the amount of memory reserved for NVSHMEM by setting the NVSHMEM_SYMMETRIC_SIZE environment variable to, e.g., 10GB:

    export NVSHMEM_SYMMETRIC_SIZE=10G
    
  • Run the application using MPI, e.g.,

    mpirun -n 2 ./app
    

Usage with custom slabs & pencils data decompositions

cuFFTMp also supports arbitrary data distributions in the form of 3D boxes. Consider a X*Y*Z global array. 3D boxes are used to describe a subsection of this global array by indicating the lower and upper corner of the subsection. By associating boxes to processes one can then describe a data distribution where every process owns a contiguous rectangular subsection of the global array. For instance, consider a 2D case with a global array of size X*Y = 4x4 and three boxes, described as box = {lower, upper}

  • Box 0 = { {0,0}, {2,2} } (in green)

  • Box 1 = { {2,0}, {4,2} } (in blue)

  • Box 2 = { {0,2}, {4,4} } (in purple)

Illustration of a data decomposition of a 4x4 world box into three boxes.

By associating box 0 to process 0, box 1 to process 1 and box 2 to process 2, this creates a data distribution of the global 4*4 array across three processes. The same process can be generalized to a 3D array.

This logic can be used to generalize the usage from the previous section and use any data distribution. This requires the use of the cufftXtSetDistribution(plan, &input_box, &output_box) function. With this function, cufftXtExecDescriptor(plan, ...) can accept descriptors with a data distribution of type CUFFT_XT_FORMAT_DISTRIBUTED_INPUT or CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT. For a given plan,

  • CUFFT_XT_FORMAT_DISTRIBUTED_INPUT indicates that the data in the descriptor is distributed according to input_box;

  • CUFFT_XT_FORMAT_DISTRIBUTED_OUTOUT indicates that the data in the descriptor is distributed according to output_box.

Note that, in addition,

  • R2C plans only support CUFFT_XT_FORMAT_DISTRIBUTED_INPUT, i.e., input_box should describe the real data distribution and output_box the complex data distribution;

  • C2R plans only support CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT, i.e., input_box should describe the real data distribution and output_box the complex data distribution.

This implies that the proper usage for a R2C followed-by C2R plan is

real_box = { ... };
complex_box = { ... }
cufftXtSetDistribution(plan_r2c, &real_box, &complex_box);
cufftXtSetDistribution(plan_c2r, &real_box, &complex_box);
...
cufftXtExecDescriptor(plan_r2c, desc, desc, CUFFT_FORWARD);
...
cufftXtExecDescriptor(plan_r2c, desc, desc, CUFFT_INVERSE);

Overall, for a C2C plan, this leads to the following:

// Initialize MPI, create some distribute data
MPI_Init(...)
MPI_Comm comm = MPI_COMM_WORLD;

// Create plan
cufftHandle plan = 0;
cufftCreate(&plan);
cufftMpAttachComm(plan, CUFFT_COMM_MPI, &mpi_comm);

// Initialize and pass boxes to the plan
cufftBox3d input_box, output_box;
cufftXtSetDistribution(plan, &input_box, &output_box);

// Make the cuFFT plan
size_t scratch;
cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &scratch);

// Allocate a multi-GPU descriptor
cudaLibXtDesc *desc;
cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);

// Copy data from the CPU to the GPU.
// The CPU data is distributed according to CUFFT_XT_FORMAT_DISTRIBUTED_INPUT (i.e., input_box)
cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE);

// Compute a forward FFT
cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);

// Copy memory back to the CPU. Data is now distributed according to CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT (i.e., output_box)
cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST);

// Free the multi-GPU descriptor and the plan and finalize MPI
cufftXtFree(desc);
cufftDestroy(plan);
MPI_Finalize();
  1. Create an empty plan cufftCreate(&plan)

  2. Attach the MPI communicator to the plan, indicating to cuFFT to enable the multi-process functionalities cufftMpAttachComm(plan, CUFFT_COMM_MPI, comm)

  3. (Optional) Attach a stream cufftSetStream(plan, stream)

  4. Describe the data distribution. First define the input and output boxes, describing the subsection of the global array owned by this process. Then, call cufftXtSetDistribution to register the boxes with cuFFT

    cufftBox3d input_box, output_box;
    input_box.lower = {...}     // A size_t[3] array
    input_box.upper = {...}     // A size_t[3] array
    input_box.strides = {...}   // A size_t[3] array
    output_box.lower = {...}    // A size_t[3] array
    output_box.upper = {...}    // A size_t[3] array
    output_box.strides = {...}  // A size_t[3] array
    cufftXtSetDistribution(plan, &input_box, &output_box)
    
  5. Make the plan with

    cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &scratch)
    

    This creates a plan for a 3D transform of (global) size nx*ny*nz.

  6. Allocate GPU memory to hold the data. CUFFT_XT_FORMAT_DISTRIBUTED_INPUT indicates that the data is distributed according to input_box

    cufftXtMalloc(plan, desc, CUFFT_XT_FORMAT_DISTRIBUTED_INPUT)
    
  7. Copy data from the CPU to the GPU using

    cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE)
    

    Note that using cufftXtMemcpy is optional. Instead, data can be initialized and manipulated on the device directly. Also not that cufftXtMemcpy copies the entire CPU buffer, effectively ignoring the all but the first-dimension strides.

  8. Run the transform

    cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD)
    
  9. Copy data back from the GPU to the CPU. At this point, data is distributed according to CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT, i.e., output_box.

    cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST)
    
  10. Free the descriptor with cufftXtFree(desc)

  11. Destroy the plan with cufftDestroy(plan)

Reshape API

cuFFT exposes a distributed reshape API which enables redistributing data described using 3D boxes without using a cuFFT plan, descriptors or cufftXtMemcpy. Its usage follows the general cuFFT usage but with a new handle type.

// Initialize MPI
MPI_Init(...);
MPI_Comm comm = MPI_COMM_WORLD;

// Define input and output boxes
cufftBox3d in_box = { ... }
cufftBox3d out_box = { ... }

// Allocate reshape, attach MPI communicator
cufftReshapeHandle handle;
cufftMpCreateReshape(&handle);
cufftMpAttachReshapeComm(handle, CUFFT_COMM_MPI, &comm);

// Build reshape
cufftMpMakeReshape(handle, sizeof(int), &in_box, &out_box);

// Create a stream
cudaStream_t stream;
cudaStreamCreate(&stream);

// Retreive workspace size
size_t workspace;
cufftMpGetReshapeSize(handle, &workspace);

// Allocate NVSHMEM memory
void *src, *dst, *scratch;
_cufftMpNvshmemMalloc(input_size  * sizeof(int), &src));
_cufftMpNvshmemMalloc(output_size * sizeof(int), &dst));
_cufftMpNvshmemMalloc(workspace,                 &scratch));

// Apply reshape
CUFFT_CHECK(cufftMpExecReshapeAsync(handle, dst, src, scratch, stream));

// Cleanup
_cufftMpNvshmemFree(src);
_cufftMpNvshmemFree(dst);
_cufftMpNvshmemFree(scratch);
cufftMpDestroyReshape(handle);
cudaStreamDestroy(stream);
MPI_Finalize();
  1. Create an empty reshape handle, cufftMpCreateReshape(handle)

  2. Attach the MPI communicator, cufftMpAttachReshapeComm(handle, CUFFT_COMM_MPI, comm)

  3. Make the reshape

    cufftMpMakeReshape(handle, element_size, input_box, output_box)
    

    input_box (one on each process) describes the input distribution, and output_box describes the output distribution. element_size is the size, in bytes, of a single element. This is a collective call.

  4. Retrieve the workspace size, in bytes using cufftMpGetReshapeSize(handle, workspace).

  5. Allocate workspace using NVSHMEM

  6. Execute the reshape with

    cufftMpExecReshapeAsync(handle, dst, src, workspace, stream)
    

    This is a stream-ordered, collective call. dst, src, workspace should all be pointers to a symmetric-heap, NVSHMEM-allocated memory buffer. Note that this differs from MPI, where dst, src, workspace would be regular pointers to cudaMalloc’ed memory. See also NVSHMEM’s documentation on the symmetric heap here.

  7. Cleanup and destroy the reshape with cufftMpDestroyReshape(handle).

Note

Those functions require the use of NVSHMEM to allocate memory. To facilitate this, cuFFT defines (but does not expose in the headers) the following functions. For more information, see the NVSHMEM documentation.

extern "C" cufftResult _cufftMpNvshmemMalloc(size_t size, void **buff);
extern "C" cufftResult _cufftMpNvshmemFree(void *buff);

Those functions may be removed in future releases of cuFFT in favor of using the NVSHMEM API directly.

Note that symmetric-heap NVSHMEM-allocated buffers have a different semantic as local buffers In particular, dst, src, workspace should all refer to the same symmetric object allocated on all GPUs. This is different than when passing pointers to a message-passing based library like MPI, where pointers refer only to local objects.

Helper functions

Like in the single-process, multi-GPU API, cuFFTMp provides helper functions to help with managing multiple GPUs memory. Their usage is similar to the single-process case.

  • cufftXtMalloc is used to allocate memory on multiple GPUs in multiple processes. It takes an argument indicating which data decomposition.

  • cufftXtMemcpy is used to copy memory between CPU and GPU or between GPUs. Its usage depends on the source and destination of the data transfer:

    • cufftXtMemcpy(plan, dst, src, CUFFT_COPY_HOST_TO_DEVICE) copies distributed CPU data to GPU, keeping the distribution unchanged. This means the CPU data has to be distributed in the same way as the GPU data.

    • cufftXtMemcpy(plan, dst, src, CUFFT_COPY_DEVICE_TO_HOST) copies distributed GPU data to CPU, keeping the distribution unchanged. This means the CPU data will be distributed in the same way as the GPU data.

    • cufftXtMemcpy(plan, dst, src, CUFFT_COPY_DEVICE_TO_DEVICE) can be used to re-distribute GPU data from CUFFT_XT_FORMAT_INPUT_SHUFFLED to CUFFT_XT_FORMAT_INPUT (or vice-versa). This can be useful to transform permuted data back to natural order for easier post-processing.

  • cufftXtFree is used to free a descriptor