API usage¶
Note
The CUDALibrarySamples
repository on Github contains
several examples showing how to use cuFFTMp in various situations.
The reader is encouraged to explore those while reading this document.
Like in the single-process, multi-GPU API, input and output arrays in cuFFTMp are distributed across multiple GPUs. This is done using descriptors, which hold both a pointer to the local data and an enum describing how the data is distributed across multiple GPUs. Note that in cuFFTMp every descriptor only holds a pointer to the data local to its GPU, and not to other GPUs.
cuFFTMp supports two kinds of data distributions.
Built-in slabs
This is done using CUFFT_XT_FORMAT_INPLACE
and CUFFT_XT_FORMAT_INPLACE_SHUFFLED
.
This is the default, optimized data distribution and is identical to the single process API.
Consider an array of size XxYxZ
distributed over n
GPUs.
With
CUFFT_XT_FORMAT_INPLACE
, the firstX%n
GPUs each own(X/n+1)*Y*Z
elements and the remaining GPUs each own(X/n)*Y*Z
elements.With
CUFFT_XT_FORMAT_INPLACE_SHUFFLED
, the firstY%n
GPUs each ownX*(Y/n+1)*Z
elements and the remaining GPUs each ownX*(Y/n)*Z
elements.
Note that this data distribution does not allow strides between elements and assumes in-place transforms with an in-place data layout. This means a real dimension of length Z
need to be padded to 2(Z/2+1)
elements.
Flexible slabs and pencils
Using CUFFT_XT_FORMAT_DISTRIBUTED_INPUT
and CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT
.
This lets the user use any Slab or Pencil data distribution (with arbitrary strides between elements).
While efficient, this distribution less optimized than the previous one.
Usage with built-in slab data decompositions¶
When using the default and optimized data distribution, cuFFTMp’s API usage is similar to the single-process, multi-GPU cuFFT API usage. The following outlines the sequence of required API calls.
// Initialize MPI, create some distribute data
MPI_Init(...)
MPI_Comm comm = MPI_COMM_WORLD;
// Initialize a cuFFT handle
cufftHandle plan = 0;
cufftCreate(&plan);
// Attach an MPI communicator
cufftMpAttachComm(plan, CUFFT_COMM_MPI, &comm);
// Create a plan
size_t workspace;
cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &workspace);
// Allocate a multi-GPU descriptor
cudaLibXtDesc *desc;
cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);
// Copy data from the CPU to the GPU.
// The CPU data is distributed according to CUFFT_XT_FORMAT_INPLACE
cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE);
// Compute a forward FFT
cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);
// Copy memory back to the CPU. Data is now distributed according to CUFFT_XT_FORMAT_INPLACE_SHUFFLED
cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST);
// Free the multi-GPU descriptor and the plan and finalize MPI
cufftXtFree(desc);
cufftDestroy(plan);
MPI_Finalize();
This code snippet does the following
Initialize MPI using
MPI_Init
, create a distributed 2D or 3D array (in natural or permuted order) on CPU.(Optional) On multi-GPU systems, select a GPU using
cudaSetDevice
.Create an empty plan
cufftCreate(&plan)
Attach the MPI communicator
comm
to the plan, indicating to cuFFT to enable the multi-process functionalitiescufftMpAttachComm(plan, CUFFT_COMM_MPI, comm)
(Optional) Attach a stream
cufftSetStream(plan, stream)
Make the plan. This creates a plan for a 3D complex-to-complex transform of size
nx*ny*nz
cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, workspace)
and this creates a plan for a 2D complex-to-complex transform of size
nx*ny
cufftMakePlan2d(plan, nx, ny, CUFFT_C2C, workspace)
Since cuFFTMp only supports in-place transforms without strides,
cufftMakePlanMany
should not be used. Note that this is a collective call that should be called by all processes within the MPI communicator.Allocate GPU memory to hold the data using
cufftXtMalloc(plan, desc, CUFFT_XT_FORMAT_INPLACE)
.CUFFT_XT_FORMAT_INPLACE
indicates that the data is distributed according to the natural order.CUFFT_XT_FORMAT_INPLACE_SHUFFLED
can be used to allocate data in permuted order. The memory is allocated indesc->descriptor->data[0]
.Copy data from the CPU to the GPU using
cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE)
. Note that the data can also be initialized on GPU from the beginning, and usingcufftXtMemcpy
is merely a convenience function. GPU data can be directly accessed through the descriptor by using the device memory pointerdesc->descriptor->data[0]
.
Run the transform
cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD)
At this point, data is distributed in the permuted order. If required, element-wise operations, filtering, a backward transform, etc. can be performed.
Copy data back from the GPU to the CPU using
cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST)
This does not change the data distribution. If the GPU data is distributed according toCUFFT_XT_FORMAT_INPLACE
(orCUFFT_XT_FORMAT_INPLACE_SHUFFLED
), the same holds for the CPU data.Free the descriptor
cufftXtFree(desc)
Destroy the plan
cufftDestroy(plan)
Finalize MPI using
MPI_Finalize
Warning
cufftMakePlanMany
and cufftXtMakePlanMany
should not be used with cuFFTMp.
If they are used, any stride information will be ignored
Finally, you can build and run as follows:
Include the
cufftMp.h
headerCompile the application and link to both cuFFTMp (
libcufftMp.so
) and the CUDA driver (libcuda.so
). Note that statically linking to NVSHMEM (libnvshmem.a
) is not allowed. Also note thatlibcufftMp.so
contains both the single and multi-process cuFFT functionalities and it is not necessary to link tolibcufft.so
as well.mpicxx app.cpp -I/path/to/cufft/include -L/path/to/cufft/lib -L/path/to/cuda/lib -lcufftMp -lcuda -o app
At runtime, ensure that the NVSHMEM bootstrap (
nvshmem_bootstrap_mpi.so
) is present inLD_LIBRARY_PATH
export LD_LIBRARY_PATH=/path/to/nvshmem/lib:$LD_LIBRARY_PATH
If necessary, increase the amount of memory reserved for NVSHMEM by setting the
NVSHMEM_SYMMETRIC_SIZE
environment variable to, e.g., 10GB:export NVSHMEM_SYMMETRIC_SIZE=10G
Run the application using MPI, e.g.,
mpirun -n 2 ./app
Usage with custom slabs & pencils data decompositions¶
cuFFTMp also supports arbitrary data distributions in the form of 3D boxes.
Consider a X*Y*Z
global array.
3D boxes are used to describe a subsection of this global array by indicating the lower and upper corner of the subsection.
By associating boxes to processes one can then describe a data distribution where every process owns a contiguous rectangular subsection of the global array.
For instance, consider a 2D case with a global array of size X*Y = 4x4
and three boxes, described as box = {lower, upper}
Box 0 = { {0,0}, {2,2} }
(in green)Box 1 = { {2,0}, {4,2} }
(in blue)Box 2 = { {0,2}, {4,4} }
(in purple)
By associating box 0 to process 0, box 1 to process 1 and box 2 to process 2, this creates a data distribution of the global 4*4
array across three processes.
The same process can be generalized to a 3D array.
This logic can be used to generalize the usage from the previous section and use any data distribution.
This requires the use of the cufftXtSetDistribution(plan, &input_box, &output_box)
function.
With this function, cufftXtExecDescriptor(plan, ...)
can accept descriptors with a data distribution of type CUFFT_XT_FORMAT_DISTRIBUTED_INPUT
or
CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT
.
For a given plan
,
CUFFT_XT_FORMAT_DISTRIBUTED_INPUT
indicates that the data in the descriptor is distributed according toinput_box
;CUFFT_XT_FORMAT_DISTRIBUTED_OUTOUT
indicates that the data in the descriptor is distributed according tooutput_box
.
Note that, in addition,
R2C
plans only supportCUFFT_XT_FORMAT_DISTRIBUTED_INPUT
, i.e.,input_box
should describe the real data distribution andoutput_box
the complex data distribution;C2R
plans only supportCUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT
, i.e.,input_box
should describe the real data distribution andoutput_box
the complex data distribution.
This implies that the proper usage for a R2C
followed-by C2R
plan is
real_box = { ... }; complex_box = { ... } cufftXtSetDistribution(plan_r2c, &real_box, &complex_box); cufftXtSetDistribution(plan_c2r, &real_box, &complex_box); ... cufftXtExecDescriptor(plan_r2c, desc, desc, CUFFT_FORWARD); ... cufftXtExecDescriptor(plan_r2c, desc, desc, CUFFT_INVERSE);
Overall, for a C2C plan, this leads to the following:
// Initialize MPI, create some distribute data MPI_Init(...) MPI_Comm comm = MPI_COMM_WORLD; // Create plan cufftHandle plan = 0; cufftCreate(&plan); cufftMpAttachComm(plan, CUFFT_COMM_MPI, &mpi_comm); // Initialize and pass boxes to the plan cufftBox3d input_box, output_box; cufftXtSetDistribution(plan, &input_box, &output_box); // Make the cuFFT plan size_t scratch; cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &scratch); // Allocate a multi-GPU descriptor cudaLibXtDesc *desc; cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE); // Copy data from the CPU to the GPU. // The CPU data is distributed according to CUFFT_XT_FORMAT_DISTRIBUTED_INPUT (i.e., input_box) cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE); // Compute a forward FFT cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD); // Copy memory back to the CPU. Data is now distributed according to CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT (i.e., output_box) cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST); // Free the multi-GPU descriptor and the plan and finalize MPI cufftXtFree(desc); cufftDestroy(plan); MPI_Finalize();
Create an empty plan
cufftCreate(&plan)
Attach the MPI communicator to the plan, indicating to cuFFT to enable the multi-process functionalities
cufftMpAttachComm(plan, CUFFT_COMM_MPI, comm)
(Optional) Attach a stream
cufftSetStream(plan, stream)
Describe the data distribution. First define the input and output boxes, describing the subsection of the global array owned by this process. Then, call
cufftXtSetDistribution
to register the boxes with cuFFTcufftBox3d input_box, output_box; input_box.lower = {...} // A size_t[3] array input_box.upper = {...} // A size_t[3] array input_box.strides = {...} // A size_t[3] array output_box.lower = {...} // A size_t[3] array output_box.upper = {...} // A size_t[3] array output_box.strides = {...} // A size_t[3] array cufftXtSetDistribution(plan, &input_box, &output_box)
Make the plan with
cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &scratch)
This creates a plan for a 3D transform of (global) size
nx*ny*nz
.Allocate GPU memory to hold the data.
CUFFT_XT_FORMAT_DISTRIBUTED_INPUT
indicates that the data is distributed according toinput_box
cufftXtMalloc(plan, desc, CUFFT_XT_FORMAT_DISTRIBUTED_INPUT)
Copy data from the CPU to the GPU using
cufftXtMemcpy(plan, desc, cpu_data, CUFFT_COPY_HOST_TO_DEVICE)
Note that using
cufftXtMemcpy
is optional. Instead, data can be initialized and manipulated on the device directly. Also not thatcufftXtMemcpy
copies the entire CPU buffer, effectively ignoring the all but the first-dimension strides.Run the transform
cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD)
Copy data back from the GPU to the CPU. At this point, data is distributed according to
CUFFT_XT_FORMAT_DISTRIBUTED_OUTPUT
, i.e.,output_box
.cufftXtMemcpy(plan, cpu_data, desc, CUFFT_COPY_DEVICE_TO_HOST)
Free the descriptor with
cufftXtFree(desc)
Destroy the plan with
cufftDestroy(plan)
Reshape API¶
cuFFT exposes a distributed reshape API which enables redistributing data described using 3D boxes without using a cuFFT plan, descriptors or cufftXtMemcpy
.
Its usage follows the general cuFFT usage but with a new handle type.
// Initialize MPI MPI_Init(...); MPI_Comm comm = MPI_COMM_WORLD; // Define input and output boxes cufftBox3d in_box = { ... } cufftBox3d out_box = { ... } // Allocate reshape, attach MPI communicator cufftReshapeHandle handle; cufftMpCreateReshape(&handle); cufftMpAttachReshapeComm(handle, CUFFT_COMM_MPI, &comm); // Build reshape cufftMpMakeReshape(handle, sizeof(int), &in_box, &out_box); // Create a stream cudaStream_t stream; cudaStreamCreate(&stream); // Retreive workspace size size_t workspace; cufftMpGetReshapeSize(handle, &workspace); // Allocate NVSHMEM memory void *src, *dst, *scratch; _cufftMpNvshmemMalloc(input_size * sizeof(int), &src)); _cufftMpNvshmemMalloc(output_size * sizeof(int), &dst)); _cufftMpNvshmemMalloc(workspace, &scratch)); // Apply reshape CUFFT_CHECK(cufftMpExecReshapeAsync(handle, dst, src, scratch, stream)); // Cleanup _cufftMpNvshmemFree(src); _cufftMpNvshmemFree(dst); _cufftMpNvshmemFree(scratch); cufftMpDestroyReshape(handle); cudaStreamDestroy(stream); MPI_Finalize();
Create an empty reshape handle,
cufftMpCreateReshape(handle)
Attach the MPI communicator,
cufftMpAttachReshapeComm(handle, CUFFT_COMM_MPI, comm)
Make the reshape
cufftMpMakeReshape(handle, element_size, input_box, output_box)
input_box
(one on each process) describes the input distribution, andoutput_box
describes the output distribution.element_size
is the size, in bytes, of a single element. This is a collective call.Retrieve the workspace size, in bytes using
cufftMpGetReshapeSize(handle, workspace)
.Allocate workspace using NVSHMEM
Execute the reshape with
cufftMpExecReshapeAsync(handle, dst, src, workspace, stream)
This is a stream-ordered, collective call.
dst, src, workspace
should all be pointers to a symmetric-heap, NVSHMEM-allocated memory buffer. Note that this differs from MPI, wheredst, src, workspace
would be regular pointers to cudaMalloc’ed memory. See also NVSHMEM’s documentation on the symmetric heap here.Cleanup and destroy the reshape with
cufftMpDestroyReshape(handle)
.
Note
Those functions require the use of NVSHMEM to allocate memory. To facilitate this, cuFFT defines (but does not expose in the headers) the following functions. For more information, see the NVSHMEM documentation.
extern "C" cufftResult _cufftMpNvshmemMalloc(size_t size, void **buff); extern "C" cufftResult _cufftMpNvshmemFree(void *buff);
Those functions may be removed in future releases of cuFFT in favor of using the NVSHMEM API directly.
Note that symmetric-heap NVSHMEM-allocated buffers have a different semantic as local buffers
In particular, dst, src, workspace
should all refer to the same symmetric object allocated on all GPUs.
This is different than when passing pointers to a message-passing based library like MPI, where pointers refer only to local objects.
Helper functions¶
Like in the single-process, multi-GPU API, cuFFTMp provides helper functions to help with managing multiple GPUs memory. Their usage is similar to the single-process case.
cufftXtMalloc
is used to allocate memory on multiple GPUs in multiple processes. It takes an argument indicating which data decomposition.cufftXtMemcpy
is used to copy memory between CPU and GPU or between GPUs. Its usage depends on the source and destination of the data transfer:cufftXtMemcpy(plan, dst, src, CUFFT_COPY_HOST_TO_DEVICE)
copies distributed CPU data to GPU, keeping the distribution unchanged. This means the CPU data has to be distributed in the same way as the GPU data.cufftXtMemcpy(plan, dst, src, CUFFT_COPY_DEVICE_TO_HOST)
copies distributed GPU data to CPU, keeping the distribution unchanged. This means the CPU data will be distributed in the same way as the GPU data.cufftXtMemcpy(plan, dst, src, CUFFT_COPY_DEVICE_TO_DEVICE)
can be used to re-distribute GPU data fromCUFFT_XT_FORMAT_INPUT_SHUFFLED
toCUFFT_XT_FORMAT_INPUT
(or vice-versa). This can be useful to transform permuted data back to natural order for easier post-processing.
cufftXtFree
is used to free a descriptor