FFT

class nvmath.distributed.fft.FFT(operand, distribution, *, options=None, stream=None)

Create a stateful object that encapsulates the specified distributed FFT computations and required resources. This object ensures the validity of resources during use and releases them when they are no longer needed to prevent misuse.

This object encompasses all functionalities of function-form APIs fft() and ifft(), which are convenience wrappers around it. The stateful object also allows for the amortization of preparatory costs when the same FFT operation is to be performed on multiple operands with the same problem specification (see reset_operand() for more details).

Using the stateful object typically involves the following steps:

Problem Specification: Initialize the object with a defined operation and options.
Preparation: Use plan() to determine the best algorithmic implementation for this specific FFT operation.
Execution: Perform the FFT computation with execute(), which can run either a forward or an inverse transform.
Resource Management: Ensure all resources are released either by explicitly calling free() or by managing the stateful object within a context manager.

Detailed information on each step described above can be obtained by passing in a logging.Logger object to FFTOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )
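To route these messages to a dedicated logger instead of the root logger, a logging.Logger can be supplied through the options. The following is a minimal sketch, assuming the FFTOptions parameter that accepts the logger is named logger (the text above only states that a Logger can be passed to FFTOptions). The logger name nvmath_fft_demo is hypothetical, and the FFT call is left as a comment because it needs an initialized nvmath.distributed context:

```python
import logging

# Hypothetical logger name, chosen for illustration only.
logger = logging.getLogger("nvmath_fft_demo")
logger.setLevel(logging.DEBUG)

# Attach a handler that uses the same format as the root-logger example.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter(
        "%(asctime)s %(levelname)-8s %(message)s",
        datefmt="%m-%d %H:%M:%S",
    )
)
logger.addHandler(handler)

# Options may be given as a dict of FFTOptions constructor parameters.
# Assumption: the parameter that accepts the logger is called "logger".
options = {"logger": logger}

# With an initialized nvmath.distributed context and an operand `a`:
# f = nvmath.distributed.fft.FFT(
#     a, distribution=nvmath.distributed.fft.Slab.X, options=options
# )
```

Using a named logger keeps the FFT diagnostics separate from the rest of the application's log output.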

Parameters:

operand –
A tensor (ndarray-like object). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

Important

GPU operands must be on the symmetric heap (for example, allocated with nvmath.distributed.allocate_symmetric_memory()).
distribution – Specifies the distribution of input and output operands across processes, which can be: (i) according to a Slab distribution (see Slab), or (ii) a custom box distribution. With Slab distribution, this indicates the distribution of the input operand (the output operand will use the complementary Slab distribution). With box distribution, this indicates the input and output boxes.
options – Specify options for the FFT as an FFTOptions object. Alternatively, a dict containing the parameters for the FFTOptions constructor can also be provided. If not specified, the value will be set to the default-constructed FFTOptions object.
stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
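To make the two kinds of distribution more concrete, the sketch below builds a pair of boxes that mirrors a Slab.X input with its complementary Slab.Y output. It is plain Python and does not call nvmath. It assumes that each box is a (lower, upper) pair of global coordinates with an exclusive upper corner, and that the partitioned extent divides evenly among the processes. The helpers slab_box and box_shape are hypothetical names introduced for illustration:

```python
def slab_box(global_shape, nranks, rank, axis):
    """Return the (lower, upper) corners of the slab owned by `rank`.

    Hypothetical helper. Assumes the extent along `axis` divides evenly
    among the processes and that the upper corner is exclusive.
    """
    extent = global_shape[axis]
    if extent % nranks != 0:
        raise ValueError("this sketch only handles evenly divisible extents")
    chunk = extent // nranks
    lower = [0] * len(global_shape)
    upper = list(global_shape)
    lower[axis] = rank * chunk
    upper[axis] = (rank + 1) * chunk
    return tuple(lower), tuple(upper)


def box_shape(box):
    """Return the local shape covered by a (lower, upper) box."""
    lower, upper = box
    return tuple(u - l for l, u in zip(lower, upper))


global_shape = (128, 128, 128)
nranks, rank = 4, 1

input_box = slab_box(global_shape, nranks, rank, axis=0)   # partitioned on X
output_box = slab_box(global_shape, nranks, rank, axis=1)  # partitioned on Y

# A box distribution is given as the input box followed by the output box.
distribution = [input_box, output_box]

print(input_box, box_shape(input_box))
print(output_box, box_shape(output_box))
```

For rank 1 of 4, the input box covers a local shape of (32, 128, 128) and the output box a local shape of (128, 32, 128).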

Examples

>>> import cupy as cp
>>> import nvmath.distributed

Get MPI communicator used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the FFT examples in nvmath/examples/distributed/fft):

>>> comm = nvmath.distributed.get_context().communicator

Get the number of processes:

>>> nranks = comm.Get_size()

Create a 3-D complex128 ndarray on GPU symmetric memory, distributed according to the Slab distribution on the X axis (the global shape is (128, 128, 128)):

>>> shape = 128 // nranks, 128, 128

cuFFTMp uses the NVSHMEM PGAS model for distributed computation, which requires GPU operands to be on the symmetric heap:

>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=cp.complex128)

After allocating, we initialize the CuPy ndarray’s memory:

>>> a[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)

We will define a 3-D C2C FFT operation, creating an FFT object encapsulating the above problem specification. Each process provides its own local operand (which is part of the PGAS space, but otherwise can be operated on as any other CuPy ndarray for local operations) and specifies how the operand is distributed across processes:

>>> f = nvmath.distributed.fft.FFT(a, distribution=nvmath.distributed.fft.Slab.X)

More information on distribution of operands can be found in the documentation: (TODO: link to docs).

Options can be provided above to control the behavior of the operation using the options argument (see FFTOptions).

Next, plan the FFT:

>>> f.plan()

Now execute the FFT, and obtain the result r1 as a CuPy ndarray. Note that distributed FFT computations are in-place, so operands a and r1 share the same symmetric memory buffer:

>>> r1 = f.execute()

Finally, free the FFT object’s resources. To avoid this explicit call, it’s recommended to use the FFT object as a context manager as shown below, if possible.

>>> f.free()

Any symmetric memory that is owned by the user must be deleted explicitly (this is a collective call and must be called by all processes). Note that because operands a and r1 share the same buffer, only one of them must be freed:

>>> nvmath.distributed.free_symmetric_memory(a)

Note that all FFT methods execute on the current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.

Let’s now look at the same problem with NumPy ndarrays on the CPU.

Create a 3-D complex128 NumPy ndarray on the CPU:

>>> import numpy as np
>>> shape = 128 // nranks, 128, 128
>>> a = np.random.rand(*shape) + 1j * np.random.rand(*shape)

Create an FFT object encapsulating the problem specification described earlier and use it as a context manager:

>>> with nvmath.distributed.fft.FFT(a, distribution=nvmath.distributed.fft.Slab.X) as f:
...     f.plan()
...
...     # Execute the FFT to get the first result.
...     r1 = f.execute()

All the resources used by the object are released at the end of the block.

The operation was performed on the GPU: the NumPy array was temporarily copied to GPU symmetric memory and transformed there.

Further examples can be found in the nvmath/examples/distributed/fft directory.

Methods

__init__(operand, distribution: Slab | Sequence[Sequence[Sequence[int]]], *, options: FFTOptions | None = None, stream=None)

execute(*, direction=None, stream=None, release_workspace=False, sync_symmetric_memory: bool = True)

Execute the FFT operation.

Parameters:

direction – Specify whether forward or inverse FFT is performed (as an FFTDirection object, as a string from ['forward', 'inverse'], or as an int from [-1, 1] denoting forward and inverse directions respectively).
stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
release_workspace – A value of True specifies that the FFT object should release workspace memory back to the symmetric memory pool on function return, while a value of False specifies that the object should retain the memory. This option may be set to True if the application performs other operations that consume a lot of memory between successive calls to the (same or different) execute() API, but incurs an overhead due to obtaining and releasing workspace memory from and to the symmetric memory pool on every call. The default is False. NOTE: All processes must use the same value or the application can deadlock.
sync_symmetric_memory – Indicates whether to issue a symmetric memory synchronization operation on the execute stream before the FFT. Note that before the FFT starts executing, it is required that the input operand be ready on all processes. A symmetric memory synchronization ensures completion and visibility by all processes of previously issued local stores to symmetric memory. Advanced users who choose to manage the synchronization on their own using the appropriate NVSHMEM API, or who know that GPUs are already synchronized on the source operand, can set this to False.
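The accepted spellings of direction can be summarized with a small lookup. The sketch below is illustrative only: normalize_direction is a hypothetical helper, not part of nvmath, and it covers just the string and integer forms listed above (not FFTDirection objects or the default of None):

```python
FORWARD, INVERSE = "forward", "inverse"

# String and integer spellings described for the `direction` parameter:
# -1 denotes the forward transform and 1 denotes the inverse transform.
_ALIASES = {"forward": FORWARD, -1: FORWARD, "inverse": INVERSE, 1: INVERSE}


def normalize_direction(direction):
    """Map 'forward'/'inverse' or -1/1 to a canonical string (hypothetical)."""
    try:
        return _ALIASES[direction]
    except (KeyError, TypeError):
        raise ValueError(f"unsupported direction: {direction!r}") from None


print(normalize_direction("forward"), normalize_direction(-1))
print(normalize_direction("inverse"), normalize_direction(1))
```

Under these assumptions, f.execute(direction="inverse") and f.execute(direction=1) request the same transform.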

Returns:

The transformed operand, which remains on the same device and utilizes the same package as the input operand. The data type and shape of the transformed operand depend on the type of input operand, and choice of distribution and reshape option:

For C2C FFT, the data type remains identical to the input.
For slab distribution with reshape=True, the shape will remain identical.
For slab distribution with reshape=False, the shape will be the converse slab shape.
For custom box distribution, the shape will depend on the output box of each process.

For GPU operands, the result will be in symmetric memory and the user is responsible for explicitly deallocating it (for example, using nvmath.distributed.free_symmetric_memory(tensor)).
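The slab shape rules above can be sketched in plain Python. This is an illustration, not the library's implementation: local_slab_shape and result_shape are hypothetical helpers, axis 0 stands for Slab.X and axis 1 for Slab.Y, and the partitioned extent is assumed to divide evenly among the processes:

```python
def local_slab_shape(global_shape, nranks, axis):
    """Local shape when `global_shape` is partitioned along `axis`.

    Hypothetical helper; assumes the extent divides evenly by `nranks`.
    """
    if global_shape[axis] % nranks != 0:
        raise ValueError("this sketch only handles evenly divisible extents")
    shape = list(global_shape)
    shape[axis] //= nranks
    return tuple(shape)


def result_shape(global_shape, nranks, input_axis, reshape):
    """Local result shape for a slab-distributed FFT (hypothetical helper).

    input_axis is 0 for Slab.X and 1 for Slab.Y. With reshape=True the
    result keeps the input slab; with reshape=False it uses the other slab.
    """
    output_axis = input_axis if reshape else 1 - input_axis
    return local_slab_shape(global_shape, nranks, output_axis)


global_shape, nranks = (128, 128, 128), 4

print(result_shape(global_shape, nranks, input_axis=0, reshape=True))
print(result_shape(global_shape, nranks, input_axis=0, reshape=False))
```

With four processes and a Slab.X input, the local result has shape (32, 128, 128) when reshape=True and (128, 32, 128) when reshape=False.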

free()

Free FFT resources.

It is recommended that the FFT object be used within a context manager. If that is not possible, this method must be called explicitly to ensure that the FFT resources (especially internal library objects) are properly cleaned up.
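When a context manager cannot be used, a try/finally block is one way to make sure free() still runs if planning or execution raises. The sketch below uses a hypothetical stand-in class, StandInFFT, that mimics only the plan/execute/free lifecycle so that it runs without nvmath or a GPU; the simulated failure is purely illustrative:

```python
class StandInFFT:
    """Hypothetical stand-in that mimics only the plan/execute/free lifecycle."""

    def __init__(self):
        self.freed = False

    def plan(self):
        pass

    def execute(self):
        # Simulate an error in the middle of the workflow.
        raise RuntimeError("simulated failure during execution")

    def free(self):
        self.freed = True


f = StandInFFT()
try:
    f.plan()
    f.execute()
except RuntimeError as exc:
    print(f"execution failed: {exc}")
finally:
    # Runs whether or not an exception was raised above.
    f.free()

print(f.freed)
```

The same pattern applies to a real FFT object: place plan() and execute() inside the try block and free() inside finally.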

plan(*, stream: AnyStream | None = None)

Plan the FFT.

Parameters:

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

reset_operand(operand=None, distribution: Slab | Sequence[Sequence[Sequence[int]]] | None = None, *, stream=None)

Reset the operand held by this FFT instance. This method has two use cases:

It can be used to provide a new operand for execution.
It can be used to release the internal reference to the previous operand, and potentially make its memory available for other use, by passing operand=None.

Parameters:

operand –
A tensor (ndarray-like object) compatible with the previous one or None (default). A value of None will release the internal reference to the previous operand, and the user is expected to set a new operand before calling execute() again. The new operand is considered compatible if all the following properties match those of the previous one:
- The operand distribution: (a) if the FFT was planned using a Slab distribution, the reset operand must also use a Slab distribution (both X and Y axes are valid regardless of the slab axis at plan time), (b) if the FFT was planned using a box distribution, the distribution of the reset operand must be (input_box, output_box) or (output_box, input_box) where input_box and output_box are the boxes specified at plan time.
- The operand data type.
- The package that the new operand belongs to.
- The memory space of the new operand (CPU or GPU).
- The device that the new operand belongs to if it is on GPU.
distribution – Specifies the distribution of input and output operands across processes, which can be: (i) according to a Slab distribution (see Slab), or (ii) a custom box distribution. With Slab distribution, this indicates the distribution of the input operand (the output operand will use the complementary Slab distribution). With box distribution, this indicates the input and output boxes.
stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

Examples

>>> import cupy as cp
>>> import nvmath.distributed

>>> comm = nvmath.distributed.get_context().communicator
>>> nranks = comm.Get_size()

Create a 3-D complex128 ndarray on GPU symmetric memory, distributed according to the Slab distribution on the X axis (the global shape is (128, 128, 128)):

>>> shape = 128 // nranks, 128, 128
>>> dtype = cp.complex128
>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=dtype)
>>> a[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)

Create an FFT object as a context manager:

>>> with nvmath.distributed.fft.FFT(a, nvmath.distributed.fft.Slab.X) as f:
...     # Plan the FFT
...     f.plan()
...
...     # Execute the FFT to get the first result.
...     r1 = f.execute()
...
...     # Reset the operand to a new CuPy ndarray.
...     b = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=dtype)
...     b[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
...     f.reset_operand(b)
...
...     # Execute to get the new result corresponding to the updated operand.
...     r2 = f.execute()

With reset_operand(), overhead is kept to a minimum because problem specification and planning are performed only once.

For the particular example above, explicitly calling reset_operand() is equivalent to updating the operand in-place, i.e., replacing f.reset_operand(b) with a[:] = b. Note that updating the operand in-place should be adopted with caution, as it yields the expected result and incurs no additional copies only under the additional constraint below:

The operand’s distribution is the same.

For more details, refer to the in-place update example.