FFT

class nvmath.distributed.fft.FFT(
    operand,
    /,
    *,
    distribution: Distribution | Sequence[Box],
    options: FFTOptions | None = None,
    stream: AnyStream | None = None,
)
Create a stateful object that encapsulates the specified distributed FFT computations and required resources. This object ensures the validity of resources during use and releases them when they are no longer needed to prevent misuse.
This object encompasses all the functionality of the function-form APIs fft(), ifft(), rfft(), and irfft(), which are convenience wrappers around it. The stateful object also allows preparatory costs to be amortized when the same FFT operation is to be performed on multiple operands with the same problem specification (see reset_operand() for more details).

Using the stateful object typically involves the following steps:
1. Problem Specification: Initialize the object with a defined operation and options.
2. Preparation: Use plan() to determine the best algorithmic implementation for this specific FFT operation.
3. Execution: Perform the FFT computation with execute(), which can be either a forward or inverse FFT transformation.
4. Resource Management: Ensure all resources are released, either by explicitly calling free() or by managing the stateful object within a context manager.
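The four steps above follow the usual stateful-API lifecycle. As a rough, library-free sketch (using a hypothetical `_DemoFFT` stand-in, not the real `FFT` class), the pattern looks like:

```python
# Illustrative only: a hypothetical stand-in for the real FFT class,
# showing the specify -> plan -> execute -> free lifecycle.
class _DemoFFT:
    def __init__(self, operand):
        self.operand = operand      # 1. problem specification
        self.planned = False

    def __enter__(self):
        return self

    def plan(self):
        self.planned = True         # 2. choose an implementation
        return self

    def execute(self):
        if not self.planned:
            raise RuntimeError("plan() must be called before execute()")
        # Placeholder "transform"; the real class runs a distributed FFT.
        return [x * 2 for x in self.operand]

    def free(self):
        self.operand = None         # 4. release resources

    def __exit__(self, *exc):
        self.free()                 # context manager frees automatically


with _DemoFFT([1, 2, 3]) as f:
    f.plan()
    result = f.execute()            # 3. execution
```

The context-manager form ensures `free()` runs even if execution raises, which is the recommended usage pattern for the real class as well.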
Detailed information on each step described above can be obtained by passing in a logging.Logger object to FFTOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )
Changed in version 0.9.0: The operand parameter is now positional-only.

Parameters:

operand – A tensor (ndarray-like object). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

    Important: GPU operands must be on the symmetric heap (for example, allocated with nvmath.distributed.allocate_symmetric_memory()).

distribution – Specifies the distribution of the input and output operands across processes, which can be: (i) a Slab distribution (see nvmath.distributed.distribution.Slab), or (ii) a custom box distribution (see nvmath.distributed.distribution.Box). With a Slab distribution, this indicates the distribution of the input operand (the output operand will use the complementary Slab distribution). With a box distribution, this indicates the input and output boxes.

options – Specify options for the FFT as an FFTOptions object. Alternatively, a dict containing the parameters for the FFTOptions constructor can also be provided. If not specified, the value will be set to the default-constructed FFTOptions object.

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.
Examples
>>> import cupy as cp
>>> import nvmath.distributed
Get process group used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the FFT examples in nvmath/examples/distributed/fft):
>>> process_group = nvmath.distributed.get_context().process_group
Get the number of processes:
>>> nranks = process_group.nranks
Create a 3-D complex128 ndarray on GPU symmetric memory, distributed according to the Slab distribution on the X axis (the global shape is (128, 128, 128)):
>>> from nvmath.distributed.distribution import Slab
>>> shape = 128 // nranks, 128, 128
cuFFTMp uses the NVSHMEM PGAS model for distributed computation, which requires GPU operands to be on the symmetric heap:
>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=cp.complex128)
After allocating, we initialize the CuPy ndarray’s memory:
>>> a[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
We will define a 3-D C2C FFT operation, creating an FFT object encapsulating the above problem specification. Each process provides its own local operand (which is part of the PGAS space, but otherwise can be operated on like any other CuPy ndarray for local operations) and specifies how the operand is distributed across processes:
>>> f = nvmath.distributed.fft.FFT(a, distribution=Slab.X)
More information on distribution of operands can be found in the documentation: https://docs.nvidia.com/cuda/nvmath-python/latest/distributed-apis/fft/index.html
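To make the Slab distribution concrete, here is a small illustrative helper (not part of nvmath-python) that computes the local slab shape on each process, assuming the global extent along the split axis divides evenly by the number of processes:

```python
def slab_local_shape(global_shape, nranks, axis=0):
    """Local operand shape for a Slab distribution along `axis`.

    Assumes global_shape[axis] is evenly divisible by nranks
    (illustrative helper, not an nvmath-python API).
    """
    shape = list(global_shape)
    shape[axis] //= nranks
    return tuple(shape)

# With 4 processes and a (128, 128, 128) global shape:
print(slab_local_shape((128, 128, 128), 4))          # Slab.X -> (32, 128, 128)
print(slab_local_shape((128, 128, 128), 4, axis=1))  # Slab.Y -> (128, 32, 128)
```

This is why the example above constructs the local shape as `128 // nranks, 128, 128`: each process allocates only its slab of the global array.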
Options can be provided above to control the behavior of the operation using the options argument (see FFTOptions).

Next, plan the FFT:
>>> f.plan()
Now execute the FFT, and obtain the result r1 as a CuPy ndarray. Note that distributed FFT computations are in-place, so operands a and r1 share the same symmetric memory buffer:

>>> r1 = f.execute()
Finally, free the FFT object’s resources. To avoid this explicit call, it’s recommended to use the FFT object as a context manager as shown below, if possible.
>>> f.free()
Any symmetric memory that is owned by the user must be deleted explicitly (this is a collective call and must be made by all processes). Note that because operands a and r1 share the same buffer, only one of them should be freed:
>>> nvmath.distributed.free_symmetric_memory(a)
Note that all FFT methods execute on the current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.

Let's now look at the same problem with NumPy ndarrays on the CPU.
Create a 3-D complex128 NumPy ndarray on the CPU:
>>> import numpy as np
>>> shape = 128 // nranks, 128, 128
>>> a = np.random.rand(*shape) + 1j * np.random.rand(*shape)
Create an FFT object encapsulating the problem specification described earlier and use it as a context manager.
>>> with nvmath.distributed.fft.FFT(a, distribution=Slab.X) as f:
...     f.plan()
...
...     # Execute the FFT to get the first result.
...     r1 = f.execute()
All the resources used by the object are released at the end of the block.
The operation itself still runs on the GPU: the NumPy array is temporarily copied to GPU symmetric memory and transformed there.
Further examples can be found in the nvmath/examples/distributed/fft directory.
Methods
execute(
    *,
    direction: FFTDirection | None = None,
    stream: AnyStream | None = None,
    release_workspace: bool = False,
    sync_symmetric_memory: bool = True,
)
Execute the FFT operation.
Parameters:

direction – Specify whether a forward or inverse FFT is performed (an FFTDirection object, a string from ['forward', 'inverse'], or an int from [-1, 1] denoting the forward and inverse directions respectively).

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.

release_workspace – A value of True specifies that the FFT object should release workspace memory back to the symmetric memory pool on function return, while a value of False specifies that the object should retain the memory. This option may be set to True if the application performs other operations that consume a lot of memory between successive calls to the (same or different) execute() API, but it incurs an overhead due to obtaining and releasing workspace memory from and to the symmetric memory pool on every call. The default is False. NOTE: All processes must use the same value or the application can deadlock.

sync_symmetric_memory – Indicates whether to issue a symmetric memory synchronization operation on the execute stream before the FFT. Before the FFT starts executing, the input operand must be ready on all processes. A symmetric memory synchronization ensures completion and visibility, by all processes, of previously issued local stores to symmetric memory. Advanced users who choose to manage the synchronization on their own using the appropriate NVSHMEM API, or who know that the GPUs are already synchronized on the source operand, can set this to False.
Returns:

The transformed operand, which remains on the same device and uses the same package as the input operand. The data type and shape of the transformed operand depend on the type of the input operand and the choice of distribution and reshape option:

- For C2C FFT, the data type remains identical to the input.
- For R2C and C2R FFT, the data type differs from the input. The global output shape differs from the global input shape, which affects the shape of the result on every process.
- For Slab distribution with reshape=True, the shape on this process is the slab shape according to the same distribution as the input operand.
- For Slab distribution with reshape=False, the shape on this process is the complementary slab shape.
- For a custom box distribution, the shape will depend on the output box of each process.

For GPU operands, the result will be in symmetric memory and the user is responsible for explicitly deallocating it (for example, using nvmath.distributed.free_symmetric_memory(tensor)).
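The shape rules above can be illustrated with two small helpers (illustrative only, not nvmath-python APIs): standard R2C packing shrinks the last axis to n // 2 + 1 due to Hermitian symmetry, and Slab.X/Slab.Y are complementary along axes 0 and 1:

```python
def r2c_global_output_shape(global_shape):
    """Global output shape of an R2C FFT: the last axis becomes
    n // 2 + 1 (standard Hermitian-symmetry packing).
    Illustrative helper, not an nvmath-python API."""
    *head, last = global_shape
    return tuple(head) + (last // 2 + 1,)

def complementary_slab_axis(axis):
    """Slab.X (axis 0) and Slab.Y (axis 1) are complementary:
    with reshape=False the result is distributed on the other axis."""
    return 1 - axis

print(r2c_global_output_shape((128, 128, 128)))  # (128, 128, 65)
print(complementary_slab_axis(0))                # 1 (Slab.X input -> Slab.Y output)
```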
free()[source]#
Free FFT resources.
It is recommended that the FFT object be used within a context, but if that is not possible then this method must be called explicitly to ensure that the FFT resources (especially internal library objects) are properly cleaned up.
plan(*, stream: AnyStream | None = None)
Plan the FFT.
Parameters:

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.
release_operand()[source]#
This method is experimental and potentially subject to future changes.
Added in version 0.9.0.
This method does two things:

- Releases internal references to the user-provided operand, so that this instance no longer contributes to its reference count.
- Frees any internal copies (mirrors) that were created when the user-provided operand resides in a different memory space than the execution (i.e., copies made during construction or reset_operand()/reset_operand_unchecked(), if present).
This functionality can be useful in memory-constrained scenarios, e.g. where multiple stateful objects need to coexist. Leveraging this functionality, the caller can reduce memory usage while retaining the planned state.
Parameters:
None

Returns:
None
Semantics:

- Preserves the planned state of the stateful object.
- After calling this method, reset_operand() (or reset_operand_unchecked(), if present) must be called to supply a new operand before the next execute() call. Failure to do so will result in a runtime error. Device-side copies will be re-allocated as needed.
- For cross-space scenarios (e.g. a CPU operand with GPU execution, or a GPU operand with CPU execution): execution is guaranteed to always be blocking, so execute() does not return until all computation is complete. It is therefore always safe to call this method after calling execute() without additional synchronization.
- When the operand is in the same memory space as the execution (e.g. a GPU operand with GPU execution): this method drops this instance's internal reference to the user-provided operand. If the reference count of the operand reaches zero, its memory may be freed, so particular attention should be paid. The caller is responsible for ensuring that any such deallocation is ordered after pending computation (e.g. by retaining a reference until the computation is complete, or by synchronizing the stream). Failure to do so is analogous to a use-after-free.
See Overview, Stateful APIs: Design and Usage Patterns for operand lifecycle and usage patterns, and Stream Semantics for stream ordering rules.
reset_operand()[source]#

Reset the operand held by this FFT instance to a new compatible operand for subsequent execution.

Parameters:
operand –
A tensor (ndarray-like object) compatible with the previous one. The new operand is considered compatible if all the following properties match with the previous one:
The operand data type.
The package that the new operand belongs to.
The memory space of the new operand (CPU or GPU).
The device that the new operand belongs to if it is on GPU.
The operand shape must be consistent with the specified distribution (see below).
distribution – Specifies the distribution of the input and output operands across processes, which can be: (i) a Slab distribution (see nvmath.distributed.distribution.Slab), or (ii) a custom box distribution (see nvmath.distributed.distribution.Box). With a Slab distribution, this indicates the distribution of the input operand (the output operand will use the complementary Slab distribution). With a box distribution, this indicates the input and output boxes. This argument is required. The distribution must be compatible with the one used at plan time:

- If the FFT was planned using a Slab distribution, the reset distribution must also be a Slab distribution. For C2C transforms, both Slab.X and Slab.Y are valid regardless of the slab axis at plan time. For R2C and C2R transforms, the distribution must be the same as at plan time.
- If the FFT was planned using a box distribution, the reset distribution must use the same (input_box, output_box) pair specified at plan time (the order may be swapped).
- If reshape=True was specified in the options, the distribution cannot be changed.
stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used. See Stream Semantics for more details on stream handling.
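The distribution-compatibility rules above can be sketched as a small validator (illustrative only; the real library performs its own checks). Here Slab distributions are modeled as the strings "X"/"Y", and a box distribution as an (input_box, output_box) tuple:

```python
def reset_distribution_ok(plan_dist, new_dist, transform="C2C", reshape=False):
    """Illustrative check of the documented reset_operand() distribution
    rules (not an nvmath-python API). Slab distributions are "X"/"Y";
    box distributions are (input_box, output_box) tuples."""
    if reshape:
        # With reshape=True the distribution cannot be changed.
        return new_dist == plan_dist
    if plan_dist in ("X", "Y"):
        if new_dist not in ("X", "Y"):
            return False
        if transform == "C2C":
            return True                # either slab axis is valid
        return new_dist == plan_dist   # R2C/C2R: must match plan time
    # Box distribution: same pair, order may be swapped.
    return new_dist == plan_dist or new_dist == (plan_dist[1], plan_dist[0])

# C2C may switch slab axes; R2C may not; box pairs may be swapped.
assert reset_distribution_ok("X", "Y", transform="C2C")
assert not reset_distribution_ok("X", "Y", transform="R2C")
assert reset_distribution_ok((("a",), ("b",)), (("b",), ("a",)))
```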
Examples
>>> import cupy as cp
>>> import nvmath.distributed
Get process group used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the FFT examples in nvmath/examples/distributed/fft):
>>> process_group = nvmath.distributed.get_context().process_group
>>> nranks = process_group.nranks
Create a 3-D complex128 ndarray on GPU symmetric memory, distributed according to the Slab distribution on the X axis (the global shape is (128, 128, 128)):
>>> from nvmath.distributed.distribution import Slab
>>> shape = 128 // nranks, 128, 128
>>> dtype = cp.complex128
>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=dtype)
>>> a[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
Create an FFT object and use it as a context manager:
>>> with nvmath.distributed.fft.FFT(a, distribution=Slab.X) as f:
...     # Plan the FFT.
...     f.plan()
...
...     # Execute the FFT to get the first result.
...     r1 = f.execute()
...
...     # Reset the operand to a new CuPy ndarray.
...     b = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=dtype)
...     b[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
...     f.reset_operand(b, distribution=Slab.X)
...
...     # Execute to get the new result corresponding to the updated operand.
...     r2 = f.execute()
With reset_operand(), minimal overhead is achieved because the problem specification and planning are only performed once.

For the particular example above, explicitly calling reset_operand() is equivalent to updating the operand in-place, i.e., replacing f.reset_operand(b, distribution=Slab.X) with a[:] = b. Note that updating the operand in-place should be adopted with caution, as it can only yield the expected result and incur no additional copies under the following additional constraint:

- The operand's distribution is the same.

For more details, please refer to the inplace update example.
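The in-place alternative can be demonstrated with plain NumPy (single-process, no distribution involved): slice assignment a[:] = b copies values into a's existing buffer rather than rebinding the name:

```python
import numpy as np

a = np.zeros(4)
buf_before = a.__array_interface__["data"][0]   # address of a's buffer

b = np.arange(4.0)
a[:] = b                                        # copy b's values into a's buffer

buf_after = a.__array_interface__["data"][0]
print(buf_before == buf_after)  # True: same underlying memory
print(a.tolist())               # [0.0, 1.0, 2.0, 3.0]
```

This is why, in the symmetric-memory setting, the in-place update avoids any reallocation: the data lands in the buffer the FFT object already knows about.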
reset_operand_unchecked()[source]#
This method is experimental and potentially subject to future changes.
Added in version 0.9.0.
This method is a performance-optimized alternative to reset_operand() that eliminates validation and logging overhead, making it ideal for performance-critical loops where operand compatibility is guaranteed by the caller.

This method accepts the same parameters as reset_operand().

Semantics:

The semantics are the same as in reset_operand(), except that this method does not perform any validation (e.g. package match, data type match, distribution validation, etc.) or logging.

When to Use:

- Performance-critical loops with repeated executions on different operands.
- After verifying correctness with reset_operand() during development.
- When operand compatibility is guaranteed by construction or an invariant.