FFT
class nvmath.distributed.fft.FFT(operand, distribution, *, options=None, stream=None)
Create a stateful object that encapsulates the specified distributed FFT computations and required resources. This object ensures the validity of resources during use and releases them when they are no longer needed to prevent misuse.
This object encompasses all functionalities of the function-form APIs fft() and ifft(), which are convenience wrappers around it. The stateful object also allows for the amortization of preparatory costs when the same FFT operation is to be performed on multiple operands with the same problem specification (see reset_operand() for more details).
Using the stateful object typically involves the following steps:
Problem Specification: Initialize the object with a defined operation and options.
Preparation: Use plan() to determine the best algorithmic implementation for this specific FFT operation.
Execution: Perform the FFT computation with execute(), which can be either a forward or an inverse FFT transformation.
Resource Management: Ensure all resources are released either by explicitly calling free() or by managing the stateful object within a context manager.
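The guarantee behind the last step can be sketched without a GPU. The block below uses a hypothetical stand-in class, `StandInFFT` (not part of nvmath), whose `plan()`, `execute()`, and `free()` only mirror the method names above. The rule that `execute()` fails before `plan()` is an assumption of the sketch. It illustrates why a context manager is a convenient alternative to an explicit `free()` call: the release still happens when a step raises.

```python
class StandInFFT:
    """Hypothetical stand-in for a stateful FFT object; not the nvmath class."""

    def __init__(self, operand):
        self.operand = operand
        self.planned = False
        self.freed = False

    def plan(self):
        self.planned = True

    def execute(self):
        # Assumed ordering rules for this sketch only.
        if self.freed:
            raise RuntimeError("execute() called after free()")
        if not self.planned:
            raise RuntimeError("execute() called before plan()")
        return self.operand

    def free(self):
        # Idempotent release, so calling it twice is harmless.
        self.freed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.free()
        return False  # Do not swallow exceptions.


def run(operand, skip_plan=False):
    """Run preparation and execution inside a `with` block."""
    f = StandInFFT(operand)
    try:
        with f:
            if not skip_plan:
                f.plan()
            return f.execute(), f
    except RuntimeError:
        return None, f


result, f_ok = run([1, 2, 3])
failed, f_bad = run([1, 2, 3], skip_plan=True)
print(f_ok.freed, f_bad.freed)  # True True
```

In both runs the stand-in is released on leaving the `with` block, including the run in which `execute()` raised.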
Detailed information on each step described above can be obtained by passing in a logging.Logger object to FFTOptions or by setting the appropriate options in the root logger object, which is used by default:
>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )
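As a minimal sketch of the first alternative, a dedicated `logging.Logger` can be configured with the standard library as below. The logger name `fft_demo` and the in-memory handler are illustrative choices. Passing the logger to `FFTOptions` through a `logger` field is an assumption here and appears only as a comment, so check the `FFTOptions` reference for the exact field name.

```python
import io
import logging

# A dedicated logger keeps FFT messages separate from the root logger.
logger = logging.getLogger("fft_demo")  # illustrative name
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

buffer = io.StringIO()  # in-memory sink; logging.StreamHandler() writes to stderr
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)-8s %(message)s"))
logger.addHandler(handler)

logger.info("planning started")
logger.debug("this message is filtered out at INFO level")

captured = buffer.getvalue()
print(captured, end="")

# Assumed usage (not executed here): hand the logger to the FFT options, e.g.
# options = {"logger": logger}
```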
- Parameters:
operand – A tensor (ndarray-like object). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.
Important
GPU operands must be on the symmetric heap (for example, allocated with nvmath.distributed.allocate_symmetric_memory()).
distribution – Specifies the distribution of input and output operands across processes, which can be: (i) according to a Slab distribution (see Slab), or (ii) a custom box distribution. With Slab distribution, this indicates the distribution of the input operand (the output operand will use the complementary Slab distribution). With box distribution, this indicates the input and output boxes.
options – Specify options for the FFT as a FFTOptions object. Alternatively, a dict containing the parameters for the FFTOptions constructor can also be provided. If not specified, the value will be set to the default-constructed FFTOptions object.
stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
Examples
>>> import cupy as cp
>>> import nvmath.distributed
Get MPI communicator used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the FFT examples in nvmath/examples/distributed/fft):
>>> comm = nvmath.distributed.get_context().communicator
Get the number of processes:
>>> nranks = comm.Get_size()
Create a 3-D complex128 ndarray on GPU symmetric memory, distributed according to the Slab distribution on the X axis (the global shape is (128, 128, 128)):
>>> shape = 128 // nranks, 128, 128
cuFFTMp uses the NVSHMEM PGAS model for distributed computation, which requires GPU operands to be on the symmetric heap:
>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=cp.complex128)
After allocating, we initialize the CuPy ndarray’s memory:
>>> a[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
We will define a 3-D C2C FFT operation, creating an FFT object encapsulating the above problem specification. Each process provides its own local operand (which is part of the PGAS space, but otherwise can be operated on as any other CuPy ndarray for local operations) and specifies how the operand is distributed across processes:
>>> f = nvmath.distributed.fft.FFT(a, distribution=nvmath.distributed.fft.Slab.X)
More information on the distribution of operands can be found in the documentation.
Options can be provided above to control the behavior of the operation using the options argument (see FFTOptions).
Next, plan the FFT:
>>> f.plan()
Now execute the FFT, and obtain the result r1 as a CuPy ndarray. Note that distributed FFT computations are in-place, so operands a and r1 share the same symmetric memory buffer:
>>> r1 = f.execute()
Finally, free the FFT object’s resources. To avoid this explicit call, it’s recommended to use the FFT object as a context manager as shown below, if possible.
>>> f.free()
Any symmetric memory that is owned by the user must be deleted explicitly (this is a collective call and must be called by all processes). Note that because operands a and r1 share the same buffer, only one of them must be freed:
>>> nvmath.distributed.free_symmetric_memory(a)
Note that all FFT methods execute on the current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.
Let’s now look at the same problem with NumPy ndarrays on the CPU.
Create a 3-D complex128 NumPy ndarray on the CPU:
>>> import numpy as np
>>> shape = 128 // nranks, 128, 128
>>> a = np.random.rand(*shape) + 1j * np.random.rand(*shape)
Create an FFT object encapsulating the problem specification described earlier and use it as a context manager.
>>> with nvmath.distributed.fft.FFT(a, distribution=nvmath.distributed.fft.Slab.X) as f:
...     f.plan()
...
...     # Execute the FFT to get the first result.
...     r1 = f.execute()
All the resources used by the object are released at the end of the block.
The operation was performed on the GPU, with the NumPy array temporarily copied to GPU symmetric memory and transformed on the GPU.
Further examples can be found in the nvmath/examples/distributed/fft directory.
Methods
- __init__(operand, distribution: Slab | Sequence[Sequence[Sequence[int]]], *, options: FFTOptions | None = None, stream=None)
- execute(*, direction=None, stream=None, release_workspace=False, sync_symmetric_memory: bool = True)
Execute the FFT operation.
- Parameters:
direction – Specify whether forward or inverse FFT is performed (FFTDirection object, or as a string from [‘forward’, ‘inverse’], or as an int from [-1, 1] denoting forward and inverse directions respectively).
stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
release_workspace – A value of True specifies that the FFT object should release workspace memory back to the symmetric memory pool on function return, while a value of False specifies that the object should retain the memory. This option may be set to True if the application performs other operations that consume a lot of memory between successive calls to the (same or different) execute() API, but incurs an overhead due to obtaining and releasing workspace memory from and to the symmetric memory pool on every call. The default is False. NOTE: All processes must use the same value or the application can deadlock.
sync_symmetric_memory – Indicates whether to issue a symmetric memory synchronization operation on the execute stream before the FFT. Note that before the FFT starts executing, it is required that the input operand be ready on all processes. A symmetric memory synchronization ensures completion and visibility by all processes of previously issued local stores to symmetric memory. Advanced users who choose to manage the synchronization on their own using the appropriate NVSHMEM API, or who know that GPUs are already synchronized on the source operand, can set this to False.
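The three accepted spellings of `direction` can be illustrated with a small normalizer. This is a sketch only: `Direction` and `normalize_direction` are hypothetical stand-ins written for this example (they are not nvmath APIs), and the only fact taken from the description above is that -1 denotes forward and 1 denotes inverse.

```python
from enum import IntEnum


class Direction(IntEnum):
    """Hypothetical stand-in for an FFT direction enum (not the nvmath type)."""

    FORWARD = -1
    INVERSE = 1


def normalize_direction(direction):
    """Map an enum member, a string, or an int to a Direction member."""
    if isinstance(direction, str):
        try:
            return Direction[direction.upper()]
        except KeyError:
            raise ValueError(f"unknown direction string: {direction!r}") from None
    # Covers both Direction members and plain ints; raises ValueError otherwise.
    return Direction(direction)


print(normalize_direction("forward") is Direction.FORWARD)
print(normalize_direction(1) is Direction.INVERSE)
```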
- Returns:
The transformed operand, which remains on the same device and utilizes the same package as the input operand. The data type and shape of the transformed operand depend on the type of input operand, and choice of distribution and reshape option:
For C2C FFT, the data type remains identical to the input.
For slab distribution with reshape=True, the shape will remain identical.
For slab distribution with reshape=False, the shape will be the converse slab shape.
For custom box distribution, the shape will depend on the output box of each process.
For GPU operands, the result will be in symmetric memory and the user is responsible for explicitly deallocating it (for example, using nvmath.distributed.free_symmetric_memory(tensor)).
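The shape rules above can be made concrete with plain integer arithmetic. The helpers below are a sketch, not nvmath code: `slab_local_shape`, `slab_result_shape`, and `box_shape` are hypothetical names, the global extent on the partitioned axis is assumed to divide evenly by the number of processes, and a box is assumed to be given as a pair of lower and exclusive upper corner coordinates.

```python
def slab_local_shape(global_shape, nranks, axis):
    """Local shape when `axis` (0 for X, 1 for Y) is split evenly across ranks."""
    if global_shape[axis] % nranks != 0:
        raise ValueError("this sketch assumes an evenly divisible extent")
    local = list(global_shape)
    local[axis] //= nranks
    return tuple(local)


def slab_result_shape(global_shape, nranks, input_axis, reshape):
    """Local result shape for a Slab-distributed input partitioned on `input_axis`."""
    if reshape:
        # With reshape=True the shape remains identical to the input.
        return slab_local_shape(global_shape, nranks, input_axis)
    # With reshape=False the result has the converse slab shape (X <-> Y).
    return slab_local_shape(global_shape, nranks, 1 - input_axis)


def box_shape(box):
    """Local shape of a box given as (lower, upper) with an exclusive upper corner."""
    lower, upper = box
    return tuple(u - l for l, u in zip(lower, upper))


global_shape = (128, 128, 128)
print(slab_local_shape(global_shape, 4, axis=0))                        # (32, 128, 128)
print(slab_result_shape(global_shape, 4, input_axis=0, reshape=False))  # (128, 32, 128)
print(box_shape(((0, 0, 0), (64, 128, 128))))                           # (64, 128, 128)
```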
- free()
Free FFT resources.
It is recommended that the FFT object be used within a context, but if it is not possible then this method must be called explicitly to ensure that the FFT resources (especially internal library objects) are properly cleaned up.
- plan(*, stream: AnyStream | None = None)
Plan the FFT.
- Parameters:
stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
- reset_operand( )
Reset the operand held by this FFT instance. This method has two use cases:
It can be used to provide a new operand for execution.
It can be used to release the internal reference to the previous operand and potentially make its memory available for other use by passing operand=None.
- Parameters:
operand – A tensor (ndarray-like object) compatible with the previous one or None (default). A value of None will release the internal reference to the previous operand, and the user is expected to set a new operand before calling execute() again. The new operand is considered compatible if all the following properties match those of the previous one:
The operand distribution: (a) if the FFT was planned using a Slab distribution, the reset operand must also use a Slab distribution (both X and Y axes are valid regardless of the slab axis at plan time), (b) if the FFT was planned using a box distribution, the distribution of the reset operand must be (input_box, output_box) or (output_box, input_box) where input_box and output_box are the boxes specified at plan time.
The operand data type.
The package that the new operand belongs to.
The memory space of the new operand (CPU or GPU).
The device that the new operand belongs to, if it is on GPU.
distribution – Specifies the distribution of input and output operands across processes, which can be: (i) according to a Slab distribution (see Slab), or (ii) a custom box distribution. With Slab distribution, this indicates the distribution of the input operand (the output operand will use the complementary Slab distribution). With box distribution, this indicates the input and output boxes.
stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.
Examples
>>> import cupy as cp
>>> import nvmath.distributed
Get MPI communicator used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the FFT examples in nvmath/examples/distributed/fft):
>>> comm = nvmath.distributed.get_context().communicator
>>> nranks = comm.Get_size()
Create a 3-D complex128 ndarray on GPU symmetric memory, distributed according to the Slab distribution on the X axis (the global shape is (128, 128, 128)):
>>> shape = 128 // nranks, 128, 128
>>> dtype = cp.complex128
>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=dtype)
>>> a[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
Create an FFT object as a context manager:
>>> with nvmath.distributed.fft.FFT(a, nvmath.distributed.fft.Slab.X) as f:
...     # Plan the FFT
...     f.plan()
...
...     # Execute the FFT to get the first result.
...     r1 = f.execute()
...
...     # Reset the operand to a new CuPy ndarray.
...     b = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=dtype)
...     b[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
...     f.reset_operand(b)
...
...     # Execute to get the new result corresponding to the updated operand.
...     r2 = f.execute()
With reset_operand(), minimal overhead is achieved as problem specification and planning are only performed once.
For the particular example above, explicitly calling reset_operand() is equivalent to updating the operand in-place, i.e., replacing f.reset_operand(b) with a[:]=b. Note that updating the operand in-place should be adopted with caution as it can only yield the expected result and incur no additional copies under the additional constraints below:
The operand’s distribution is the same.
For more details, please refer to inplace update example.
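The difference between updating an operand in-place and merely rebinding a name can be sketched with the standard library. This is purely an illustration of Python semantics: `array.array` stands in for an ndarray, and `Holder` is a hypothetical class assumed to keep a reference to the buffer it was given. Neither is part of nvmath.

```python
from array import array


class Holder:
    """Hypothetical object that keeps a reference to its operand buffer."""

    def __init__(self, operand):
        self.operand = operand


a = array("d", [1.0, 2.0, 3.0])
b = array("d", [4.0, 5.0, 6.0])
holder = Holder(a)

# In-place update: the buffer the holder refers to now contains b's values.
a[:] = b
print(holder.operand.tolist())  # [4.0, 5.0, 6.0]

# Rebinding: only the local name changes; the holder still sees the old buffer.
a = array("d", [7.0, 8.0, 9.0])
print(holder.operand.tolist())  # [4.0, 5.0, 6.0]
```

Under these assumptions, a slice assignment such as `a[:] = b` changes what the holder sees, while a plain assignment to `a` does not.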