Reshape#

class nvmath.distributed.reshape.Reshape(
operand,
/,
input_box,
output_box,
*,
options=None,
stream=None,
)[source]#

Create a stateful object that encapsulates the specified distributed Reshape and required resources. This object ensures the validity of resources during use and releases them when they are no longer needed to prevent misuse.

This object encompasses all functionalities of the function-form API reshape(), which is a convenience wrapper around it. The stateful object also allows for the amortization of preparatory costs when the same Reshape operation is to be performed on multiple operands with the same problem specification (see reset_operand() for more details).

Using the stateful object typically involves the following steps:

  1. Problem Specification: Initialize the object with a defined operation and options.

  2. Preparation: Use plan() to determine the best distributed algorithmic implementation for this specific Reshape operation.

  3. Execution: Perform the Reshape with execute().

  4. Resource Management: Ensure all resources are released either by explicitly calling free() or by managing the stateful object within a context manager.
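
For orientation, here is a minimal sketch of these steps using a context manager (the operand a and the per-process boxes input_box and output_box are assumed to have been prepared as in the full examples below):

>>> with nvmath.distributed.reshape.Reshape(a, input_box, output_box) as r:
...     r.plan()  # preparation
...     b = r.execute()  # execution

All resources held by r are released when the block exits.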

Detailed information on each step described above can be obtained by passing in a logging.Logger object to ReshapeOptions or by setting the appropriate options in the root logger object, which is used by default:

>>> import logging
>>> logging.basicConfig(
...     level=logging.INFO,
...     format="%(asctime)s %(levelname)-8s %(message)s",
...     datefmt="%m-%d %H:%M:%S",
... )
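
For example, messages for a particular Reshape can be routed to a dedicated logger by passing it through the options (the attribute name logger is assumed here; see ReshapeOptions for the exact field):

>>> my_logger = logging.getLogger("reshape_demo")  # hypothetical logger name
>>> options = nvmath.distributed.reshape.ReshapeOptions(logger=my_logger)  # field name assumed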
Parameters:
  • operand

    A tensor (ndarray-like object). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

    Important

    GPU operands must be on the symmetric heap (for example, allocated with nvmath.distributed.allocate_symmetric_memory()).

  • input_box – The box specifying the distribution of the input operand across processes, where each process specifies which portion of the global array it holds. A box is a pair of coordinates specifying the lower and upper extent for each dimension.

  • output_box – The box specifying the distribution of the result across processes, where each process specifies which portion of the global array it will hold after reshaping. A box is a pair of coordinates specifying the lower and upper extent for each dimension.

  • options – Specify options for the Reshape as a ReshapeOptions object. Alternatively, a dict containing the parameters for the ReshapeOptions constructor can also be provided. If not specified, the value will be set to the default-constructed ReshapeOptions object.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

Examples

>>> import cupy as cp
>>> import nvmath.distributed

Get the MPI communicator used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the Reshape examples in nvmath/examples/distributed/reshape):

>>> comm = nvmath.distributed.get_context().communicator

Let’s create a 3D floating-point ndarray on GPU, distributed across a certain number of processes, with each holding a portion of the ndarray. As an example, process 0 holds a 3D box of shape (4, 4, 4) of the global 3D array.

>>> shape = 4, 4, 4

Reshape uses the NVSHMEM PGAS model, which requires GPU operands to be on the symmetric heap:

>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp)
>>> if comm.Get_rank() == 0:
...     a[:] = cp.random.rand(*shape)
... else:
...     a[:] = ...  # each process holds a different section of the global array.

With Reshape, we will change how the ndarray is distributed, by having each process specify the input and output section of the global array. For process 0, let’s assume that it holds the 3D box that goes from the lower corner given by coordinates (0, 0, 0) to the upper corner (4, 4, 4).

NOTE: each process has its own input and output boxes, which differ from those of other processes, since each process holds a different section of the global array.

>>> if comm.Get_rank() == 0:
...     input_lower = (0, 0, 0)
...     input_upper = (4, 4, 4)
...     input_box = [input_lower, input_upper]
...     output_box = ...
... else:
...     input_box = ...  # the input box depends on the process.
...     output_box = ...  # the output box depends on the process.
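
For illustration, consider a hypothetical two-process run where the global array has shape (8, 4, 4), is partitioned along the first axis on input, and should be partitioned along the second axis on output. The boxes would then be:

>>> if comm.Get_rank() == 0:
...     input_box = [(0, 0, 0), (4, 4, 4)]  # first half along axis 0
...     output_box = [(0, 0, 0), (8, 2, 4)]  # first half along axis 1
... else:
...     input_box = [(4, 0, 0), (8, 4, 4)]  # second half along axis 0
...     output_box = [(0, 2, 0), (8, 4, 4)]  # second half along axis 1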

Create a Reshape object encapsulating the problem specification above:

>>> r = nvmath.distributed.reshape.Reshape(a, input_box, output_box)

Options can be provided above to control the behavior of the operation using the options argument (see ReshapeOptions).

Next, plan the Reshape:

>>> r.plan()

Now execute the Reshape, and obtain the result b as a CuPy ndarray. Reshape always performs the distributed operation on the GPU.

>>> b = r.execute()

Finally, free the Reshape object’s resources. To avoid this explicit call, it’s recommended to use the Reshape object as a context manager as shown below, if possible.

>>> r.free()

Any symmetric memory that is owned by the user must be deleted explicitly (this is a collective call and must be called by all processes):

>>> nvmath.distributed.free_symmetric_memory(a, b)

Note that all Reshape methods execute on the current stream by default. Alternatively, the stream argument can be used to run a method on a specified stream.
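
For example, assuming a GPU operand c on the symmetric heap and the same boxes as above, the entire sequence can be issued on a user-created CuPy stream (a sketch):

>>> s = cp.cuda.Stream()
>>> with nvmath.distributed.reshape.Reshape(c, input_box, output_box, stream=s) as r:
...     r.plan(stream=s)
...     d = r.execute(stream=s)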

Let’s now look at the same problem with NumPy ndarrays on the CPU.

>>> import numpy as np
>>> a = np.random.rand(*shape)  # each process holds a different section

Create a Reshape object encapsulating the problem specification described earlier and use it as a context manager.

>>> with nvmath.distributed.reshape.Reshape(a, input_box, output_box) as r:
...     r.plan()
...
...     # Execute the Reshape to redistribute the ndarray.
...     b = r.execute()

All the resources used by the object are released at the end of the block.

Reshape always executes on the GPU. In this case, because a resides in host memory, the NumPy array is temporarily copied to device memory (on the symmetric memory heap), re-distributed on the GPU, and the result is copied to host memory as a NumPy array.

Further examples can be found in the nvmath/examples/distributed/reshape directory.

Methods

__init__(
operand,
/,
input_box,
output_box,
*,
options: ReshapeOptions | None = None,
stream=None,
)[source]#
execute(
stream=None,
release_workspace=False,
sync_symmetric_memory: bool = True,
)[source]#

Execute the Reshape operation.

Parameters:
  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

  • release_workspace – A value of True specifies that the Reshape object should release workspace memory back to the symmetric memory pool on function return, while a value of False specifies that the object should retain the memory. This option may be set to True if the application performs other operations that consume a lot of memory between successive calls to the (same or different) execute() API, but incurs an overhead due to obtaining and releasing workspace memory from and to the symmetric memory pool on every call. The default is False. NOTE: All processes must use the same value or the application can deadlock.

  • sync_symmetric_memory – Indicates whether to issue a symmetric memory synchronization operation on the execute stream before the reshape operation. Note that before the Reshape starts executing, it is required that the source operand be ready on all processes. A symmetric memory synchronization ensures completion and visibility by all processes of previously issued local stores to symmetric memory. Advanced users who choose to manage the synchronization on their own using the appropriate NVSHMEM API, or who know that GPUs are already synchronized on the source operand, can set this to False.

Returns:

The reshaped operand, which remains on the same device and utilizes the same package as the input operand. For GPU operands, the result will be in symmetric memory and the user is responsible for explicitly deallocating it (for example, using nvmath.distributed.free_symmetric_memory(tensor)).
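
As a sketch of how these options might be combined on a planned Reshape object r (the values shown are illustrative):

>>> # All processes must pass the same release_workspace value to avoid deadlock.
>>> b = r.execute(release_workspace=True)
>>> # Skip the symmetric-memory sync only when the source operand is known to be
>>> # ready on all processes (advanced usage).
>>> b = r.execute(sync_symmetric_memory=False)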

free()[source]#

Free Reshape resources.

It is recommended that the Reshape object be used within a context, but if it is not possible then this method must be called explicitly to ensure that the Reshape resources (especially internal library objects) are properly cleaned up.
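
When a context manager cannot be used, a try/finally block is one way to guarantee the call (a sketch, reusing the operand and box names from the examples above):

>>> r = nvmath.distributed.reshape.Reshape(a, input_box, output_box)
>>> try:
...     r.plan()
...     b = r.execute()
... finally:
...     r.free()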

plan(stream=None)[source]#

Plan the Reshape.

Parameters:

stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

reset_operand(operand=None, *, stream=None)[source]#

Reset the operand held by this Reshape instance. This method has two use cases:

  1. it can be used to provide a new operand for execution

  2. it can be used to release the internal reference to the previous operand and potentially make its memory available for other use by passing operand=None.

Parameters:
  • operand

    A tensor (ndarray-like object) compatible with the previous one, or None (default). A value of None releases the internal reference to the previous operand, and the user is expected to provide a new operand before calling execute() again. The new operand is considered compatible if all of the following properties match those of the previous one:

    • The operand distribution, which must be (input_box, output_box) where input_box and output_box are the boxes specified at plan time.

    • The package that the new operand belongs to.

    • The dtype of the new operand.

    • The shape and strides of the new operand.

    • The memory space of the new operand (CPU or GPU).

    • The device that the new operand belongs to, if it is on the GPU.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

Examples

>>> import cupy as cp
>>> import nvmath.distributed

Get the MPI communicator used to initialize nvmath.distributed (for information on initializing nvmath.distributed, you can refer to the documentation or to the Reshape examples in nvmath/examples/distributed/reshape):

>>> comm = nvmath.distributed.get_context().communicator
>>> nranks = comm.Get_size()

Create a 3-D complex128 ndarray on GPU symmetric memory, initially partitioned on the X axis (the global shape is (128, 128, 128)):

>>> shape = 128 // nranks, 128, 128
>>> dtype = cp.complex128
>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=dtype)
>>> a[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)

Compute the input and output box for the desired re-distribution:

>>> input_box = ...
>>> output_box = ...

Create a Reshape object as a context manager:

>>> with nvmath.distributed.reshape.Reshape(a, input_box, output_box) as f:
...     # Plan the Reshape.
...     f.plan()
...
...     # Execute the Reshape to get the first result.
...     r1 = f.execute()
...
...     # Reset the operand to a new CuPy ndarray.
...     b = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=dtype)
...     b[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
...     f.reset_operand(b)
...
...     # Execute to get the new result corresponding to the updated operand.
...     r2 = f.execute()

With reset_operand(), minimal overhead is achieved as problem specification and planning are only performed once.

For the particular example above, explicitly calling reset_operand() is equivalent to updating the operand in-place, i.e., replacing f.reset_operand(b) with a[:] = b. Note that updating the operand in-place should be adopted with caution, as it only yields the expected result and incurs no additional copies under the additional constraint below:

  • The operand’s distribution is the same.

For more details, please refer to the inplace update example.
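
The second use case, releasing the internal reference with operand=None, can be sketched as follows (a sketch shown outside a context manager for brevity; f is assumed to be a live, planned Reshape instance and c a new compatible operand on the symmetric heap):

>>> # Drop the reference to the previous operand so its memory can be reused.
>>> f.reset_operand(None)
>>> # Later, provide a compatible operand before executing again.
>>> f.reset_operand(c)
>>> r3 = f.execute()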