reshape#

nvmath.distributed.reshape.reshape(
operand,
/,
input_box,
output_box,
*,
sync_symmetric_memory: bool = True,
options=None,
stream=None,
)[source]#


Perform a distributed reshape on the provided operand to change its distribution across processes.

Parameters:
  • operand

    A tensor (ndarray-like object). The currently supported types are numpy.ndarray, cupy.ndarray, and torch.Tensor.

    Important

    GPU operands must be on the symmetric heap (for example, allocated with nvmath.distributed.allocate_symmetric_memory()).

  • input_box – The box specifying the distribution of the input operand across processes, where each process specifies which portion of the global array it holds. A box is a pair of coordinates specifying the lower and upper extent for each dimension.

  • output_box – The box specifying the distribution of the result across processes, where each process specifies which portion of the global array it will hold after reshaping. A box is a pair of coordinates specifying the lower and upper extent for each dimension.

  • sync_symmetric_memory – Indicates whether to issue a symmetric memory synchronization operation on the execute stream before the reshape operation. Note that before the Reshape starts executing, it is required that the source operand be ready on all processes. A symmetric memory synchronization ensures completion and visibility by all processes of previously issued local stores to symmetric memory. Advanced users who choose to manage the synchronization on their own using the appropriate NVSHMEM API, or who know that GPUs are already synchronized on the source operand, can set this to False.

  • options – Specify options for the Reshape as a ReshapeOptions object. Alternatively, a dict containing the parameters for the ReshapeOptions constructor can also be provided. If not specified, the value will be set to the default-constructed ReshapeOptions object.

  • stream – Provide the CUDA stream to use for executing the operation. Acceptable inputs include cudaStream_t (as Python int), cupy.cuda.Stream, and torch.cuda.Stream. If a stream is not provided, the current stream from the operand package will be used.

Returns:

A tensor that remains on the same device and belongs to the same package as the input operand, with its shape determined by output_box.

See also

Reshape.

Examples

>>> import cupy as cp
>>> import nvmath.distributed

Get the MPI communicator used to initialize nvmath.distributed (for information on initializing nvmath.distributed, refer to the documentation or the Reshape examples in nvmath/examples/distributed/reshape):

>>> comm = nvmath.distributed.get_context().communicator

Let’s create a 3D floating-point ndarray on the GPU, distributed across a number of processes, with each process holding a portion of the global ndarray. As an example, the portion held by process 0 is a 3D box of shape (4, 4, 4):

>>> shape = 4, 4, 4

Reshape uses the NVSHMEM PGAS model, which requires GPU operands to be on the symmetric heap:

>>> a = nvmath.distributed.allocate_symmetric_memory(shape, cp)
>>> if comm.Get_rank() == 0:
...     a[:] = cp.random.rand(*shape)
... else:
...     a[:] = ...  # fill with this process's section of the global array.

With Reshape, we will change how the ndarray is distributed by having each process specify the input and output section of the global array that it holds. For process 0, let’s assume that it holds the 3D box extending from the lower corner at coordinates (0, 0, 0) to the upper corner at (4, 4, 4).

NOTE: each process has its own input and output boxes, which differ from those of other processes, since each process holds a different section of the global array.

>>> if comm.Get_rank() == 0:
...     input_lower = (0, 0, 0)
...     input_upper = (4, 4, 4)
...     input_box = [input_lower, input_upper]
...     output_box = ...
... else:
...     input_box = ...  # the input box depends on the process.
...     output_box = ...  # the output box depends on the process.

Perform the distributed reshape using reshape():

>>> r = nvmath.distributed.reshape.reshape(a, input_box, output_box)

See ReshapeOptions for the complete list of available options.
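
Options can also be passed as a dict. The following is only an illustrative sketch: it assumes that ReshapeOptions accepts a logger field, as other nvmath options objects do; consult ReshapeOptions for the actual fields.

>>> import logging
>>> r = nvmath.distributed.reshape.reshape(
...     a, input_box, output_box, options={"logger": logging.getLogger("reshape")}
... )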

The current stream of the operand's package is used by default, but a stream can be explicitly provided to the Reshape operation. This is useful, for example, when the Reshape operand is computed on a different stream:

>>> s = cp.cuda.Stream()
>>> with s:
...     a = nvmath.distributed.allocate_symmetric_memory(shape, cp)
...     a[:] = cp.random.rand(*shape)
>>> r = nvmath.distributed.reshape.reshape(a, input_box, output_box, stream=s)

The operation above runs on stream s and is ordered with respect to the input computation.
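
If you manage symmetric memory synchronization yourself with the NVSHMEM API, or know that the GPUs are already synchronized on the source operand (see the sync_symmetric_memory parameter above), the built-in synchronization can be skipped:

>>> r = nvmath.distributed.reshape.reshape(
...     a, input_box, output_box, sync_symmetric_memory=False
... )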

Create a NumPy ndarray on the CPU.

>>> import numpy as np
>>> b = np.random.rand(*shape)

Provide the NumPy ndarray to reshape(), with the result also being a NumPy ndarray:

>>> r = nvmath.distributed.reshape.reshape(b, input_box, output_box)
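
When the GPU operands placed on the symmetric heap in the examples above are no longer needed, they should be freed explicitly. This is a sketch assuming that nvmath.distributed.free_symmetric_memory() is the counterpart of allocate_symmetric_memory():

>>> nvmath.distributed.free_symmetric_memory(a)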

Notes

  • This function is a convenience wrapper around Reshape and is specifically meant for single use. The same computation can be performed with the stateful API, as sketched below.
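
The following is a minimal sketch of the stateful form, assuming the Reshape class follows the usual nvmath stateful pattern (create, plan, execute, with resources released by the context manager):

>>> with nvmath.distributed.reshape.Reshape(a, input_box, output_box) as reshape_obj:
...     reshape_obj.plan()
...     r = reshape_obj.execute()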

Further examples can be found in the nvmath/examples/distributed/reshape directory.