Distributed runtime#

Initializing the distributed runtime#

To use the distributed APIs, you must first initialize the distributed runtime. This is done by having each process provide a local CUDA device ID (referring to a GPU on the host on which that process runs), the process group, and the desired communication backends. For example:

import nvmath.distributed
from nvmath.distributed import MPIProcessGroup
from mpi4py import MPI

process_group = MPIProcessGroup(MPI.COMM_WORLD)  # can use any MPI communicator
device_id = ...  # local CUDA device ID: a GPU on the host this process runs on
nvmath.distributed.initialize(device_id, process_group, backends=["nvshmem", "nccl"])

The process group specifies the set of processes that will participate in subsequent calls to distributed APIs. The process group type is tied to the bootstrapping method (e.g. MPI or torch.distributed).

Note

nvmath-python supports both MPI and torch.distributed for bootstrapping and setup. Additionally, developers can provide their own implementation of nvmath.distributed.ProcessGroup to add support for new bootstrapping schemes.

Important

The bootstrapping method is only used for initialization and setup, not for compute.

Tip

Distributed FFT requires the NVSHMEM backend.

Distributed matrix multiplication requires the NCCL backend.

After initializing the distributed runtime, you may use the distributed APIs. Certain APIs such as FFT and Reshape require GPU operands to be allocated on the NVSHMEM symmetric memory heap. Refer to Distributed API Utilities for examples and details on managing GPU operands in symmetric memory.
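As a hedged sketch of allocating such an operand, the helper below assumes the `allocate_symmetric_memory` utility described in the Distributed API Utilities section; its exact signature, and the use of CuPy as the array package, are assumptions for illustration:

```python
def make_fft_operand(shape):
    # Hedged sketch: allocate a GPU operand on the NVSHMEM symmetric heap,
    # as required by distributed FFT and Reshape. The allocate_symmetric_memory
    # signature shown here is an assumption; see Distributed API Utilities.
    import cupy as cp
    import nvmath.distributed

    # Allocated on the symmetric heap rather than with a regular cupy call.
    a = nvmath.distributed.allocate_symmetric_memory(shape, cp, dtype=cp.complex64)
    a[:] = cp.random.rand(*shape) + 1j * cp.random.rand(*shape)
    return a
```

The function only runs on a process where the distributed runtime has been initialized with the NVSHMEM backend.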

Initialize with MPI process group#

An nvmath.distributed.MPIProcessGroup specifies a set of processes that were launched using MPI (e.g. with mpiexec). You can construct an MPIProcessGroup from any mpi4py communicator, and provide it to nvmath.distributed.initialize().
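Since any mpi4py communicator is accepted, a subset of ranks can form its own process group. The sketch below (the `Split` idiom is standard MPI; the wrapper function name is illustrative) builds an MPIProcessGroup from a sub-communicator:

```python
def make_split_process_group(color: int):
    # Split COMM_WORLD into disjoint sub-communicators by "color"; ranks
    # with the same color end up in the same communicator. Any mpi4py
    # communicator, including such a sub-communicator, can back an
    # MPIProcessGroup, restricting distributed APIs to that subset of ranks.
    from mpi4py import MPI
    from nvmath.distributed import MPIProcessGroup

    comm = MPI.COMM_WORLD
    sub_comm = comm.Split(color=color, key=comm.Get_rank())
    return MPIProcessGroup(sub_comm)
```

The returned process group is then passed to nvmath.distributed.initialize() on every rank of that sub-communicator.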

Initialize with torch.distributed process group#

An nvmath.distributed.TorchProcessGroup specifies a set of processes that communicate using torch.distributed (e.g. launched with torchrun).

You can construct a TorchProcessGroup by providing a torch.distributed process group handle, or None to use the default PyTorch process group. The resulting TorchProcessGroup can then be passed to nvmath.distributed.initialize().

Note

If the torch.distributed process group internally uses a GPU communication backend (such as NCCL), you must pass the device ID used by that backend on this process when creating the TorchProcessGroup.
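Putting the above together, here is a hedged sketch of torch.distributed-based initialization. It assumes a torchrun launch (which sets the LOCAL_RANK environment variable, mapped here one-to-one to a CUDA device) and assumes TorchProcessGroup accepts only the device_id keyword when the default PyTorch process group is used; check the API reference below for the exact signature:

```python
import os


def local_device_id() -> int:
    # Assumption: launched with torchrun, which sets LOCAL_RANK per process;
    # we map each local rank to one CUDA device.
    return int(os.environ.get("LOCAL_RANK", "0"))


def init_distributed_runtime():
    # Hedged sketch of runtime initialization with torch.distributed.
    import torch.distributed as dist
    import nvmath.distributed
    from nvmath.distributed import TorchProcessGroup

    device_id = local_device_id()
    # NCCL is a GPU backend, so device_id must match the device it uses
    # on this process (see the note above).
    dist.init_process_group(backend="nccl")
    # No group handle passed: the default PyTorch process group is used.
    process_group = TorchProcessGroup(device_id=device_id)
    nvmath.distributed.initialize(device_id, process_group, backends=["nccl"])
```

Each process launched by torchrun would call init_distributed_runtime() once before using any distributed API.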

API Reference#

initialize(device_id, process_group, backends)

Initialize nvmath.distributed runtime.

finalize()

Finalize nvmath.distributed runtime (this is called automatically at exit if the runtime is initialized).

ProcessGroup()

A ProcessGroup represents a set of processes collectively running nvmath.distributed operations.

MPIProcessGroup(mpi_comm)

ProcessGroup implemented on mpi4py.

TorchProcessGroup(*, device_id[, ...])

ProcessGroup implemented on torch.distributed.

get_context()

Return the distributed runtime's context or None if not initialized.

DistributedContext(device_id, process_group, ...)

Context of initialized nvmath.distributed runtime.
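As a small usage sketch of get_context() from the reference above (the helper name is illustrative), a library layered on nvmath.distributed can check whether the runtime is initialized before calling distributed APIs:

```python
def runtime_is_initialized() -> bool:
    # get_context() returns the DistributedContext after initialize() has
    # been called, and None otherwise.
    import nvmath.distributed

    return nvmath.distributed.get_context() is not None
```

When the context is available, it carries the device_id and process_group the runtime was initialized with.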