Distributed runtime#
Initializing the distributed runtime#
To use the distributed APIs, you must first initialize the distributed runtime. This is done by having each process provide a local CUDA device ID (referring to a GPU on the host on which that process runs), the process group, and the desired communication backends. For example:
import nvmath.distributed
from nvmath.distributed import MPIProcessGroup
from mpi4py import MPI

# Pick a local GPU for this process; with one process per GPU, the rank
# modulo the number of GPUs per host is a common choice.
device_id = MPI.COMM_WORLD.Get_rank() % 8  # assumes 8 GPUs per host; adjust as needed
process_group = MPIProcessGroup(MPI.COMM_WORLD)  # can use any MPI communicator
nvmath.distributed.initialize(device_id, process_group, backends=["nvshmem", "nccl"])
The process group specifies the set of processes that will participate in subsequent calls to distributed APIs. The process group type is tied to the bootstrapping method (e.g. MPI or torch.distributed).
Note
nvmath-python supports both MPI and torch.distributed for bootstrapping and
setup. Additionally, developers can provide their own implementation of
nvmath.distributed.ProcessGroup to add support for new bootstrapping schemes.
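At its core, a bootstrapping scheme only needs to convey the same information the built-in process groups do: how many processes participate, which rank this process is, and a way to exchange small setup messages between them. The following is a rough, hypothetical sketch of that idea for a trivial one-process group; the class and method names (SingleProcessGroup, nranks, rank, broadcast) are illustrative and are not the actual nvmath.distributed.ProcessGroup interface.

```python
# Hypothetical sketch of the information a bootstrapping-only process group
# conveys. The real base class is nvmath.distributed.ProcessGroup; all names
# below are illustrative, not the actual interface.

class SingleProcessGroup:
    """Trivial process group for a one-process 'cluster' (illustrative only)."""

    @property
    def nranks(self) -> int:
        # Total number of processes participating in distributed calls.
        return 1

    @property
    def rank(self) -> int:
        # This process's index within the group.
        return 0

    def broadcast(self, payload: bytes, root: int = 0) -> bytes:
        # Setup-time exchange of small messages (e.g. backend unique IDs).
        # With a single process, the payload is returned unchanged.
        return payload


pg = SingleProcessGroup()
print(pg.nranks, pg.rank, pg.broadcast(b"uid"))
```

Note that the sketch contains no compute: consistent with the point above, a process group exists purely to identify the participants and bootstrap the communication backends.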
Important
The bootstrapping method is only used for initialization and setup, not for compute.
Tip
Distributed FFT requires the NVSHMEM backend.
Distributed matrix multiplication requires the NCCL backend.
After initializing the distributed runtime you may use the distributed APIs. Certain APIs such as FFT and Reshape require GPU operands to be allocated on the NVSHMEM symmetric memory heap. Refer to Distributed API Utilities for examples and details of how to manage GPU operands on this type of symmetric memory.
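As an example of such an operand, a distributed FFT input might be allocated on the symmetric heap along these lines. This is a sketch, not a definitive recipe: it assumes the runtime was initialized as shown above, that CuPy is used as the array package, and that the allocate_symmetric_memory/free_symmetric_memory helpers behave as described in the Distributed API Utilities section; it must be launched on GPU nodes (e.g. with mpiexec) and will not run standalone.

```python
# Sketch: allocating a GPU operand on the NVSHMEM symmetric memory heap.
# Assumes nvmath.distributed.initialize(...) has already been called on every
# process, and that CuPy is installed on each GPU node.
import cupy as cp
import nvmath.distributed

# Symmetric-memory allocation is collective: every process must call it with
# the same shape and dtype.
a = nvmath.distributed.allocate_symmetric_memory((128, 128), cp, dtype=cp.complex64)
a[:] = cp.random.rand(128, 128)  # fill this process's local block

b = nvmath.distributed.fft.fft(a)  # distributed FFT on a symmetric-memory operand

# Freeing is collective too; both operands live on the symmetric heap.
nvmath.distributed.free_symmetric_memory(a)
nvmath.distributed.free_symmetric_memory(b)
```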
Initialize with MPI process group#
An nvmath.distributed.MPIProcessGroup specifies a set of processes that were
launched using MPI (e.g. with mpiexec). You can construct an MPIProcessGroup from any
mpi4py communicator and provide it to nvmath.distributed.initialize().
Initialize with torch.distributed process group#
An nvmath.distributed.TorchProcessGroup specifies a set of processes that
communicate using torch.distributed (e.g. launched with torchrun).
You can construct a TorchProcessGroup by providing a torch.distributed process
group handle, or None to use the default PyTorch process group. The resulting
TorchProcessGroup can then be passed to nvmath.distributed.initialize().
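Under torchrun, the setup might look roughly as follows. This is a sketch under stated assumptions: it presumes one process per GPU, and the way the device ID is passed to the TorchProcessGroup constructor is an assumption about its exact signature — consult the TorchProcessGroup reference for the precise arguments. It requires a multi-GPU environment and is not runnable standalone.

```python
# Sketch: initializing the distributed runtime under torchrun.
# The TorchProcessGroup constructor arguments shown here (in particular how
# the device ID is supplied) are an assumption, not the confirmed signature.
import torch
import torch.distributed as dist
import nvmath.distributed
from nvmath.distributed import TorchProcessGroup

dist.init_process_group(backend="nccl")  # torchrun sets rank/world-size env vars
device_id = dist.get_rank() % torch.cuda.device_count()  # one process per GPU
torch.cuda.set_device(device_id)

# None selects the default PyTorch process group; since NCCL is a GPU backend,
# the device ID it uses on this process must be provided.
process_group = TorchProcessGroup(None, device_id)
nvmath.distributed.initialize(device_id, process_group, backends=["nvshmem", "nccl"])
```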
Note
If the torch.distributed process group internally uses a GPU communication
backend (such as NCCL), you must provide the device ID used by that backend
on this process when creating the TorchProcessGroup.
API Reference#
| API | Description |
|---|---|
| initialize | Initialize the distributed runtime. |
| finalize | Finalize the distributed runtime. |
| ProcessGroup | A ProcessGroup represents a set of processes collectively running a distributed program. |
| MPIProcessGroup | ProcessGroup implemented on mpi4py. |
| TorchProcessGroup | ProcessGroup implemented on torch.distributed. |
| get_context | Return the distributed runtime's context, or None if not initialized. |
| DistributedRuntimeContext | Context of the initialized distributed runtime. |