Distributed Linear Algebra#

Overview#

The distributed Linear Algebra module nvmath.distributed.linalg.advanced in nvmath-python leverages the NVIDIA cuBLASMp library and provides a powerful suite of APIs that can be directly called from the host to efficiently perform matrix multiplications on multi-node multi-GPU systems at scale. Both stateless function-form APIs and stateful class-form APIs are provided.

The distributed matrix multiplication APIs are similar to their non-distributed host API counterparts, with some key differences:

  • The operands passed to the API on each process are the local partitions of the global operands, and the user specifies the distribution (how the data is partitioned across processes). The APIs natively support the block-cyclic distribution (see Block distributions).

  • The APIs optionally support GPU operands on symmetric memory. Refer to Distributed API Utilities for examples and details of how to manage symmetric memory GPU operands.

Operand distribution#

To perform a distributed operation, you first have to specify how each operand is distributed across processes. Distributed matrix multiplication natively supports the block-cyclic distribution (see Block distributions), so you must provide a distribution that is compatible with block-cyclic. Compatible distributions include BlockCyclic, BlockNonCyclic, and Slab (with uniform partition sizes).
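For example, a Slab distribution partitions the global matrix along a single dimension. A minimal sketch of querying the local partition shape on each process (Slab.X and its shape helper also appear in the example at the end of this page; rank is the process rank obtained from the MPI communicator):

from nvmath.distributed.distribution import Slab

# Global matrix of shape (m, k), partitioned along the first (row) dimension.
m, k = 128, 1024

# Shape of the block owned by this process; rank is this process's rank.
local_shape = Slab.X.shape(rank, (m, k))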

Memory layout#

cuBLASMp requires operands to use Fortran-order memory layout, while Python libraries such as NumPy and PyTorch use C-order by default. See Distribution, memory layout and transpose for guidelines on memory layout conversion for distributed operands and potential implications on distribution.
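For example, transposing a C-ordered array yields a Fortran-ordered view without copying data; a minimal NumPy sketch (the logical matrix becomes the transpose, so the distribution and the transpose qualifier must be adjusted, as discussed in the sections below):

import numpy as np

# Local partition in C order (the NumPy default).
a_local = np.random.rand(64, 1024)

# The transposed view is Fortran-ordered and shares the same memory.
# The logical matrix is now the transpose, so remember to adjust the
# distribution and set the transpose qualifier accordingly.
a_f = a_local.T
assert a_f.flags.f_contiguous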

Matrix qualifiers#

Matrix qualifiers are used to indicate whether an input matrix is transposed or not.

For example, for A.T @ B you have to specify:

import numpy as np

from nvmath.distributed.linalg.advanced import matrix_qualifiers_dtype, matmul

qualifiers = np.zeros((3,), dtype=matrix_qualifiers_dtype)
qualifiers[0]["is_transpose"] = True  # a is transposed
qualifiers[1]["is_transpose"] = False  # b is not transposed (False is the default, so this is optional)

...

result = matmul(a, b, distributions=distributions, qualifiers=qualifiers)

Caution

A common strategy to convert memory layout to Fortran-order (required by cuBLASMp) is to transpose the input matrices, as explained in Distribution, memory layout and transpose. Remember to set the matrix qualifiers accordingly.

Distributed algorithm#

cuBLASMp implements efficient communication-overlap algorithms suited for distributed machine learning scenarios with tensor parallelism. Algorithms include AllGather+GEMM and GEMM+ReduceScatter. These algorithms have special requirements on how each operand is distributed and on its transpose qualifier.

Currently, to use these algorithms, the matrices must be distributed using a 1D partitioning scheme without cycling, and the partition sizes must be uniform (BlockNonCyclic and Slab are valid distributions for this use case).

Please refer to cuBLASMp documentation for full details.
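As a minimal sketch, the following distributions (mirroring the example at the end of this page) satisfy these requirements, assuming the partitioned dimensions divide evenly by the number of processes:

from nvmath.distributed.distribution import Slab

# 1D, non-cyclic partitioning with uniform partition sizes: a and b are
# split along the columns (Slab.Y), c and the result along the rows (Slab.X).
# BlockNonCyclic distributions with uniform block sizes are also valid.
distributions = [Slab.Y, Slab.Y, Slab.X]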

Symmetric memory#

Operands may be allocated on the symmetric heap. If so, the result will also be allocated on the symmetric heap.

Tip

Certain distributed matrix multiplication algorithms may perform better when the operands are on symmetric memory.

Important

Any memory on the symmetric heap that is owned by the user (including the distributed Matmul result) must be deleted explicitly using free_symmetric_memory(). Refer to Distributed API Utilities for more information.

See example.
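A minimal sketch of the symmetric-memory workflow, assuming the allocation helper from Distributed API Utilities is nvmath.distributed.allocate_symmetric_memory() (the shapes below are placeholders):

import cupy as cp
import nvmath.distributed

# Placeholder local partition sizes for this sketch.
m_local, n, k = 32, 512, 1024

# Allocate GPU operands on the symmetric heap.
a = nvmath.distributed.allocate_symmetric_memory((m_local, k), cp, dtype=cp.float64)
b = nvmath.distributed.allocate_symmetric_memory((k, n), cp, dtype=cp.float64)

# ... fill a and b and perform the distributed matmul; the result will also
# be allocated on the symmetric heap ...

# Any user-owned symmetric memory (including the result) must be freed explicitly.
nvmath.distributed.free_symmetric_memory(a)
nvmath.distributed.free_symmetric_memory(b)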

Example#

The following example performs \(\alpha A @ B + \beta C\) with inputs distributed according to a Slab distribution (partitioning on a single dimension):

Tip

Remember to initialize the distributed context first (see Initializing the distributed runtime) and to select both NVSHMEM and NCCL as communication backends.

import cupy as cp
import numpy as np

import nvmath.distributed
from nvmath.distributed.distribution import Slab
from nvmath.distributed.linalg.advanced import matrix_qualifiers_dtype

# communicator and device_id are available after initializing the
# distributed runtime (see the Tip above).

# Get my process rank from the mpi4py communicator.
rank = communicator.Get_rank()

# The global problem size m, n, k
m, n, k = 128, 512, 1024

# Prepare sample input data.
with cp.cuda.Device(device_id):
    a = cp.random.rand(*Slab.X.shape(rank, (m, k)))
    b = cp.random.rand(*Slab.X.shape(rank, (n, k)))
    c = cp.random.rand(*Slab.Y.shape(rank, (n, m)))

# Get transposed views with Fortran-order memory layout
a = a.T  # a is now (k, m) with Slab.Y
b = b.T  # b is now (k, n) with Slab.Y
c = c.T  # c is now (m, n) with Slab.X

distributions = [Slab.Y, Slab.Y, Slab.X]

qualifiers = np.zeros((3,), dtype=matrix_qualifiers_dtype)
qualifiers[0]["is_transpose"] = True  # a is transposed

alpha = 0.45
beta = 0.67

# Perform the distributed GEMM.
result = nvmath.distributed.linalg.advanced.matmul(
    a,
    b,
    c=c,
    alpha=alpha,
    beta=beta,
    distributions=distributions,
    qualifiers=qualifiers,
)

# Synchronize the default stream, since by default the execution
# is non-blocking for GPU operands.
cp.cuda.get_current_stream().synchronize()

# result is distributed row-wise
assert result.shape == Slab.X.shape(rank, (m, n))

You can find many more examples here.
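The example above uses the stateless matmul function. For repeated executions of the same problem specification, the stateful Matmul class can be used instead. Below is a minimal sketch, assuming it follows the same plan/execute, context-manager pattern as the other stateful APIs in nvmath-python, and reusing the operands, distributions, and qualifiers from the example above:

import cupy as cp
from nvmath.distributed.linalg.advanced import Matmul

# Stateful form: plan once, then execute.
with Matmul(
    a,
    b,
    c=c,
    alpha=alpha,
    beta=beta,
    distributions=distributions,
    qualifiers=qualifiers,
) as mm:
    mm.plan()
    result = mm.execute()

# Execution is non-blocking for GPU operands by default.
cp.cuda.get_current_stream().synchronize()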

API Reference#

Distributed Linear Algebra APIs (nvmath.distributed.linalg.advanced)#

matmul(a, b, /[, c, alpha, beta, epilog, ...])

Perform the specified distributed matrix multiplication computation \(F(\alpha a @ b + \beta c)\), where \(F\) is the epilog.

matrix_qualifiers_dtype

NumPy dtype object that encapsulates the matrix qualifiers in distributed.linalg.advanced.

Matmul(a, b, /[, c, alpha, beta, ...])

Create a stateful object encapsulating the specified distributed matrix multiplication computation \(\alpha a @ b + \beta c\) and the required resources to perform the operation.

MatmulComputeType

alias of ComputeType

MatmulEpilog

alias of MatmulEpilogue

MatmulAlgoType(value[, names, module, ...])

See cublasMpMatmulAlgoType_t.

MatmulOptions([compute_type, scale_type, ...])

A data class for providing options to the Matmul object and the wrapper function matmul().