Multi-Process support - cuTENSORMp (Beta)#

cuTENSORMp is a multi-process extension of cuTENSOR for distributed tensor contractions. It coordinates tensor computations across multiple processes while reusing cuTENSOR’s single-process kernels for local work.

The overall programming model closely follows cuTENSOR:

  1. Create and manage a cuTENSORMp handle,

  2. Create (distributed) tensors and operation descriptors,

  3. Create and optimize execution plans using workspace budgets, and

  4. Finally, launch distributed contractions.
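
A minimal host-side sketch of this flow is shown below. Note that the `cutensorMp*` identifiers and signatures here are illustrative assumptions patterned after cuTENSOR's plan-based API, not the verbatim cuTENSORMp interface; consult the cuTENSORMp headers and samples for the exact entry points.

```cpp
// Hypothetical sketch: the cutensorMp* names and signatures below are
// illustrative assumptions patterned after cuTENSOR's plan-based API,
// not the verbatim cuTENSORMp interface. Error handling is omitted.

cutensorMpHandle_t handle;                        // 1. one handle per process
cutensorMpCreate(&handle /*, communicator / rank information */);

cutensorMpTensorDescriptor_t descA, descB, descC; // 2. distributed tensors
/* ...create descA/descB/descC with extents, data types, distribution... */
cutensorMpOperationDescriptor_t op;               //    and the contraction
/* ...create `op` from descA, descB, descC, and the equation string eq... */

uint64_t workspaceBudget = 128ull << 20;          // 3. e.g. 128 MiB per rank
cutensorMpPlan_t plan;
/* ...create and optimize `plan` from `op` under `workspaceBudget`... */

/* 4. collectively launch the distributed contraction on a CUDA stream,
   passing alpha, beta, and device pointers for A, B, and C... */

cutensorMpDestroy(handle);                        // release resources
```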

This document summarizes performance, accuracy, scalar-type rules, CUDA Graph considerations, and logging for cuTENSORMp.

Supported operation and current limitations#

The current cuTENSORMp implementation focuses on a single operation type of the form

\[C = \alpha \cdot \mathrm{contract}(A, B, \text{eq}) + \beta \cdot C ,\]

where eq is an Einstein summation notation string that defines how the modes of the input tensors \(A\) and \(B\) map onto each other and onto the output \(C\).
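
For example, with eq = "mk,kn->mn" (an ordinary matrix multiplication written in this notation), the operation specializes to

\[C_{mn} = \alpha \sum_{k} A_{mk} \, B_{kn} + \beta \, C_{mn} .\]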

At this stage, the implementation has the following important limitations:

  • Output accumulation

    • The scaling factor \(\beta\) must be equal to zero. In other words, the current implementation computes

      \[C = \alpha \cdot \mathrm{contract}(A, B, \text{eq})\]

      and does not support accumulation into an existing output tensor.

  • Supported extents

    • All tensor mode extents must be either 1 or 2. Larger mode extents are not yet supported.

  • Mode ordering in the equation

    • The mode labels in eq must respect the canonical ordering assumed by the current implementation (the same ordering as used in the provided CSV examples).

    • Concretely, in the output tensor the subsequence of modes that come from \(A\) must appear in the same relative order as in \(A\), and the subsequence of modes that come from \(B\) must appear in the same relative order as in \(B\). For example, the equation

      \[\mathrm{abcABC}, \mathrm{abcDE} \rightarrow \mathrm{ADBEC}\]

      is supported because the \(A\)-modes \((A,B,C)\) and the \(B\)-modes \((D,E)\) each preserve their original relative order in the output, even though they are interleaved.

    • Equations that change the relative order of \(A\)-only or \(B\)-only modes in the output are not currently supported (see the illustrative check after this list).

  • Tensor B replication

    • Tensor \(B\) must be replicated on all ranks. Distributing \(B\) across ranks is not currently supported.

  • Tensors A and C distribution

    • Tensors \(A\) and \(C\) must either be fully distributed over all ranks or fully replicated. Partially distributed layouts for \(A\) or \(C\) are not yet supported.

  • Distributed modes vs. reduced (contracted) modes

    • Distributed modes for \(A\) must not correspond to the reduced (contracted) \(K\)-modes in the equation (i.e., indices that are summed out). Only non-contracted / output modes may be used for distribution in the current implementation.
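
To make the mode-ordering rule concrete, the following small, self-contained C++ check (not part of the cuTENSORMp API) tests whether an operand's non-contracted modes keep their relative order in the output, using single-character mode labels as in the example above:

```cpp
#include <cassert>
#include <string>

// Returns true if the modes of `operand` that also appear in `output`
// occur there in the same relative order as in `operand`. Modes that do
// not appear in `output` are contracted and therefore skipped.
static bool preservesRelativeOrder(const std::string& operand,
                                   const std::string& output) {
    std::size_t last = 0;  // position of the previously matched mode
    for (char mode : operand) {
        std::size_t pos = output.find(mode);
        if (pos == std::string::npos) continue;  // contracted mode
        if (pos < last) return false;            // relative order violated
        last = pos;
    }
    return true;
}

int main() {
    // Supported equation "abcABC,abcDE->ADBEC": the A-modes (A,B,C) and
    // the B-modes (D,E) keep their order even though they are interleaved.
    assert(preservesRelativeOrder("abcABC", "ADBEC"));
    assert(preservesRelativeOrder("abcDE", "ADBEC"));

    // Not supported: in "...->BDAEC" the A-modes would appear as (B,A,C).
    assert(!preservesRelativeOrder("abcABC", "BDAEC"));
    return 0;
}
```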

Future versions of cuTENSORMp will lift these limitations.

CUDA Graph Support#

cuTENSORMp orchestrates both device-side cuTENSOR kernels and host-side collective communication (NCCL). While cuTENSOR kernels themselves follow the usual CUDA semantics, the presence of host-side communication and other host operations imposes limitations on CUDA Graph capture:

  • Library-level distributed contractions are not guaranteed to be CUDA-graph-capturable end-to-end, because host-side communication and other host operations may not be supported inside CUDA Graph capture.

  • In some advanced use cases, users may manually capture pure device regions that only involve cuTENSOR kernels, but this requires isolating those kernels from communication and other host activity.

If you need CUDA Graphs, follow the CUDA and cuTENSOR guidelines and carefully verify that all participating operations are capturable in your environment.
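
For reference, the manual-capture approach mentioned above follows the standard CUDA stream-capture pattern sketched below (CUDA 12-style runtime API). This is a generic sketch, not cuTENSORMp-specific code: the placeholder kernel stands in for a pure device region of cuTENSOR kernels, and any NCCL collectives or other host-side activity must remain outside the captured region.

```cpp
#include <cuda_runtime.h>

// Placeholder for the pure device work (e.g., cuTENSOR kernels) that is
// isolated from communication and other host activity.
__global__ void deviceWorkPlaceholder(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    float* d_x;
    cudaMalloc(&d_x, 32 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture only the device-side region into a graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    deviceWorkPlaceholder<<<1, 32, 0, stream>>>(d_x);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then replay cheaply as often as needed.
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, /*flags=*/0);  // CUDA 12 signature
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    // Host-side communication (e.g., NCCL collectives) belongs outside the
    // captured region, before or after the graph launches.

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```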