Multi-GPU support - cuTENSORMg

Performance Guidelines

cuTENSORMg largely has the same performance guidelines as cuTENSOR, with some additional considerations:

  • Try to fully distribute tensors across all GPUs to use most of the available bandwidth.

  • Try to keep the product of block sizes sufficiently large (i.e., large enough for a contraction with those block sizes to be efficient).

    • Depending on the data type, contractions start to reach high efficiency between 1000 and 4000 elements in the combined extent of contracted and non-contracted modes.

    • Consequently, the product of all blockings should reach that threshold.

    • For example, a 2048-element blocking in a single mode would be sufficient, or a 64 by 64 blocking in two modes.

  • Try to distribute tensors across both contracted and non-contracted modes of A and B.

  • Try to keep stride-1 modes sufficiently large (i.e., larger than 32 elements).

  • Try to choose blockings that contain small prime factors, in particular powers of two to preserve alignment.

  • Try to distribute tensors over many modes. As modes can be contracted or non-contracted w.r.t. different tensor contractions, distributing across many modes hedges against degenerate distributions.

  • The performance currently depends strongly on the relation between device count, block size, and problem size. To get the best performance, try to keep the problem size divided by the combined block size a multiple of the device count. Ideally, that multiple is larger than about 4 (see the sketch after this list).
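
The interplay between block sizes, block counts, and device count can be sanity-checked with simple arithmetic. The following standalone C sketch encodes the rules of thumb from the list above; the helper checkBlocking and its parameters are illustrative only and are not part of the cuTENSORMg API.

    #include <stdint.h>
    #include <stdio.h>

    /* Check two rules of thumb for one tensor: the product of all blockings
     * should reach roughly 1000-4000 elements, and the total number of blocks
     * should be a multiple of the device count, ideally with at least about
     * 4 blocks per device. */
    static void checkBlocking(const int64_t extent[], const int64_t blockSize[],
                              int numModes, int deviceCount)
    {
        int64_t combinedBlockSize = 1; /* product of the per-mode blockings */
        int64_t numBlocks = 1;         /* total number of blocks */
        for (int i = 0; i < numModes; i++)
        {
            combinedBlockSize *= blockSize[i];
            numBlocks *= (extent[i] + blockSize[i] - 1) / blockSize[i]; /* ceil */
        }
        if (combinedBlockSize < 1000)
            printf("combined block size %lld may be too small for efficient kernels\n",
                   (long long) combinedBlockSize);
        if (numBlocks % deviceCount != 0 || numBlocks / deviceCount < 4)
            printf("%lld blocks on %d devices may load-balance poorly\n",
                   (long long) numBlocks, deviceCount);
    }

    int main(void)
    {
        /* Example: a 4096 x 4096 tensor blocked 64 x 64 on 8 GPUs gives a
         * combined block size of 4096 elements and 4096 blocks in total,
         * i.e., 512 blocks per device. */
        const int64_t extent[]    = {4096, 4096};
        const int64_t blockSize[] = {  64,   64};
        checkBlocking(extent, blockSize, 2, 8);
        return 0;
    }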

Accuracy Guarantees

cuTENSORMg follows the same accuracy guarantees that cuTENSOR provides, but it may produce slightly lower accuracy for half-precision types, since the accumulator tile is stored in half precision, whereas the kernel accumulators use single precision.

Scalar Types

Many operations support multiplication of arguments by a scalar. The scalar type used follows the same rules as cuTENSOR Scalar Types.
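
For illustration, the snippet below sketches the usual pattern of passing scalars, assuming a contraction of the form D = alpha * A * B + beta * C with real single-precision data, for which the scalar type is float under the cuTENSOR rules; the variable names are local to this sketch and the contraction call itself is omitted.

    /* Sketch only: scalars for a real single-precision contraction.
     * As in cuTENSOR, alpha and beta are handed to the library by pointer,
     * and the pointee type must match the expected scalar type. */
    float alpha = 1.0f; /* scales the A * B product */
    float beta  = 0.0f; /* scales the existing contents of C */
    const void* alphaPtr = &alpha;
    const void* betaPtr  = &beta;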

CUDA Graph Support

All operations in cuTENSORMg can be captured using CUDA graphs as long as they involve no host operations, i.e., none of the operands store data on the host.
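
As an illustration of the general capture pattern (not specific to cuTENSORMg), the CUDA sketch below records work submitted to a single stream into a graph and replays it; the cuTENSORMg call site is only indicated by a comment, since the actual call depends on a previously created plan and device-resident operands. It assumes a CUDA toolkit providing cudaGraphInstantiateWithFlags (11.4 or newer); error checking is omitted for brevity.

    #include <cuda_runtime.h>

    /* Dummy kernel standing in for device-only work. */
    __global__ void dummyKernel(float* x) { x[threadIdx.x] += 1.0f; }

    int main(void)
    {
        float* d_x;
        cudaMalloc(&d_x, 32 * sizeof(float));
        cudaMemset(d_x, 0, 32 * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* Capture the work enqueued on this stream into a graph. Operations
         * whose operands live on the host cannot be captured (see above). */
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

        dummyKernel<<<1, 32, 0, stream>>>(d_x);
        /* ... a cuTENSORMg operation enqueued here would be captured as well,
         *     provided all of its operands reside on GPUs ... */

        cudaStreamEndCapture(stream, &graph);

        /* Instantiate once, then replay the captured work as often as needed. */
        cudaGraphExec_t graphExec;
        cudaGraphInstantiateWithFlags(&graphExec, graph, 0);
        cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        cudaFree(d_x);
        return 0;
    }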

cuTENSORMg Logging

cuTENSORMg’s logging mechanism can be enabled by setting the following environment variables before launching the target application:

CUTENSORMG_LOG_LEVEL=<level> - where <level> is one of the following levels:

  • “0” - Off - logging is disabled (default)

  • “1” - Error - only errors will be logged

  • “2” - Trace - API calls that launch CUDA kernels will log their parameters and important information

  • “3” - Hints - hints that can potentially improve the application’s performance

  • “4” - Info - provides general information about the library execution, may contain details about heuristic status

  • “5” - API Trace - API calls will log their parameters and important information

CUTENSORMG_LOG_MASK=<mask> - where <mask> is a combination of the following masks (e.g., CUTENSORMG_LOG_MASK=5 enables Error and Hints):

  • “0” - Off

  • “1” - Error

  • “2” - Trace

  • “4” - Hints

  • “8” - Info

  • “16” - API Trace

CUTENSORMG_LOG_FILE=<file_name> - where <file_name> is the path to a logging file. The file name may contain %i, which will be replaced with the process ID, e.g., “<file_name>_%i.log”.

If CUTENSORMG_LOG_FILE is not defined, the log messages are printed to stdout.