Domain Parallelism

In large-scale AI applications that span multiple GPUs, a programmer has several tools available for coordinating those GPUs to scale an application. In PhysicsNeMo, we have focused on enabling one technique in particular, called “domain parallelism”, which parallelizes the execution of a model over its input data. Several models in PhysicsNeMo support this directly, such as MeshGraphNets and SFNO, while other models rely on more generic tools to enable domain parallelism.
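
PhysicsNeMo's ShardTensor builds on PyTorch's distributed tensor (DTensor) machinery, so the core idea can be sketched in plain PyTorch. The following is a minimal, illustrative sketch rather than PhysicsNeMo code: it assumes a recent PyTorch (where `torch.distributed.tensor` is public) and one process per GPU, and it shows an input sharded across GPUs along a spatial dimension, with a pointwise operation running on each local slice.

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# One process per GPU, e.g. launched with: torchrun --nproc_per_node=4 demo.py
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# A 1-D mesh over all GPUs; domain parallelism shards the *data*
# over this mesh instead of replicating it on every rank.
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# Same seed on every rank, so each rank builds the same "global" input.
torch.manual_seed(0)
full_input = torch.randn(1, 3, 4096, 4096, device="cuda")

# Shard along the height dimension (dim 2): each rank keeps one slab.
x = distribute_tensor(full_input, mesh, placements=[Shard(2)])

# Pointwise ops propagate the sharding; each rank touches only its slab.
y = torch.nn.functional.relu(x)
print(f"rank {rank}: global {tuple(y.shape)}, local {tuple(y.to_local().shape)}")
```

Pointwise operations are the easy case. Operations with spatial dependencies, such as convolutions or attention, need communication at shard boundaries; closing that gap is what ShardTensor is for, as the tutorials below describe.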

To learn more about the domain parallelism tools in PhysicsNeMo, review the following tutorials:

Domain Parallelism and Shard Tensor

Provides an overview of what domain parallelism is, when you might need it, and how PhysicsNeMo supports it.

Implementing New Layers for ShardTensor

Provides more information on how domain parallelism can be extended to operations that are not yet supported, and especially how you might use the tools in PyTorch and PhysicsNeMo to extend domain parallelism yourself (a conceptual halo-exchange sketch follows this list).

Domain Decomposition, ShardTensor, and FSDP Tutorial

Provides an end-to-end example with synthetic data that shows how domain parallelism can be combined with other parallelism paradigms, such as data-parallel training (an FSDP sketch follows this list).

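For operations that are not yet supported, the essential ingredient is usually a halo exchange: each rank sends the edge rows of its slab to its neighbors, so that a stencil-style operation (such as a convolution) produces correct values at shard boundaries. The helper below is a conceptual, hypothetical sketch using only plain torch.distributed point-to-point calls, not the PhysicsNeMo API; it assumes the height-sharded layout and process setup from the first sketch above.

```python
import torch
import torch.distributed as dist

def halo_exchange_height(local_slab: torch.Tensor, halo: int) -> torch.Tensor:
    """Hypothetical helper: pad an (N, C, H_local, W) slab with `halo` rows
    from each neighboring rank, so a kernel of size 2*halo + 1 can produce
    correct outputs at the shard boundaries."""
    rank, world = dist.get_rank(), dist.get_world_size()
    up, down = rank - 1, rank + 1

    send_top = local_slab[:, :, :halo].contiguous()   # rows the upper neighbor needs
    send_bot = local_slab[:, :, -halo:].contiguous()  # rows the lower neighbor needs
    recv_top = torch.empty_like(send_top)             # filled by the upper neighbor
    recv_bot = torch.empty_like(send_bot)             # filled by the lower neighbor

    ops = []
    if up >= 0:
        ops += [dist.P2POp(dist.isend, send_top, up),
                dist.P2POp(dist.irecv, recv_top, up)]
    if down < world:
        ops += [dist.P2POp(dist.isend, send_bot, down),
                dist.P2POp(dist.irecv, recv_bot, down)]
    if ops:
        for req in dist.batch_isend_irecv(ops):
            req.wait()

    # Interior ranks gain halo rows on both sides; edge ranks on one side only.
    pieces = ([recv_top] if up >= 0 else []) + [local_slab] + ([recv_bot] if down < world else [])
    return torch.cat(pieces, dim=2)
```

How this kind of boundary communication is wired into ShardTensor, so that it happens automatically when a sharded tensor reaches such an operation, is what the tutorial walks through.
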
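Combining the two paradigms typically means building a two-dimensional device mesh, sharding model parameters over one dimension with FSDP and sharding the input domain over the other. The outline below follows that pattern; the PhysicsNeMo names used here (`DistributedManager`, `initialize_mesh`, `scatter_tensor`) are taken from the tutorials, but their exact signatures should be treated as assumptions, with the tutorial itself as the authoritative reference.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor import Shard

# Assumed imports, following the PhysicsNeMo tutorials.
from physicsnemo.distributed import DistributedManager, scatter_tensor

DistributedManager.initialize()
dm = DistributedManager()

# Hypothetical 8-GPU job: 2-way data parallel x 4-way domain parallel.
mesh = dm.initialize_mesh(mesh_shape=(2, 4), mesh_dim_names=("data", "domain"))

# A toy pointwise model standing in for a real network.
model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.GELU(), torch.nn.Linear(64, 32)
).to(dm.device)

# FSDP shards *parameters* over the data-parallel mesh dimension.
model = FSDP(model, device_mesh=mesh["data"], use_orig_params=True)

# scatter_tensor distributes a global batch from one source rank as a
# ShardTensor, split along a spatial dimension over the "domain" mesh axis.
global_batch = torch.randn(2, 4096, 32, device=dm.device)
x = scatter_tensor(global_batch, 0, mesh["domain"], (Shard(1),))

y = model(x)  # each rank computes on its local slab of the input
```

The tutorial develops this pattern into an end-to-end training loop on synthetic data.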

If you have questions about domain parallelism and its applications in scientific AI, find us on GitHub to discuss.