Domain Decomposition, ShardTensor, and FSDP Tutorial#
In this tutorial, we will see how to combine domain parallelism, ShardTensor, and a training or inference recipe. Before starting this tutorial, we recommend that you read the other domain parallelism tutorials.
This tutorial demonstrates how to use PhysicsNeMo’s ShardTensor
functionality alongside PyTorch’s FSDP
(Fully Sharded Data Parallel)
to train or evaluate a simple ViT. Here’s what’s in the tutorial:
ViT Model Overview
Benchmarking the ViT on a single GPU
Enabling domain parallelism with ShardTensor
Training and evaluating the model with domain parallelism
Basic ViT Model#
The model we’ll use for this tutorial is a straightforward ViT. It’s very similar to the original vision transformer from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (Dosovitskiy et al.). The model consists of two main conceptual pieces:
a convolutional tokenizer: it is a convolution with stride==kernel_size (so, non-overlapping image pieces) followed by a reshape to a sequence-like tensor with channels last.
a transformer block with residual attention and a residual MLP.
The overall model architecture is straightforward. The input image is tokenized using the convolutional tokenizer, a positional embedding is added, and then a series of transformer blocks are applied. At the end of the transformer layers, all of the tokens are averaged together. The entire architecture has one final layer to project the embedding dimension onto the output dimension.
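As a rough sketch of that data flow (an illustration only, not the model code below; shapes assume a 256x256 input with patch_size=8 and embed_dim=768, matching the defaults used later):
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn(4, 3, 256, 256)                           # (B, C, H, W)
tokenizer = nn.Conv2d(3, 768, kernel_size=8, stride=8)    # non-overlapping patches
tokens = rearrange(tokenizer(x), "b c h w -> b (h w) c")  # (4, 1024, 768): 32x32 patches
pos_embed = torch.zeros(1, 1024, 768)                     # learnable in the real model
tokens = tokens + pos_embed                               # add positional embedding
# ... a stack of transformer blocks keeps the (B, N, C) shape ...
pooled = tokens.mean(dim=1)                               # (4, 768): average over tokens
logits = nn.Linear(768, 1000)(pooled)                     # (4, 1000): classification head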
Note
This isn’t really how you might implement a transformer for a vision classification task in practice - there are better, more sophisticated techniques. Since the original ViT publication, technical advances such as Convolution Transformers, Shifted Windows, Neighborhood Attention, and others have outperformed basic ViTs like this for classification. We encourage you to pick the model architecture most suitable for your task. To demonstrate the domain parallel techniques, we’ve picked a “Standard” vision transformer here.
Here’s the core of the model:
Model Implementation
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from .PatchEmbed2d import PatchEmbedding2d
from .PatchEmbed3d import PatchEmbedding3d
from .TransformerBlock import TransformerBlock
class HybridViT(nn.Module):
"""
Hybrid Vision Transformer with conv patch embedding and multiple transformer layers.
Args:
img_size: Input image size
patch_size: Size of patches for tokenization
in_channels: Number of input channels
num_classes: Number of classes for classification
embed_dim: Embedding dimension (same for all layers)
num_heads: Number of attention heads for each stage
depth: Number of transformer layers
mlp_ratio: MLP ratios for each layer
qkv_bias: Whether to use bias in QKV projections
"""
def __init__(
self,
img_size: list[int] = [256, 256],
patch_size: int = 8,
in_channels: int = 3,
num_classes: int = 1000,
embed_dim: int = 768,
num_heads: int = 6,
depth: int = 16,
mlp_ratio: float = 4.0,
qkv_bias: bool = True,
) -> None:
super().__init__()
# Use the image size to select the padding:
if len(img_size) == 2:
self.patch_embed = PatchEmbedding2d(
img_size=img_size,
patch_size=patch_size,
in_channels=in_channels,
embed_dim=embed_dim,
)
elif len(img_size) == 3:
self.patch_embed = PatchEmbedding3d(
img_size=img_size,
patch_size=patch_size,
in_channels=in_channels,
embed_dim=embed_dim,
)
# Positional embeddings (one per patch; this model has no CLS token)
self.pos_embed = nn.Parameter(
torch.zeros(1, self.patch_embed.num_patches, embed_dim)
)
# Build transformer stages (all operating on same resolution)
self.stages = nn.ModuleList(
[
TransformerBlock(
dim=embed_dim,
num_heads=num_heads,
mlp_ratio=mlp_ratio,
qkv_bias=qkv_bias,
)
for _ in range(depth)
]
)
# Classification head
self.head = (
nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
)
def forward_features(self, x: torch.Tensor) -> torch.Tensor:
"""Extract features through all stages.
Args:
x: Input tensor of shape (B, C, H, W)
Returns:
Mean-pooled token features of shape (B, embed_dim)
"""
B = x.shape[0]
# Patch embedding
x = self.patch_embed(x) # B, N, C
# Add positional embeddings
x = x + self.pos_embed
# Apply transformer stages
for stage in self.stages:
x = stage(x)
# Return the mean of all tokens
return x.mean(dim=(1,))
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Full forward pass for classification.
Args:
x: Input tensor of shape (B, C, H, W)
Returns:
Classification logits of shape (B, num_classes)
"""
x = self.forward_features(x)
x = self.head(x)
return x
For more information on the components, expand the following sections to see the code:
Patch Embedding Implementations
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from einops import rearrange
class PatchEmbedding2d(nn.Module):
"""Single patch embedding layer that tokenizes and embeds input 2D images."""
def __init__(
self,
img_size: tuple[int],
patch_size: int = 16,
in_channels: int = 3,
embed_dim: int = 768,
) -> None:
super().__init__()
for i in img_size:
assert i % patch_size == 0, (
f"Image size {i} must be divisible by patch size {patch_size}"
)
self.img_size = img_size
self.patch_size = patch_size
self.num_patches = (img_size[0] // patch_size) * (img_size[1] // patch_size)
# Single convolution that acts as both tokenizer and linear embedding
self.conv = nn.Conv2d(
in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
)
self.norm = nn.LayerNorm(embed_dim)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Convert image to patch embeddings.
Args:
x: Input tensor of shape (B, C, H, W)
Returns:
Patch embeddings of shape (B, num_patches, embed_dim)
"""
x = self.conv(x)
# Rearrange to apply LayerNorm correctly: BCHW -> B(HW)C
x = rearrange(x, "b c h w -> b (h w) c")
x = self.norm(x)
# Tokens are now in (B, N, C) sequence format for downstream processing
x = nn.functional.relu(x)
return x
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from einops import rearrange
class PatchEmbedding3d(nn.Module):
"""Single patch embedding layer that tokenizes and embeds input 3D images."""
def __init__(
self,
img_size: tuple[int],
patch_size: int = 16,
in_channels: int = 3,
embed_dim: int = 768,
) -> None:
super().__init__()
for i in img_size:
assert i % patch_size == 0, (
f"Image size {i} must be divisible by patch size {patch_size}"
)
self.img_size = img_size
self.patch_size = patch_size
self.num_patches = (
(img_size[0] // patch_size)
* (img_size[1] // patch_size)
* (img_size[2] // patch_size)
)
# Single convolution that acts as both tokenizer and linear embedding
self.conv = nn.Conv3d(
in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
)
self.norm = nn.LayerNorm(embed_dim)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Convert image to patch embeddings.
Args:
x: Input tensor of shape (B, C, H, W, D)
Returns:
Patch embeddings of shape (B, num_patches, embed_dim)
"""
x = self.conv(x)
# Rearrange to apply LayerNorm correctly: BCHWD -> B(HWD)C
x = rearrange(x, "b c h w d -> b (h w d) c")
x = self.norm(x)
# Tokens are now in (B, N, C) sequence format for downstream processing
x = nn.functional.relu(x)
return x
Transformer Block
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from torch import nn
from .MultiHeadAttention import MultiHeadAttention
from .MLP import MLP
class TransformerBlock(nn.Module):
"""Standard transformer block with multi-head attention and MLP."""
def __init__(
self,
dim: int,
num_heads: int,
mlp_ratio: float = 4.0,
qkv_bias: bool = False,
norm_layer: nn.Module = nn.LayerNorm,
) -> None:
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = MultiHeadAttention(dim, num_heads=num_heads, qkv_bias=qkv_bias)
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = MLP(
in_features=dim, hidden_features=mlp_hidden_dim, out_features=dim
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Apply transformer block with residual connections.
Args:
x: Input tensor of shape (B, N, C)
Returns:
Transformed tensor of shape (B, N, C)
"""
# Attention block with residual connection
x = x + self.attn(self.norm1(x))
# MLP block with residual connection
x = x + self.mlp(self.norm2(x))
return x
MLP
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from torch import nn
class MLP(nn.Module):
"""MLP as used in Vision Transformer."""
def __init__(
self, in_features: int, hidden_features: int, out_features: int
) -> None:
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
# Two-layer MLP with activation
self.fc1 = nn.Linear(in_features, hidden_features)
self.act = nn.GELU()
self.fc2 = nn.Linear(hidden_features, out_features)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Apply MLP transformation.
Args:
x: Input tensor of shape (B, N, C)
Returns:
Transformed tensor of shape (B, N, out_features)
"""
x = self.fc1(x)
x = self.act(x)
x = self.fc2(x)
return x
Multi-head Attention
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from torch import nn
class MultiHeadAttention(nn.Module):
"""Standard multi-head attention using PyTorch's scaled_dot_product_attention."""
def __init__(self, dim: int, num_heads: int = 8, qkv_bias: bool = False) -> None:
super().__init__()
assert dim % num_heads == 0
self.num_heads = num_heads
self.head_dim = dim // num_heads
# Combined QKV projection for efficiency
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.proj = nn.Linear(dim, dim)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Apply multi-head self-attention.
Args:
x: Input tensor of shape (B, N, C)
Returns:
Attention output of shape (B, N, C)
"""
B, N, C = x.shape
# Project to Q, K, V and reshape for multi-head attention
qkv = (
self.qkv(x)
.reshape(B, N, 3, self.num_heads, self.head_dim)
.permute(2, 0, 3, 1, 4)
)
q, k, v = qkv[0], qkv[1], qkv[2] # B, num_heads, N, head_dim
# Use PyTorch's optimized scaled dot product attention
x = nn.functional.scaled_dot_product_attention(
q, k, v, dropout_p=0.0, is_causal=False
)
x = x.transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
return x
Running the ViT#
The training script for this tutorial uses only synthetic data, with no real dataset or labels. We loop over image sizes, initialize the ViT model, and then evaluate its computational performance (not model accuracy) using a basic loop. We measure both inference and training performance using torch.cuda.Event objects to capture timing information, averaged over a few iterations. Each of those pieces has been packaged into basic functions so that you can run and reproduce this code:
How to measure model performance
We use torch.cuda.Event objects to capture timing information and average over a few iterations.
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from torch.amp import autocast
import contextlib
import numpy as np
def benchmark_model(
model,
x,
target,
optimizer,
num_warmup=5,
num_iterations=10,
use_mixed_precision=False,
):
"""Benchmark forward pass and training step performance.
Args:
model: The model to benchmark
x: Input tensor
target: Target tensor for loss computation
optimizer: Optimizer for training step
num_warmup: Number of warmup iterations
num_iterations: Number of benchmark iterations
use_mixed_precision: Whether to use mixed precision training
Returns:
Tuple of (forward_time, training_time) in seconds
"""
# Making a flexible context here to enable us to flip mixed precision on/off easily.
if use_mixed_precision:
context = autocast("cuda")
else:
context = contextlib.nullcontext()
# HEADS UP:
# You would use a grad scaler to do stable mixed precision in real training!
# https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler
# With only a few iterations of training here, on synthetic data, we won't worry about it.
# Warmup runs
for _ in range(num_warmup):
# Inference only
with torch.no_grad():
with context:
_ = model(x)
# Training warmup step
optimizer.zero_grad()
with context:
output = model(x)
loss = nn.CrossEntropyLoss()(output, target)
loss.backward()
optimizer.step()
# Benchmark forward pass
torch.cuda.synchronize()
forward_times = []
for _ in range(num_iterations):
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
with torch.no_grad():
with context:
_ = model(x)
end_event.record()
torch.cuda.synchronize()
elapsed_time = (
start_event.elapsed_time(end_event) / 1000.0
) # Convert ms to seconds
forward_times.append(elapsed_time)
avg_forward_time = np.mean(forward_times)
# Benchmark training step
torch.cuda.synchronize()
training_times = []
for _ in range(num_iterations):
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()
optimizer.zero_grad()
with context:
output = model(x)
loss = nn.CrossEntropyLoss()(output, target)
loss.backward()
optimizer.step()
end_event.record()
torch.cuda.synchronize()
elapsed_time = (
start_event.elapsed_time(end_event) / 1000.0
) # Convert ms to seconds
training_times.append(elapsed_time)
avg_training_time = np.mean(training_times)
return avg_forward_time, avg_training_time
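As a usage sketch (assuming a CUDA device and that benchmark_model is in scope, for example imported from measure_perf as the main benchmarking script does below), it might be called like this:
# Hypothetical standalone usage of benchmark_model with a toy model and synthetic data.
# Assumes benchmark_model is already imported (e.g. from measure_perf).
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda")
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).to(device)
x = torch.randn(8, 3, 64, 64, device=device)
target = torch.randint(0, 10, (8,), device=device)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

forward_time, training_time = benchmark_model(
    model, x, target, optimizer, use_mixed_precision=False
)
print(f"forward: {forward_time * 1000:.2f} ms, train step: {training_time * 1000:.2f} ms")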
Measuring memory usage
We use torch.cuda.reset_peak_memory_stats() and torch.cuda.max_memory_allocated() to measure memory usage.
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from torch.amp import autocast
import contextlib
def get_model_memory_usage(
model, x, target=None, optimizer=None, mode="inference", use_mixed_precision=False
):
"""Estimate model memory usage for inference or training.
Args:
model: The model to measure
x: Input tensor
target: Target tensor (required for training mode)
optimizer: Optimizer (required for training mode)
mode: 'inference' or 'training'
use_mixed_precision: Whether to use mixed precision
Returns:
Peak memory usage in GB
"""
if use_mixed_precision:
context = autocast("cuda")
else:
context = contextlib.nullcontext()
torch.cuda.reset_peak_memory_stats()
if mode == "inference":
with torch.no_grad():
with context:
_ = model(x)
elif mode == "training":
if target is None or optimizer is None:
raise ValueError("target and optimizer must be provided for training mode")
optimizer.zero_grad()
with context:
output = model(x)
loss = nn.CrossEntropyLoss()(output, target)
loss.backward()
return torch.cuda.max_memory_allocated() / 1024**3 # GB
End to End Benchmarking
# SPDX-FileCopyrightText: Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-FileCopyrightText: All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.optim as optim
from .measure_perf import benchmark_model
from .measure_memory import get_model_memory_usage
def end_to_end_benchmark(args, model, inputs, full_img_size, device, num_classes):
x, target = inputs
# Count parameters
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# Create optimizer
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
try:
# Benchmark model
forward_time, training_time = benchmark_model(
model, x, target, optimizer, use_mixed_precision=args.use_mixed_precision
)
# Memory usage - measure both inference and training
inference_memory = get_model_memory_usage(
model, x, mode="inference", use_mixed_precision=args.use_mixed_precision
)
training_memory = get_model_memory_usage(
model,
x,
target,
optimizer,
mode="training",
use_mixed_precision=args.use_mixed_precision,
)
# Store results
results = {
"image_size": full_img_size[0],
"params": num_params,
"forward_time": forward_time,
"training_time": training_time,
"inference_memory": inference_memory,
"training_memory": training_memory,
"mixed_precision": args.use_mixed_precision and torch.cuda.is_available(),
}
except RuntimeError as e:
print(f" Error: {e}")
# Store failed result
results = {
"image_size": full_img_size[0],
"params": num_params,
"forward_time": float("inf"),
"training_time": float("inf"),
"inference_memory": float("inf"),
"training_memory": float("inf"),
"mixed_precision": args.use_mixed_precision and torch.cuda.is_available(),
}
# Clear cache to free memory
if torch.cuda.is_available():
torch.cuda.empty_cache()
del model, optimizer
return results
Users of PyTorch’s DDP are familiar with wrapping their model in a DDP object rather than modifying the model directly. To minimize the amount of model code you must change, ShardTensor follows the same philosophy as DDP: you should not have to modify your model code or training loops significantly to enable domain parallelism with ShardTensor.
The rest of the tutorial walks through the main script to highlight how components of the script change to enable domain parallelism.
Setting Up the Environment#
There are extra imports for DDP
, ShardTensor
, and FSDP
:
import torch
import torch.nn as nn
import torch
import torch.nn as nn
# Use PhysicsNeMo's distributed manager to simplify initialization
from physicsnemo.distributed import DistributedManager
# Add DDP import
from torch.nn.parallel import DistributedDataParallel as DDP
import torch
import torch.nn as nn
# Use PhysicsNeMo's distributed manager to simplify initialization
from physicsnemo.distributed import DistributedManager
# Upstream imports for FSDP:
from torch.distributed.tensor import distribute_module, distribute_tensor
# FSDP instead of DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.placement_types import ( # noqa: E402
Replicate,
Shard,
)
# PhysicsNeMo imports to turn your inputs into ShardTensors
from physicsnemo.distributed import scatter_tensor
Run Configuration#
The configuration is the same for all three cases:
args = parse_args()
image_sizes = list(range(args.image_size_start, args.image_size_stop + 1, args.image_size_step))
device = torch.device('cuda')
# Generate image sizes based on start, stop, and step
if args.dimension == 2:
image_sizes = list(range(args.image_size_start, args.image_size_stop + 1, args.image_size_step))
elif args.dimension == 3:
image_sizes = list(range(args.image_size_start, min(args.image_size_stop + 1, 513), args.image_size_step))
# Should we use mixed precision?
precision_mode = "FP16" if args.use_mixed_precision and torch.cuda.is_available() else "FP32"
Distributed Configuration#
Here physicsnemo.distributed.DistributedManager
is used to set up the 1D or 2D parallelization:
# Initialize distributed manager first
DistributedManager.initialize()
dm = DistributedManager()
# Set device based on local rank
device = dm.device
torch.cuda.set_device(device)
# Initialize distributed manager first
DistributedManager.initialize()
dm = DistributedManager()
# Set via commandline and argparse:
ddp_size = args.ddp_size
domain_size = args.domain_size
# Set device based on local rank
device = dm.device
torch.cuda.set_device(device)
# Initialize distributed manager first
DistributedManager.initialize()
dm = DistributedManager()
# Set via commandline and argparse:
ddp_size = args.ddp_size
domain_size = args.domain_size
# Set device based on local rank
device = dm.device
torch.cuda.set_device(device)
# Use the PhysicsNeMo distributed manager to quickly and easily set up a PyTorch DeviceMesh:
mesh = dm.initialize_mesh(
mesh_shape=(ddp_size, domain_size,), # -1 works the same way as reshaping
mesh_dim_names = ["ddp","domain"]
)
ddp_mesh = mesh["ddp"]
domain_mesh = mesh["domain"]
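As an optional sanity check (a sketch, assuming the mesh setup above has already run on every rank), you can confirm each rank's position on both mesh axes:
# Sketch: report where this process sits on the ddp and domain axes of the mesh.
import torch.distributed as dist

ddp_group = ddp_mesh.get_group()
domain_group = domain_mesh.get_group()

print(
    f"global rank {dist.get_rank()}: "
    f"ddp rank {dist.get_rank(ddp_group)} of {dist.get_world_size(ddp_group)}, "
    f"domain rank {dist.get_rank(domain_group)} of {dist.get_world_size(domain_group)}"
)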
Preparing the Inputs#
We use synthetic inputs for this tutorial. The global batch size is assumed to be configured on the command line; when running in parallel, we divide the global batch size by the number of model replicas (the ddp axis), not by the total number of GPUs.
For 2D parallelism, we divide the global batch size by the replica count and additionally apply a scatter to shard each individual example across multiple GPUs.
Because we parallelize over both the batch and the domain, each batch of data is scattered along an image axis. Note the Shard(2) placement below: in PyTorch's BCHW(D) layout, dimension 2 is the height H, which is the axis we target in this example:
if args.dimension == 2:
full_img_size = (img_size, img_size)
elif args.dimension == 3:
full_img_size = (img_size, img_size, img_size)
if args.dimension == 2:
full_img_size = (img_size, img_size)
elif args.dimension == 3:
full_img_size = (img_size, img_size, img_size)
# Create synthetic data - scale the batch size down by DDP size.
x = torch.randn(args.batch_size // ddp_size, 3, * full_img_size, device=device)
target = torch.randint(0, num_classes, (args.batch_size // ddp_size,), device=device)
if args.dimension == 2:
full_img_size = (img_size, img_size)
elif args.dimension == 3:
full_img_size = (img_size, img_size, img_size)
# Create synthetic data - scale the batch size down by DDP size.
x = torch.randn(args.batch_size // ddp_size, 3, * full_img_size, device=device)
target = torch.randint(0, num_classes, (args.batch_size // ddp_size,), device=device)
# Domain Parallel NOTE: we're generating data once per GPU but only keeping the data once per domain.
# In a real application, you'd do this properly - each GPU would read its own shard of the data.
if args.domain_size > 1:
# When scattering the data, we need to know the global rank of the source
# But by definition, we use the domain_rank == 0 as the source. Convert:
global_rank_of_source = torch.distributed.get_global_rank(domain_mesh.get_group(), 0)
# Scatter the input data across the domain:
x = scatter_tensor(
x,
global_rank_of_source,
domain_mesh,
placements=(Shard(2),), # Shard along the 2nd dimension (B C **H** W) which is the Height
global_shape = x.shape, # This will be inferred if not provided!
dtype = x.dtype, # This will be inferred if not provided!
)
target = scatter_tensor(
target,
global_rank_of_source,
domain_mesh,
placements=(Replicate(),), # REPLICATE the target
global_shape = target.shape, # This will be inferred if not provided!
dtype = target.dtype, # This will be inferred if not provided!
)
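As a quick check (a sketch, assuming ShardTensor exposes the usual DTensor inspection API of global shape, local shard, and placements), each rank can report what it actually holds after the scatter:
# Sketch: inspect the sharded input on each rank after scatter_tensor.
if args.domain_size > 1:
    print(
        f"rank {dm.rank}: global input shape {tuple(x.shape)}, "
        f"local shard shape {tuple(x.to_local().shape)}, "
        f"placements {x.placements}"
    )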
Configure the Model#
To configure the model, build it as usual and then use some torch
functionality to distribute it across 1D or 2D parallelism:
# Base model
model = HybridViT(img_size = full_img_size, in_channels=3, num_classes=num_classes)
model = model.to(device)
# Base model
model = HybridViT(img_size = full_img_size, in_channels=3, num_classes=num_classes)
model = model.to(device)
# Wrap model with DDP
model = DDP(model, device_ids=[dm.local_rank], output_device=dm.local_rank)
# Base model
model = HybridViT(img_size = full_img_size, in_channels=3, num_classes=num_classes)
model = model.to(device)
# This step syncs across the domain only
model = distribute_module(
model,
device_mesh=domain_mesh,
partition_fn = partition_model, # See below to understand what this is!
)
# This step goes in the other axis on the mesh: every rank "i" of
# each domain will sync up here.
model = FSDP(model, device_mesh=ddp_mesh, use_orig_params=False)
Above, in the ShardTensor + FSDP column, you might have noticed the partition_fn argument to distribute_module. It gives you full control over how your model's parameters are sharded across the domain mesh. For more detail, refer to the PyTorch docs.
Here, most of the parameters are replicated. However, this ViT includes a learnable positional encoding that is the same size as the tokenized data we are sharding, so we use the partition function to shard that embedding in the same way:
def partition_model(name, submodule, device_mesh):
for key, param in submodule._parameters.items():
if "pos_embed" in key:
# Replace the pos_embed with a scattered ShardTensor
# Global source is the global rank of local rank 0:
scattered_pos_embed = distribute_tensor(
submodule.pos_embed,
device_mesh=device_mesh,
placements=[
Shard(1),
],
)
submodule.register_parameter(key, torch.nn.Parameter(scattered_pos_embed))
The partition function is applied recursively to your module; it finds the parameter named pos_embed, shards it, and replaces it in the original model.
By default, all parameters that aren't handled here are converted to replicated DTensors, which is what we want for 2D parallelism.
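If you want to confirm the sharding, a quick check right after distribute_module (and before the FSDP wrap) might look like the following sketch:
# Sketch: inspect parameter placements right after distribute_module and before
# wrapping with FSDP. pos_embed should be sharded; everything else replicated.
for name, param in model.named_parameters():
    placements = getattr(param.data, "placements", None)
    if "pos_embed" in name:
        print(f"{name}: {placements}")  # expect (Shard(dim=1),) on the domain mesh
    # other parameters should report (Replicate(),)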
After adding a few extra imports, setting up a DeviceMesh
, sharding the
inputs, and distributing the model, everything else proceeds as usual.
You can run the benchmark with the same code across all three implementations:
results = end_to_end_benchmark(args, model, (x, target), full_img_size, device, num_classes)
if dm.rank == 0:
print_and_save_results(results, args, precision_mode, dm.world_size)
Note
The full training script and all worker functions, configurable by domain size and DDP size, are available in the PhysicsNeMo GitHub examples.
Benchmark Results#
Benchmark results can be useful for deciding when to use ShardTensor
or
DDP. We recommend that you use ShardTensor when you can’t fit
batch_size==1
on a single GPU.
1024x1024 2D Image#
At a resolution of 1024 pixels on a side, our baseline ViT shows reasonable performance on a single GPU.
We can keep the per-GPU batch size fixed, scale out with DDP, and get very good scaling.
We can also scale in two directions and see that latency, at fixed global
batch size, decreases; however, ShardTensor
isn’t ideal in this regime:
Training Throughput (Images / second), at 1024 pixels per side, decreases with more GPUs per image (that is, using ShardTensor), but total throughput is highest with each GPU responsible for a full image.
| GPUs / Image | B=1  | B=2  | B=4 | B=8 |
|---|---|---|---|---|
| 1 | 0.46 | 0.91 | 1.8 | 3.6 |
| 2 | 0.76 | 1.6  | 3.1 |     |
| 4 | 1.3  | 2.7  |     |     |
| 8 | 1.9  |      |     |     |
Training Memory Usage (GB) At this resolution, the model uses only 14 GB of GPU memory per image out of the available 80 GB total.
| GPUs / Image | B=1  | B=2  | B=4  | B=8  |
|---|---|---|---|---|
| 1 | 13.9 | 14.4 | 14.4 | 14.4 |
| 2 | 7.6  | 7.4  | 7.1  |      |
| 4 | 4.5  | 4.2  |      |      |
| 8 | 2.9  |      |      |      |
ShardTensor does add a small amount of overhead to most operations. Most of the kernels that benefit from domain parallelism require communication between GPUs, and efficiency increases as the computational size grows from 1024 squared to 2048 squared:
Latency per step (s) The processing time increases linearly with the number of tokens in each layer, but tokens scale as the resolution squared.
| GPUs | Inference 1024 | Train 1024 | Inference 2048 | Train 2048 |
|---|---|---|---|---|
| 1 | 0.55 | 7.96 | 2.2  | 31.4 |
| 2 | 0.32 | 4.13 | 1.32 | 16.4 |
| 4 | 0.19 | 2.23 | 0.76 | 8.78 |
| 8 | 0.13 | 1.33 | 0.54 | 5.02 |
Speedup Beyond a certain data size, ShardTensor is always faster with more GPUs, and larger images show bigger benefits.
| GPUs | Inference 1024 | Train 1024 | Inference 2048 | Train 2048 |
|---|---|---|---|---|
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.7 | 1.9 | 1.7 | 1.9 |
| 4 | 2.9 | 3.6 | 2.9 | 3.6 |
| 8 | 4.2 | 6.0 | 4.1 | 6.3 |
Memory Usage (GB) Like latency, memory usage in training scales roughly like the number of tokens. For inference, it’s driven mostly by model size.
| GPUs | Inference 1024 | Train 1024 | Inference 2048 | Train 2048 |
|---|---|---|---|---|
| 1 | 2.5 | 13.9 | 4.6 | 51.4 |
| 2 | 2.2 | 7.6  | 3.8 | 26.5 |
| 4 | 2.0 | 4.5  | 2.8 | 13.9 |
| 8 | 1.9 | 2.9  | 2.3 | 7.6  |
Memory Usage Relative to a Single GPU (%) For the highest-resolution data, we obtain a close-to-linear reduction in memory with more GPUs.
| GPUs | Inference 1024 | Train 1024 | Inference 2048 | Train 2048 |
|---|---|---|---|---|
| 1 | 100% | 100% | 100% | 100% |
| 2 | 88%  | 55%  | 83%  | 52%  |
| 4 | 80%  | 32%  | 61%  | 27%  |
| 8 | 76%  | 21%  | 50%  | 15%  |
If you are tracking the memory scaling of this model, you'll see that training memory at higher resolution is roughly proportional to the total number of pixels in the image. At 51.4 GB of training memory for 2048x2048 images, we expect the next doubling (4096x4096 pixels) to require more than 200 GB of memory per GPU.
Using ShardTensor, we can run that configuration out of the box on 8 GPUs, and we see about 26 GB of memory used per GPU, as expected. You can also run large-scale 3D vision models this way; however, because memory usage scales with the cube of the resolution (rather than the square, as in 2D), memory issues arise even faster.
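As a back-of-the-envelope check (assuming training memory scales roughly with pixel count, as the measurements above suggest):
# Rough scaling estimate based on the 2D training-memory measurements above.
mem_2048 = 51.4                            # GB, measured for 2048x2048 on one GPU
mem_4096 = mem_2048 * (4096 / 2048) ** 2   # ~205.6 GB estimated for 4096x4096
per_gpu = mem_4096 / 8                     # ~25.7 GB per GPU with 8-way domain sharding
print(f"estimated {mem_4096:.0f} GB total, ~{per_gpu:.0f} GB per GPU on 8 GPUs")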
Review of Tutorial Steps#
This tutorial covered the key steps to enable ShardTensor
in your model.
ShardTensor
performance and broad layer support are still evolving. Many
key models will work out of the box, while others contain operations that are
not yet fully supported. If you have specific requests for support, open an
issue on GitHub and
review the tutorial for Implementing New Layers for ShardTensor.
Summary of the Workflow for 2D Domain Parallelism
Define the Device Mesh
Split the mesh into two dimensions: one for data parallelism (FSDP) and one for spatial decomposition (ShardTensor). Example: mesh = dm.initialize_mesh((-1, 2), mesh_dim_names=["data", "spatial"]). For multilevel parallelism, the mesh can be extended to additional dimensions. A DeviceMesh can be conceptualized as an N-dimensional tensor in which each element is one GPU and each dimension is one axis of parallelism.
Shard Input Data
Distribute the input tensor across the spatial dimension using ShardTensor.
Handle Parameters
Use FSDP to shard parameters and optimizer state across the data-parallel dimension.
Scale Spatial Dimensions
Larger spatial dimensions can be processed efficiently by distributing computation across devices.
A condensed code sketch of these steps follows.
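This is a minimal sketch, assuming the model, the partition_model function, and the input tensors from earlier in this tutorial are already defined on every rank:
# Condensed sketch of the 2D-parallel workflow described above.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor import distribute_module
from torch.distributed.tensor.placement_types import Replicate, Shard
from physicsnemo.distributed import DistributedManager, scatter_tensor

DistributedManager.initialize()
dm = DistributedManager()

# 1. Device mesh: one axis for data parallelism, one for the spatial domain.
mesh = dm.initialize_mesh((-1, 2), mesh_dim_names=["data", "spatial"])
data_mesh, spatial_mesh = mesh["data"], mesh["spatial"]

# 2. Shard the input along the height axis of (B, C, H, W); replicate the target.
source = torch.distributed.get_global_rank(spatial_mesh.get_group(), 0)
x = scatter_tensor(x, source, spatial_mesh, placements=(Shard(2),))
target = scatter_tensor(target, source, spatial_mesh, placements=(Replicate(),))

# 3. Distribute parameters over the spatial mesh, then shard with FSDP over data.
model = distribute_module(model, device_mesh=spatial_mesh, partition_fn=partition_model)
model = FSDP(model, device_mesh=data_mesh, use_orig_params=False)

# 4. Train or evaluate as usual; larger spatial sizes are now spread across GPUs.
output = model(x)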
You are now ready to scale your models and data to very high resolutions
using ShardTensor
.