ViT Training#
This example shows how to adapt a single-device or DDP training loop to use domain parallelism, for both training and inference.
The data is synthetically generated image-like data in 2D or 3D. The training script can benchmark the model over a variety of image sizes.
The model is a convolutional embedding followed by 15 layers of Transformer blocks.
HybridViT(
(patch_embed): PatchEmbedding2d(
(conv): Conv2d(3, 768, kernel_size=(8, 8), stride=(8, 8))
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(stages): ModuleList(
(0-15): 16 x TransformerBlock(
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): MultiHeadAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(proj): Linear(in_features=768, out_features=768, bias=True)
)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): MLP(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
)
)
)
(head): Linear(in_features=768, out_features=1000, bias=True)
)
Number of parameters: 126907624
The script allows the user to control how large each parallelism axis is:
usage: training_script.py [-h] [--batch_size BATCH_SIZE] [--dimension {2,3}]
[--image_size_start IMAGE_SIZE_START] [--image_size_stop IMAGE_SIZE_STOP]
[--image_size_step IMAGE_SIZE_STEP]
[--ddp_size DDP_SIZE] [--domain_size DOMAIN_SIZE] [--use_mixed_precision]
Benchmark HybridViT model performance
options:
-h, --help show this help message and exit
--batch_size BATCH_SIZE
Global Batch size for training (default: 1)
--dimension {2,3} Dimension of the model: 2D or 3D (default: 2)
--image_size_start IMAGE_SIZE_START
Starting image size (default: 256)
--image_size_stop IMAGE_SIZE_STOP
Ending image size (default: 2048)
--image_size_step IMAGE_SIZE_STEP
Step size for image size progression (default: 128)
--ddp_size DDP_SIZE DDP world size (default: 1)
--domain_size DOMAIN_SIZE
Domain parallel size (default: 1)
--use_mixed_precision
Enable mixed precision training (default: False)
The model code is identical in all use cases: only the input data changes, and whether the model is wrapped in DDP or FSDP. Output will include a table of performance results.