Vision Transformer Training

We have curated five configurations with suggested hyperparameters for the NVIDIA DGX SuperPOD, where each node is equipped with 8 NVIDIA A100 80GB GPUs. The configuration files for these curated models are in the conf/training/vit directory. You can edit them to adjust the hyperparameters for your own training runs and to tune model performance and training efficiency to your requirements.


Model  Model size (M)  Hidden size  FFN dim  Attention heads  Number of layers  Batch Size per GPU  Accumulated Global Batch Size  Precision  AMP Level  Total Training Samples Seen
B/16   86              768          3072     12               12                512                 4096                           BF16       O2         400M
L/16   303             1024         4096     16               24                256                 4096                           BF16       O2         400M
H/14   632             1280         5120     16               32                128                 4096                           BF16       O2         400M
g/14   1011            1408         6144     16               40                64                  4096                           BF16       O2         400M
G/14   1843            1664         8192     16               48                32                  4096                           BF16       O2         400M
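
The batch-size columns in the table are related by simple arithmetic. As a minimal sketch, assuming pure data parallelism on a single node with 8 GPUs (the GPU count per run is an assumption here; multi-node runs change the result) and reading "400M" as exactly 400,000,000 samples, the following Python snippet derives how many micro-batches each GPU would accumulate per optimizer step and roughly how many optimizer steps a run takes. The names CONFIGS, accumulation_steps, and optimizer_steps are illustrative and are not part of the launcher.

```python
# Values copied from the table: (batch size per GPU, accumulated global batch size).
CONFIGS = {
    "B/16": (512, 4096),
    "L/16": (256, 4096),
    "H/14": (128, 4096),
    "g/14": (64, 4096),
    "G/14": (32, 4096),
}

NUM_GPUS = 8                 # assumption: one 8-GPU node, data parallelism only
TOTAL_SAMPLES = 400_000_000  # assumption: "400M" read as exactly 400,000,000


def accumulation_steps(per_gpu_batch: int, global_batch: int, num_gpus: int) -> int:
    """Micro-batches each GPU processes before one optimizer step."""
    samples_per_pass = per_gpu_batch * num_gpus
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be a multiple of per-GPU batch * GPUs")
    return global_batch // samples_per_pass


def optimizer_steps(total_samples: int, global_batch: int) -> int:
    """Full optimizer steps needed to see total_samples (rounded down)."""
    return total_samples // global_batch


if __name__ == "__main__":
    for name, (per_gpu, global_batch) in CONFIGS.items():
        accum = accumulation_steps(per_gpu, global_batch, NUM_GPUS)
        steps = optimizer_steps(TOTAL_SAMPLES, global_batch)
        print(f"{name}: accumulation steps = {accum}, optimizer steps = {steps}")
```

Under these assumptions, B/16 needs no accumulation (512 x 8 = 4096), while G/14 accumulates 16 micro-batches per optimizer step (32 x 8 x 16 = 4096).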

To enable the training stage with a Vision Transformer (ViT) model, update the configuration files as follows:

  1. In the defaults section of conf/config.yaml, update the training field to point to the desired ViT configuration file. For example, to use the B/16 (i.e. B_16) configuration, change the training field to vit/B_16.


    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: null
      - training: vit/B_16
      ...

  2. In the stages field of conf/config.yaml, make sure the training stage is included. For example:


    stages:
      - training
      ...

  3. Execute the launcher pipeline: python3
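
Before launching, the two configuration edits above can be sanity-checked. The following is a minimal sketch, assuming the contents of conf/config.yaml have already been parsed into a Python dict (for example with a YAML library); the helper name check_vit_training_config and the example dict are hypothetical and are not part of the launcher.

```python
from __future__ import annotations


def check_vit_training_config(config: dict, expected: str = "vit/B_16") -> list[str]:
    """Return a list of problems; an empty list means both edits are in place."""
    problems = []

    # The defaults list mixes plain strings ("_self_") with single-key
    # mappings such as {"training": "vit/B_16"}.
    training = None
    for entry in config.get("defaults", []):
        if isinstance(entry, dict) and "training" in entry:
            training = entry["training"]
    if training != expected:
        problems.append(f"defaults.training is {training!r}, expected {expected!r}")

    # The training stage must be listed for the launcher to run it.
    if "training" not in config.get("stages", []):
        problems.append("'training' is missing from stages")

    return problems


# Stand-in for the parsed contents of conf/config.yaml (illustrative values).
example = {
    "defaults": [
        "_self_",
        {"cluster": "bcm"},
        {"data_preparation": None},
        {"training": "vit/B_16"},
    ],
    "stages": ["training"],
}
print(check_vit_training_config(example))  # prints []
```

Returning a list of problems rather than raising on the first one lets both settings be reported in a single pass.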

Remarks: The correctness of our Vision Transformer implementation has been verified by pretraining ViT B/16 for 300 epochs on the ImageNet 1K dataset. The results are consistent with the performance expected of Vision Transformers.

Last updated on Jun 19, 2024.