Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Vision Transformer Training

We have curated five configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, which is equipped with 8 NVIDIA A100 80GB GPUs per node. The configuration files for the curated models can be found in the conf/training/vit directory. You can modify these files to adjust the hyperparameters for your specific training runs and to tune model performance and training efficiency to your needs.

| Model | Model size (M) | Hidden size | FFN_dim | Attention heads | Number of layers | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Total Training Samples Seen |
|-------|----------------|-------------|---------|-----------------|------------------|--------------------|-------------------------------|-----------|-----------|-----------------------------|
| B/16  | 86             | 768         | 3072    | 12              | 12               | 512                | 4096                          | BF16      | O2        | 400M                        |
| L/16  | 303            | 1024        | 4096    | 16              | 24               | 256                | 4096                          | BF16      | O2        | 400M                        |
| H/14  | 632            | 1280        | 5120    | 16              | 32               | 128                | 4096                          | BF16      | O2        | 400M                        |
| g/14  | 1011           | 1408        | 6144    | 16              | 40               | 64                 | 4096                          | BF16      | O2        | 400M                        |
| G/14  | 1843           | 1664        | 8192    | 16              | 48               | 32                 | 4096                          | BF16      | O2        | 400M                        |
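
Each curated configuration file encodes the hyperparameters listed above. As a rough illustration only, the model section of a file such as conf/training/vit/vit_B_16.yaml might look like the sketch below; the field names follow common NeMo Megatron-style configs and are not guaranteed to match the shipped file exactly, so consult the files in conf/training/vit for the authoritative values.

    # Hypothetical excerpt of a B/16 training config (field names assumed,
    # not copied from the shipped file).
    model:
      precision: bf16
      micro_batch_size: 512        # batch size per GPU (see table above)
      global_batch_size: 4096      # accumulated global batch size
      num_layers: 12
      hidden_size: 768
      ffn_hidden_size: 3072        # FFN_dim
      num_attention_heads: 12
      megatron_amp_O2: true        # AMP level O2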

To enable the training stage with a Vision Transformer (ViT) model, make the following changes to the configuration files:

  1. In the defaults section of conf/config.yaml, update the training field to point to the desired ViT configuration file. For example, to use the B/16 configuration (i.e., vit_B_16), change the training field to vit/vit_B_16.

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: null
      - training: vit/vit_B_16
      ...
    
  2. In the stages field of conf/config.yaml, make sure the training stage is included. For example,

    stages:
      - training
      ...
    
  3. Execute the launcher pipeline: python3 main.py. (A condensed conf/config.yaml reflecting the edits from steps 1 and 2 is sketched after this list.)
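
Putting the two edits together, the relevant parts of conf/config.yaml should look like the condensed sketch below; the actual file contains many additional fields that are omitted here.

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: null
      - training: vit/vit_B_16
      ...

    stages:
      - training
      ...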

Remarks: The correctness of our Vision Transformer implementation has been verified by pretraining ViT B/16 for 300 epochs on the ImageNet-1K dataset; the results are consistent with the expected performance of Vision Transformers.