Vision Transformer Training
We have curated 5 configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, each node of which is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the `conf/training/vit` directory. You can modify these parameters to adjust the hyperparameters for your specific training runs, tailoring the model's performance and training efficiency to your needs.
| Model | Model size (M) | Hidden size | FFN dim | Attention heads | Number of layers | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| B/16 | 86 | 768 | 3072 | 12 | 12 | 512 | 4096 | BF16 | O2 | 400M |
| L/16 | 303 | 1024 | 4096 | 16 | 24 | 256 | 4096 | BF16 | O2 | 400M |
| H/14 | 632 | 1280 | 5120 | 16 | 32 | 128 | 4096 | BF16 | O2 | 400M |
| g/14 | 1011 | 1408 | 6144 | 16 | 40 | 64 | 4096 | BF16 | O2 | 400M |
| G/14 | 1843 | 1664 | 8192 | 16 | 48 | 32 | 4096 | BF16 | O2 | 400M |
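To illustrate how the hyperparameters in the table map onto a training config, the sketch below mirrors the B/16 row. The exact key names are assumptions based on common NeMo Megatron config fields; consult the actual file under `conf/training/vit` for the authoritative layout.

```yaml
# Hypothetical excerpt of a ViT B/16 training config.
# Key names are assumptions; values come from the B/16 row of the table above.
model:
  micro_batch_size: 512       # batch size per GPU
  global_batch_size: 4096     # accumulated global batch size
  precision: bf16
  megatron_amp_O2: true       # AMP level O2
  num_layers: 12
  hidden_size: 768
  ffn_hidden_size: 3072       # FFN dimension
  num_attention_heads: 12
```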
To enable the training stage with a Vision Transformer (ViT) model, make the following changes to the configuration files:
1. In the `defaults` section of `conf/config.yaml`, update the `training` field to point to the desired ViT configuration file. For example, to use the B/16 (i.e., `B_16`) configuration, change the `training` field to `vit/vit_B_16`:

   ```yaml
   defaults:
     - _self_
     - cluster: bcm
     - data_preparation: null
     - training: vit/vit_B_16
     ...
   ```
2. In the `stages` field of `conf/config.yaml`, make sure the training stage is included. For example:

   ```yaml
   stages:
     - training
     ...
   ```
3. Execute the launcher pipeline: `python3 main.py`.
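The launcher is Hydra-based, so as an alternative to editing `conf/config.yaml` directly, the same selections can typically be passed as command-line overrides. The invocation below is a sketch under that assumption; verify the override paths against your launcher version.

```bash
# Sketch: select the ViT B/16 training config and run only the training stage
# via Hydra-style overrides (the override paths are assumptions, not verified flags).
python3 main.py training=vit/vit_B_16 stages=[training]
```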
Remarks: The correctness of our Vision Transformer implementation has been verified by pretraining ViT B/16 for 300 epochs on the ImageNet-1K dataset, confirming that our implementation is consistent with the expected performance and results of Vision Transformers in general.