Important

NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Vision Transformer Training
We have curated five configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, which is equipped with 8 NVIDIA A100 80GB GPUs. The configuration files for the curated models can be found in the conf/training/vit directory. You can access and modify the parameters to adjust the hyperparameters for your specific training runs, tailoring the model's performance and training efficiency to your needs. The curated configurations are summarized in the table below.
| Model | Model size (M) | Hidden size | FFN dim | Attention heads | Number of layers | Batch size per GPU | Accumulated global batch size | Precision | AMP level | Total training samples seen |
|---|---|---|---|---|---|---|---|---|---|---|
| B/16 | 86 | 768 | 3072 | 12 | 12 | 512 | 4096 | BF16 | O2 | 400M |
| L/16 | 303 | 1024 | 4096 | 16 | 24 | 256 | 4096 | BF16 | O2 | 400M |
| H/14 | 632 | 1280 | 5120 | 16 | 32 | 128 | 4096 | BF16 | O2 | 400M |
| g/14 | 1011 | 1408 | 6144 | 16 | 40 | 64 | 4096 | BF16 | O2 | 400M |
| G/14 | 1843 | 1664 | 8192 | 16 | 48 | 32 | 4096 | BF16 | O2 | 400M |
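As a rough illustration of what such a training configuration might contain, the sketch below mirrors the B/16 row of the table above. The file layout and key names shown (for example `micro_batch_size`, `ffn_hidden_size`, and the `trainer` block) are assumptions and may differ between launcher versions; consult the actual files in conf/training/vit for the real schema.

```yaml
# Illustrative sketch only -- key names are assumptions; check
# conf/training/vit/vit_B_16.yaml in your release for the real schema.
trainer:
  precision: bf16          # BF16 mixed precision, as listed in the table
  devices: 8               # GPUs per node on a DGX A100 node

model:
  micro_batch_size: 512    # batch size per GPU for ViT B/16
  global_batch_size: 4096  # accumulated global batch size
  num_layers: 12
  hidden_size: 768
  ffn_hidden_size: 3072
  num_attention_heads: 12
```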
To enable the training stage with a Vision Transformer (ViT) model, configure the configuration files as follows:

1. In the `defaults` section of `conf/config.yaml`, update the `training` field to point to the desired ViT configuration file. For example, to use the `B/16` (i.e., `vit_B_16`) configuration, change the `training` field to `vit/vit_B_16`:

   ```yaml
   defaults:
     - _self_
     - cluster: bcm
     - data_preparation: null
     - training: vit/vit_B_16
     ...
   ```

2. In the `stages` field of `conf/config.yaml`, make sure the training stage is included. For example:

   ```yaml
   stages:
     - training
     ...
   ```

3. Execute the launcher pipeline: `python3 main.py`.
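Since the launcher configuration is Hydra-based, the same two settings can typically also be supplied as command-line overrides instead of editing conf/config.yaml. The example below assumes standard Hydra override syntax, which may differ in your launcher version.

```bash
# Hypothetical Hydra-style overrides (assumes standard Hydra CLI syntax):
# select the ViT B/16 training config and run only the training stage.
python3 main.py training=vit/vit_B_16 'stages=[training]'
```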
Remarks: The correctness of our Vision Transformer implementation has been verified by pretraining ViT B/16 for 300 epochs on the ImageNet-1K dataset, demonstrating that our implementation is consistent with the expected performance and results of Vision Transformers in general.