Vision Transformer Training
We have curated 5 configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, each node of which is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the `conf/training/vit` directory. You can modify these parameters to adjust the hyperparameters for your specific training runs, tailoring the model's performance and training efficiency to your needs.
| Model | Model size (M) | Hidden size | FFN dim | Attention heads | Number of layers | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| B/16 | 86 | 768 | 3072 | 12 | 12 | 512 | 4096 | BF16 | O2 | 400M |
| L/16 | 303 | 1024 | 4096 | 16 | 24 | 256 | 4096 | BF16 | O2 | 400M |
| H/14 | 632 | 1280 | 5120 | 16 | 32 | 128 | 4096 | BF16 | O2 | 400M |
| g/14 | 1011 | 1408 | 6144 | 16 | 40 | 64 | 4096 | BF16 | O2 | 400M |
| G/14 | 1843 | 1664 | 8192 | 16 | 48 | 32 | 4096 | BF16 | O2 | 400M |
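To illustrate how the hyperparameters in the table map onto a training config, the sketch below mirrors the B/16 row. The exact key names are assumptions based on common NeMo Megatron config fields; consult the actual file under `conf/training/vit` for the authoritative layout.

```yaml
# Hypothetical excerpt of a ViT B/16 training config.
# Key names are assumptions; values come from the B/16 row of the table above.
model:
  micro_batch_size: 512       # batch size per GPU
  global_batch_size: 4096     # accumulated global batch size
  precision: bf16
  megatron_amp_O2: true       # AMP level O2
  num_layers: 12
  hidden_size: 768
  ffn_hidden_size: 3072       # FFN dimension
  num_attention_heads: 12
```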
To enable the training stage with a Vision Transformer (ViT) model, make the following changes to the configuration files:
1. In the `defaults` section of `conf/config.yaml`, update the `training` field to point to the desired ViT configuration file. For example, to use the B/16 (i.e., `B_16`) configuration, change the `training` field to `vit/vit_B_16`:

   ```yaml
   defaults:
     - _self_
     - cluster: bcm
     - data_preparation: null
     - training: vit/vit_B_16
     ...
   ```
2. In the `stages` field of `conf/config.yaml`, make sure the training stage is included. For example:

   ```yaml
   stages:
     - training
     ...
   ```
3. Execute the launcher pipeline: `python3 main.py`.
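The launcher is Hydra-based, so as an alternative to editing `conf/config.yaml` directly, the same selections can typically be passed as command-line overrides. The invocation below is a sketch under that assumption; verify the override paths against your launcher version.

```bash
# Sketch: select the ViT B/16 training config and run only the training stage
# via Hydra-style overrides (the override paths are assumptions, not verified flags).
python3 main.py training=vit/vit_B_16 stages=[training]
```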
Remarks: The correctness of our Vision Transformer implementation has been verified by pretraining ViT B/16 for 300 epochs on the ImageNet-1K dataset, confirming that our implementation is consistent with the expected performance and results of Vision Transformers in general.