Training with Predefined Configurations

We provide three curated configurations with suggested hyperparameters, targeting the NVIDIA DGX SuperPOD (each node equipped with 8 NVIDIA A100 80GB GPUs). The configuration files for these models are located in the conf/training/clip directory. You can modify the hyperparameters in these files to tailor training efficiency and model quality to your own runs and hardware.

| Model | Image size | Text Model size (M) | Image Model size (M) | Output dim | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Total Training Samples Seen |
|-------|------------|---------------------|----------------------|------------|--------------------|-------------------------------|-----------|-----------|-----------------------------|
| ViT B/32 | 224 | 63  | 87  | 512  | 500 | 32000 | BF16 | O2 | 12B |
| ViT L/14 | 224 | 123 | 303 | 768  | 112 | 32256 | BF16 | O2 | 12B |
| ViT H/14 | 224 | 354 | 638 | 1024 | 80  | 32000 | BF16 | O2 | 12B |
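
The table columns map onto fields inside these YAML files. As a rough illustration, a fragment for the ViT B/32 row might look like the sketch below; the key names here are assumptions based on typical NeMo Megatron-style configs, so check conf/training/clip/vit_B_32.yaml itself for the authoritative layout.

    # Hedged sketch of a few fields in conf/training/clip/vit_B_32.yaml.
    # Key names are assumptions; consult the shipped file for the exact structure.
    trainer:
      precision: bf16            # "Precision" column
    model:
      micro_batch_size: 500      # "Batch Size per GPU" column
      global_batch_size: 32000   # "Accumulated Global Batch Size" column
      megatron_amp_O2: True      # "AMP Level" column (O2)
      output_dim: 512            # "Output dim" column (shared embedding size)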

To enable the training stage with a CLIP model, update the configuration files as follows:

  1. In the defaults section of conf/config.yaml, update the training field to point to the desired CLIP configuration file. For example, to use the ViT B/32 configuration (vit_B_32), set the training field to clip/vit_B_32.

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      - training: clip/vit_B_32
      ...
    
  2. In the stages field of conf/config.yaml, make sure the training stage is included. For example:

    stages:
      - data_preparation
      - training
      ...
    
  3. Execute the launcher pipeline: python3 main.py
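
Alternatively, because the launcher is driven by Hydra-style configuration, the same selections can typically be passed as command-line overrides instead of editing conf/config.yaml; the exact syntax below is an assumption and may differ across launcher versions.

    # Assumed Hydra-style overrides; verify against your launcher version.
    python3 main.py training=clip/vit_B_32 stages=[training]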

Remarks:

  1. NeMo CLIP does not yet support gradient accumulation, so make sure that micro_batch_size * num_gpus = global_batch_size (i.e., the gradient accumulation step is 1). For example, the ViT B/32 configuration above (micro batch size 500, global batch size 32000) requires 64 GPUs, i.e. 8 nodes with 8 GPUs each.

  2. For CLIP models, you can enable Exponential Moving Average (EMA) by setting training.exp_manager.ema.enable=True. However, EMA is currently not compatible with AMP O2, so to use EMA you must disable AMP O2 by setting training.model.megatron_amp_O2=False. Enabling EMA can help the model converge faster, but it may incur a slight performance penalty. A config-file equivalent of these two overrides is sketched below.
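
As a reference for remark 2, the two EMA-related settings can also be placed directly in the selected training configuration file rather than passed as overrides; assuming the keys appear there without the training. prefix (which is just the launcher's config group name), the fragment would look roughly like this.

    # Enable EMA and disable AMP O2 (EMA is currently incompatible with AMP O2).
    # Assumed placement inside the CLIP training config file.
    exp_manager:
      ema:
        enable: True
    model:
      megatron_amp_O2: False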