We have curated 3 configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, in which each node is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models are in the conf/training/clip directory. You can edit these files to adjust the hyperparameters for your own training runs and to balance model performance against training efficiency.


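| Model | Image size | Text Model size (M) | Image Model size (M) | Output dim | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Total Training Samples Seen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT B/32 | 224 | 63 | 87 | 512 | 500 | 32000 | BF16 | O2 | 12B |
| ViT L/14 | 224 | 123 | 303 | 768 | 112 | 32256 | BF16 | O2 | 12B |
| ViT H/14 | 224 | 354 | 638 | 1024 | 80 | 32000 | BF16 | O2 | 12B |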
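The batch-size columns can be related to a GPU count with a little arithmetic. The sketch below is illustrative only: it assumes that the per-GPU batch size times the number of GPUs equals the global batch size (gradient accumulation step of 1) and that each node has 8 GPUs. The resulting GPU and node counts are derived values, not figures stated in this guide, and the names `CURATED`, `GPUS_PER_NODE`, and `implied_gpu_count` are hypothetical.

```python
# Rows copied from the table: (model, batch size per GPU, global batch size).
CURATED = [
    ("ViT B/32", 500, 32000),
    ("ViT L/14", 112, 32256),
    ("ViT H/14", 80, 32000),
]

GPUS_PER_NODE = 8  # assumption: 8 A100 GPUs per DGX node


def implied_gpu_count(batch_per_gpu: int, global_batch: int) -> int:
    """Return the GPU count for which batch_per_gpu * num_gpus == global_batch."""
    if global_batch % batch_per_gpu != 0:
        raise ValueError("global batch size is not a multiple of the per-GPU batch size")
    return global_batch // batch_per_gpu


for model, per_gpu, global_batch in CURATED:
    gpus = implied_gpu_count(per_gpu, global_batch)
    print(f"{model}: {gpus} GPUs, {gpus / GPUS_PER_NODE:g} nodes")
```

If you train on a different number of GPUs, the same relation tells you which batch-size value has to change to keep the two numbers consistent.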

To enable the training stage with a CLIP model, update the configuration files as follows:

  1. In the defaults section of conf/config.yaml, update the training field to point to the desired CLIP configuration file. For example, if you want to use the ViT B/32 (i.e. vit_B_32), change the training field to clip/vit_B_32.


    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      - training: clip/vit_B_32
      ...

  2. In the stages field of conf/config.yaml, make sure the training stage is included. For example,


    stages:
      - data_preparation
      - training
      ...

  3. Execute the launcher pipeline: python3
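Before launching, it can help to confirm that the first two steps were applied. The following is a minimal sketch, not part of the launcher: it uses a plain Python dictionary as a hypothetical in-memory mirror of the conf/config.yaml fragments shown above, and the helper names `selected_training` and `check_clip_training` are made up for illustration.

```python
from typing import List, Optional

# Hypothetical in-memory mirror of the two conf/config.yaml fragments above.
config = {
    "defaults": [
        "_self_",
        {"cluster": "bcm"},
        {"data_preparation": "multimodal/download_multimodal"},
        {"training": "clip/vit_B_32"},
    ],
    "stages": ["data_preparation", "training"],
}


def selected_training(cfg: dict) -> Optional[str]:
    """Return the value of the training entry in defaults, or None if absent."""
    for entry in cfg.get("defaults", []):
        if isinstance(entry, dict) and "training" in entry:
            return entry["training"]
    return None


def check_clip_training(cfg: dict) -> List[str]:
    """Collect human-readable problems with steps 1 and 2."""
    problems = []
    training = selected_training(cfg)
    if training is None or not training.startswith("clip/"):
        problems.append("defaults: training does not point to a clip/ configuration")
    if "training" not in cfg.get("stages", []):
        problems.append("stages: the training stage is not included")
    return problems


# An empty list means both steps are satisfied.
print(check_clip_training(config))
```

In practice the dictionary would come from loading the real file with a YAML parser; that step is left out here to keep the sketch dependency-free.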


Remarks:

  1. NeMo CLIP does not yet support gradient accumulation. Make sure that micro_batch_size * num_gpus = global_batch_size (that is, the number of gradient accumulation steps is 1).

  2. For CLIP models, you can enable Exponential Moving Average (EMA) by setting training.exp_manager.ema.enable=True. However, EMA is currently not compatible with AMP O2. To use EMA, you must disable AMP O2 by setting training.model.megatron_amp_O2=False. Enabling EMA can help your model converge faster, but be aware that it may result in a slight performance penalty.
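The two constraints above can be expressed as a small pre-flight check. This is a sketch under stated assumptions rather than anything NeMo provides: the function `check_clip_remarks` is hypothetical, its arguments simply reuse the parameter names from the text, and the value of 64 GPUs in the usage line is an example chosen to satisfy the batch-size identity.

```python
from typing import List


def check_clip_remarks(
    micro_batch_size: int,
    num_gpus: int,
    global_batch_size: int,
    ema_enable: bool,
    megatron_amp_O2: bool,
) -> List[str]:
    """Return a list of violated constraints; an empty list means none were found."""
    problems = []
    # Constraint 1: no gradient accumulation, so the product must match exactly.
    if micro_batch_size * num_gpus != global_batch_size:
        problems.append(
            "micro_batch_size * num_gpus must equal global_batch_size "
            f"({micro_batch_size} * {num_gpus} != {global_batch_size})"
        )
    # Constraint 2: EMA cannot be combined with AMP O2.
    if ema_enable and megatron_amp_O2:
        problems.append(
            "EMA is enabled together with AMP O2; "
            "set training.model.megatron_amp_O2=False to use EMA"
        )
    return problems


# Example: batch sizes are consistent, but EMA and AMP O2 are both enabled.
for problem in check_clip_remarks(500, 64, 32000, ema_enable=True, megatron_amp_O2=True):
    print(problem)
```

Returning a list instead of raising on the first problem lets you report every misconfiguration in one pass.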

