Training with Predefined Configurations

We provide three curated configurations with suggested hyperparameters, targeting the NVIDIA DGX SuperPOD (each node equipped with 8 NVIDIA A100 80GB GPUs). The configuration files for these models are located in the conf/training/clip directory. You can modify the hyperparameters in these files to tailor training efficiency and model quality to your own runs and hardware.

| Model | Image size | Text Model size (M) | Image Model size (M) | Output dim | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Total Training Samples Seen |
|-------|------------|---------------------|----------------------|------------|--------------------|-------------------------------|-----------|-----------|-----------------------------|
| ViT B/32 | 224 | 63  | 87  | 512  | 500 | 32000 | BF16 | O2 | 12B |
| ViT L/14 | 224 | 123 | 303 | 768  | 112 | 32256 | BF16 | O2 | 12B |
| ViT H/14 | 224 | 354 | 638 | 1024 | 80  | 32000 | BF16 | O2 | 12B |
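
The table columns map onto fields inside these YAML files. As a rough illustration, a fragment for the ViT B/32 row might look like the sketch below; the key names here are assumptions based on typical NeMo Megatron-style configs, so check conf/training/clip/vit_B_32.yaml itself for the authoritative layout.

    # Hedged sketch of a few fields in conf/training/clip/vit_B_32.yaml.
    # Key names are assumptions; consult the shipped file for the exact structure.
    trainer:
      precision: bf16            # "Precision" column
    model:
      micro_batch_size: 500      # "Batch Size per GPU" column
      global_batch_size: 32000   # "Accumulated Global Batch Size" column
      megatron_amp_O2: True      # "AMP Level" column (O2)
      output_dim: 512            # "Output dim" column (shared embedding size)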

To enable the training stage with a CLIP model, update the configuration files as follows:

  1. In the defaults section of conf/config.yaml, update the training field to point to the desired CLIP configuration file. For example, to use the ViT B/32 configuration (vit_B_32), set the training field to clip/vit_B_32.

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      - training: clip/vit_B_32
      ...
    
  2. In the stages field of conf/config.yaml, make sure the training stage is included. For example:

    stages:
      - data_preparation
      - training
      ...
    
  3. Execute the launcher pipeline: python3 main.py
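
Alternatively, because the launcher is driven by Hydra-style configuration, the same selections can typically be passed as command-line overrides instead of editing conf/config.yaml; the exact syntax below is an assumption and may differ across launcher versions.

    # Assumed Hydra-style overrides; verify against your launcher version.
    python3 main.py training=clip/vit_B_32 stages=[training]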

Remarks:

  1. NeMo CLIP does not yet support gradient accumulation, so make sure that micro_batch_size * num_gpus = global_batch_size (i.e., the gradient accumulation step is 1). For example, the ViT B/32 configuration above (micro batch size 500, global batch size 32000) requires 64 GPUs, i.e. 8 nodes with 8 GPUs each.

  2. For CLIP models, you can enable Exponential Moving Average (EMA) by setting training.exp_manager.ema.enable=True. However, EMA is currently not compatible with AMP O2, so to use EMA you must disable AMP O2 by setting training.model.megatron_amp_O2=False. Enabling EMA can help the model converge faster, but it may incur a slight performance penalty. A config-file equivalent of these two overrides is sketched below.
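
As a reference for remark 2, the two EMA-related settings can also be placed directly in the selected training configuration file rather than passed as overrides; assuming the keys appear there without the training. prefix (which is just the launcher's config group name), the fragment would look roughly like this.

    # Enable EMA and disable AMP O2 (EMA is currently incompatible with AMP O2).
    # Assumed placement inside the CLIP training config file.
    exp_manager:
      ema:
        enable: True
    model:
      megatron_amp_O2: False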