Training with Predefined Configurations

NVIDIA provides five configurations with suggested hyperparameters, tuned specifically for the NVIDIA DGX SuperPOD, which is equipped with 8 NVIDIA A100 80GB GPUs per node. The configurations for the curated models can be found in the conf/training/imagen directory. You can access and modify these parameters to adjust the hyperparameters for your specific training runs, tailoring the model's performance and training efficiency to your needs.
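
For example, batch and precision settings like those in the table below appear as ordinary config fields. The following is a minimal, hypothetical excerpt; the key names are assumptions and should be checked against the actual files in conf/training/imagen:

    # Hypothetical excerpt of a config such as conf/training/imagen/500m_res_64.yaml
    # (key names assumed; verify against the shipped file)
    trainer:
      precision: bf16            # BF16 training (see the Precision column below)
    model:
      micro_batch_size: 128      # batch size per GPU (table below, 500m_res_64 row)
      global_batch_size: 4096    # accumulated global batch size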

The training process for Imagen typically involves multiple stages of models at different resolutions (64x64, 256x256, 1024x1024), and the dataset is deliberately filtered for each stage (see the Dataset Filtering column below) to achieve superior image quality. We provide five training configurations:

Base model:
  • base64-2b: a 2B-parameter 64x64 model, as described in the Imagen paper, Appendix F.1

  • base64-500m: a 500M-parameter 64x64 model with reduced channel size

SR256 model:
  • sr256-600m: a 600M-parameter 256x256 EfficientUNet model, as described in the Imagen paper, Appendix F.2

  • sr256-400m: a 400M-parameter 256x256 regular UNet model, configured similarly to DeepFloyd IF-II-M

SR1024 model:
  • sr1024-600m: a 600M-parameter 1024x1024 EfficientUNet model, as described in the Imagen paper, Appendix F.3

| Model | Resolution | UNet Model Size (M) | Text Conditioning Model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---------------|------|------|---------|-----|------|------|----|-------|-------------------|-------|
| 500m_res_64   | 64   | 524  | t5-11b  | 128 | 4096 | BF16 | O1 | 676M  | None              | 5.12B |
| 2b_res_64     | 64   | 2100 | t5-11b  | 32  | 2048 | BF16 | O1 | 676M  | None              | 5.12B |
| 600m_res_256  | 256  | 646  | t5-11b  | 64  | 4096 | BF16 | O1 | 544M  | resolution >= 256 | 1.23B |
| 400m_res_256  | 256  | 429  | t5-11b  | 16  | 2048 | BF16 | O1 | 544M  | resolution >= 256 | 1.23B |
| 600m_res_1024 | 1024 | 427  | t5-11b  | 64  | 4096 | BF16 | O1 | 39.5M | resolution >= 1024 | 1.23B |

To enable the training stage with Imagen, configure the following:

  1. In the defaults section, update the training field to point to the desired Imagen configuration file. For example, if you want to start training the base64 500m model from scratch, change the training field to imagen/500m_res_64.yaml.

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      - training: imagen/500m_res_64.yaml
      ...
    
  2. In the stages field, make sure the training stage is included. For example,

    stages:
      - data_preparation
      - training
      ...
    

Remarks:

1. There is no training dependency between the base and super-resolution models: given sufficient computing resources, all models (64x64, 256x256, 1024x1024) can be trained simultaneously and independently.

2. We recommend preprocessing the training dataset with pre-cached text embeddings. Imagen typically conditions on T5 embeddings, and the T5 encoders are large: loading one during training significantly reduces throughput, and we observed a significant drop in batch size and throughput with the online-encoding option. (A config sketch follows this list.)

3. Although the Imagen paper claims that EfficientUNet has better throughput without compromising visual quality, we empirically found that training the regular UNet for the SR model still produces more visually appealing images.

4. We provide two scheduling/sampling options for Imagen training: continuous DDPM and EDM (Karras et al., 2022). Continuous DDPM is the default scheme used in the original paper; empirically, we found that EDM yields a lower FID score. (See the sketch after this list.)

5. While the paper uses the T5-xxl encoder (4096-dimensional embeddings), we use the T5-11b encoder (1024-dimensional embeddings) during training due to space considerations.

6. There is no guarantee that training Imagen for an extended period will necessarily improve FID/CLIP scores. To achieve the best results, we suggest evaluating multiple checkpoints from the late stages of convergence.
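
As referenced in remark 2, here is a minimal sketch of what a pre-cached text-conditioning section could look like. All key names (conditioning, token_length, precached_key, out_key) are assumptions written in NeMo-style config syntax, not verified against the shipped files:

    model:
      conditioning:                    # hypothetical section name
        embed_dim: 1024                # T5-11b embedding dimension (see remark 5)
        token_length: 128              # assumed cap on cached text tokens
        precached_key: embeddings_t5   # assumed key of the cached embedding in each record
        out_key: t5_text               # assumed name the UNet reads the conditioning under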
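
And for remark 4, a hedged sketch of how the scheduling choice might be expressed. The preconditioning_type key and its nested fields are assumptions; the EDM values shown (sigma_data, p_mean, p_std) are the defaults from the EDM paper (Karras et al., 2022), not values read from the shipped configs:

    model:
      preconditioning_type: EDM        # assumed key; continuous DDPM would be the alternative
      preconditioning:
        loss_type: l2                  # plain L2 denoising loss
        sigma_data: 0.5                # EDM paper default for the data standard deviation
        p_mean: -1.2                   # mean of the log-normal noise-level sampler (EDM paper)
        p_std: 1.2                     # std of the log-normal noise-level sampler (EDM paper)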

Please note that the scripts provided by NVIDIA are optional to use and may download models built on public data that could contain copyrighted material. It is advisable to consult your legal department before using these scripts.

| Model | Link | Download by Scripts |
|--------|-------------|-----|
| T5-11b | T5-11b link | Yes |
| T5-xxl | T5-xxl link | Yes |