Training with Predefined Configurations

NVIDIA provides five configurations with suggested hyperparameters for the NVIDIA DGX SuperPOD, where each node is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models are in the conf/training/imagen directory. You can modify the parameters in these files to adjust the hyperparameters for your training runs and tune model performance and training efficiency to your requirements.

Imagen training typically involves a cascade of models at increasing resolutions (64x64, 256x256, 1024x1024). Datasets are deliberately alternated to achieve superior image quality. We provide 5 training configurations here:

Base model:
  • base64-2b: Trains the 2B-parameter 64x64 model described in Appendix F.1 of the Imagen paper

  • base64-500m: Trains a 500M-parameter 64x64 model with reduced channel size

SR256 model:
  • sr256-600m: Trains the 600M-parameter 256x256 EfficientUNet model described in Appendix F.2 of the Imagen paper

  • sr256-400m: Trains a 400M-parameter 256x256 UNet model with a configuration similar to DeepFloyd IF-II-M

SR1024 model:
  • sr1024-600m: Trains the 600M-parameter 1024x1024 EfficientUNet model described in Appendix F.3 of the Imagen paper



| Model | Resolution | UNet Model Size (M) | Text Conditioning Model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| 500m_res_64 | 64 | 524 | ‘t5-11b’ | 128 | 4096 | BF16 | O1 | 676M | None | 5.12B |
| 2b_res_64 | 64 | 2100 | ‘t5-11b’ | 32 | 2048 | BF16 | O1 | 676M | None | 5.12B |
| 600m_res_256 | 256 | 646 | ‘t5-11b’ | 64 | 4096 | BF16 | O1 | 544M | resolution >= 256 | 1.23B |
| 400m_res_256 | 256 | 429 | ‘t5-11b’ | 16 | 2048 | BF16 | O1 | 544M | resolution >= 256 | 1.23B |
| 600m_res_1024 | 1024 | 427 | ‘t5-11b’ | 64 | 4096 | BF16 | O1 | 39.5M | resolution >= 1024 | 1.23B |
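The batch-size and sample-count columns in the table above can be related with some simple arithmetic. The sketch below is illustrative only: it assumes that the accumulated global batch size equals the per-GPU batch size times the number of data-parallel GPUs times the number of gradient accumulation steps, and that one optimizer step consumes one global batch. The class and function names are hypothetical and are not part of the NVIDIA scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImagenConfig:
    """One row of the table (hypothetical helper, not a launcher class)."""
    name: str
    batch_per_gpu: int
    global_batch: int
    total_samples: float  # "Total Training Samples Seen"


# Values copied from the table above.
CONFIGS = [
    ImagenConfig("500m_res_64", 128, 4096, 5.12e9),
    ImagenConfig("2b_res_64", 32, 2048, 5.12e9),
    ImagenConfig("600m_res_256", 64, 4096, 1.23e9),
    ImagenConfig("400m_res_256", 16, 2048, 1.23e9),
    ImagenConfig("600m_res_1024", 64, 4096, 1.23e9),
]


def optimizer_steps(cfg):
    """Approximate optimizer steps, assuming one global batch per step."""
    return round(cfg.total_samples / cfg.global_batch)


def gpus_times_accumulation(cfg):
    """Product of data-parallel GPU count and accumulation steps.

    Assumption: global batch = per-GPU batch x GPUs x accumulation steps.
    """
    return cfg.global_batch // cfg.batch_per_gpu


for cfg in CONFIGS:
    print(cfg.name, optimizer_steps(cfg), gpus_times_accumulation(cfg))
```

Under these assumptions, 500m_res_64 works out to roughly 1.25M optimizer steps, and its global batch of 4096 corresponds to 32 per-GPU batches, for example 32 GPUs with no accumulation or 8 GPUs with 4 accumulation steps.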

To enable the training stage with Imagen, configure the following:

  1. In the defaults section, update the training field to point to the desired Imagen configuration file. For example, if you want to start training the base64 500m model from scratch, change the training field to imagen/500m_res_64.yaml.


    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      - training: imagen/500m_res_64.yaml
      ...

  2. In the stages field, make sure the training stage is included. For example,


    stages:
      - data_preparation
      - training
      ...
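As a quick sanity check, the two conditions above can be verified programmatically before launching a job. The following is a minimal sketch, assuming the launcher configuration has already been loaded into a plain Python dictionary (for example by a YAML parser); the function name `check_imagen_training_enabled` is hypothetical and not part of the NVIDIA scripts.

```python
def check_imagen_training_enabled(cfg):
    """Return a list of problems; an empty list means both conditions hold."""
    problems = []

    # Condition 1: the defaults section has a training entry under imagen/.
    training = None
    for entry in cfg.get("defaults", []):
        if isinstance(entry, dict) and "training" in entry:
            training = entry["training"]
    if training is None or not str(training).startswith("imagen/"):
        problems.append("defaults: training does not point to an imagen/ config")

    # Condition 2: the stages list includes the training stage.
    if "training" not in cfg.get("stages", []):
        problems.append("stages: training is missing")

    return problems


# Dictionary form of the example snippets shown above.
example = {
    "defaults": [
        "_self_",
        {"cluster": "bcm"},
        {"data_preparation": "multimodal/download_multimodal"},
        {"training": "imagen/500m_res_64.yaml"},
    ],
    "stages": ["data_preparation", "training"],
}

print(check_imagen_training_enabled(example))  # prints []
```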


Remarks:

1. There is no training dependency between the base and super-resolution models. All models (64x64, 256x256, 1024x1024) can be trained simultaneously and independently, given sufficient computing resources.

2. We recommend preprocessing the training dataset with pre-cached embeddings. Imagen typically uses T5 embeddings, and T5 encoders are large. Loading them during training can significantly reduce training throughput: we observed a significant drop in batch size and throughput when using the online-encoding option.

3. Although the Imagen paper claims that EfficientUNet has better throughput without compromising visual quality, we found empirically that training the regular UNet for the SR model still produces more visually appealing images.

4. We provide two scheduling/sampling options for Imagen training: Continuous DDPM and EDM. Continuous DDPM is the default scheme used in the original paper. Empirically, we found that EDM yields a lower FID score. EDM refers to Karras et al., "Elucidating the Design Space of Diffusion-Based Generative Models" (2022).

5. While the paper uses the T5-xxl (4096 dimension) encoder, we use the T5-11b (1024 dimension) encoder during training due to space considerations.

6. Training Imagen for an extended period is not guaranteed to improve FID/CLIP scores. For best results, we suggest evaluating several checkpoints during the late stages of convergence.
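To give a feel for the space trade-off behind items 2 and 5 in the list above, the sketch below estimates how much storage pre-cached text embeddings would need for each encoder. It is a back-of-the-envelope calculation under stated assumptions, not a description of the actual caching format: the sequence length of 128 tokens and the 2-byte (16-bit) value size are assumptions chosen for illustration, and the helper name is hypothetical. Only the embedding dimensions (1024 and 4096) and the 676M dataset size come from this page.

```python
def embedding_bytes_per_sample(seq_len, hidden_dim, bytes_per_value=2):
    """Bytes needed to store one uncompressed text embedding."""
    return seq_len * hidden_dim * bytes_per_value


SEQ_LEN = 128              # assumption: tokens kept per caption
NUM_SAMPLES = 676_000_000  # effective dataset size of the 64x64 configs

for name, dim in [("T5-11b", 1024), ("T5-xxl", 4096)]:
    per_sample = embedding_bytes_per_sample(SEQ_LEN, dim)
    total_tb = per_sample * NUM_SAMPLES / 1e12
    print(f"{name}: {per_sample} bytes per sample, ~{total_tb:.0f} TB total")
```

Whatever the real sequence length and storage format are, the cached size scales linearly with the embedding dimension, so a 4096-dimension encoder needs four times the space of a 1024-dimension one.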

Please note that the scripts provided by NVIDIA are optional to use, and they may download models based on public data that could contain copyrighted material. It is advisable to consult your legal department before using these scripts.



| Model | Link | Download by Scripts |
|---|---|---|
| T5-11b | T5-11b link | Yes |
| T5-xxl | T5-xxl link | Yes |
© Copyright 2023-2024, NVIDIA. Last updated on Feb 22, 2024.