Training with Predefined Configurations

NVIDIA provides five configurations with suggested hyperparameters, tuned for the NVIDIA DGX SuperPOD, whose nodes are each equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the conf/training/imagen directory. You can modify these parameters to adjust the hyperparameters for your specific training runs and tailor the model's performance and training efficiency to your requirements.

The training process for Imagen typically involves multiple stages of models at different resolutions (64x64, 256x256, 1024x1024). The dataset for each stage is deliberately filtered to its target resolution to achieve superior image quality. We provide five training configurations:

Base model:
  • base64-2b: Trains a 2B-parameter 64x64 model, as described in Appendix F.1 of the Imagen paper.

  • base64-500m: Trains a 500M-parameter 64x64 model with reduced channel size.

SR256 model:
  • sr256-600m: Trains a 600M-parameter 256x256 EfficientUNet model, as described in Appendix F.2 of the Imagen paper.

  • sr256-400m: Trains a 400M-parameter 256x256 UNet model with a configuration similar to DeepFloyd IF-II-M.

SR1024 model:
  • sr1024-600m: Trains a 600M-parameter 1024x1024 EfficientUNet model, as described in Appendix F.3 of the Imagen paper.

| Model | Resolution | UNet Model Size (M) | Text Conditioning Model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| 500m_res_64 | 64 | 524 | t5-11b | 128 | 4096 | BF16 | O1 | 676M | None | 5.12B |
| 2b_res_64 | 64 | 2100 | t5-11b | 32 | 2048 | BF16 | O1 | 676M | None | 5.12B |
| 600m_res_256 | 256 | 646 | t5-11b | 64 | 4096 | BF16 | O1 | 544M | resolution >= 256 | 1.23B |
| 400m_res_256 | 256 | 429 | t5-11b | 16 | 2048 | BF16 | O1 | 544M | resolution >= 256 | 1.23B |
| 600m_res_1024 | 1024 | 427 | t5-11b | 64 | 4096 | BF16 | O1 | 39.5M | resolution >= 1024 | 1.23B |
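Note that Total Training Samples Seen exceeds Effective Dataset Size, i.e., training makes multiple passes over the data: roughly 7.6 epochs for the 64x64 models (5.12B / 676M), about 2.3 epochs for the 256x256 models, and about 31 epochs for the 1024x1024 model.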

To enable the training stage with Imagen, configure the launcher configuration file as follows:

  1. In the defaults section, update the training field to point to the desired Imagen configuration file. For example, to train the base64 500m model from scratch, change the training field to imagen/500m_res_64.yaml:

     ```yaml
     defaults:
       - _self_
       - cluster: bcm
       - data_preparation: multimodal/download_multimodal
       - training: imagen/500m_res_64.yaml
       ...
     ```


  2. In the stages field, make sure the training stage is included. For example:

     ```yaml
     stages:
       - data_preparation
       - training
       ...
     ```
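Once both fields are set, launching the pipeline (typically via the launcher's python main.py entry point) runs the listed stages in order, here data preparation followed by training.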


Remarks:

1. There is no training dependency between the base and super-resolution models: given sufficient computing resources, all models (64x64, 256x256, 1024x1024) can be trained simultaneously and independently.

2. We recommend preprocessing the training dataset with pre-cached text embeddings. Imagen typically conditions on T5 embeddings, and T5 encoders are large: loading them during training significantly reduces throughput, and we observed a substantial drop in feasible batch size and throughput when using the online-encoding option (a pre-caching sketch follows these remarks).

3. Although the Imagen paper claims that EfficientUNet achieves better throughput without compromising visual quality, we empirically found that training a regular UNet for the SR model still produces more visually appealing images.

4. We provide two scheduling/sampling options for Imagen training: continuous DDPM and EDM. Continuous DDPM is the default scheme used in the original paper; empirically, we found that EDM yields a lower (better) FID score (see Karras et al., 2022, "Elucidating the Design Space of Diffusion-Based Generative Models"). A noise-level sampling sketch follows these remarks.

5. While the paper uses the T5-xxl encoder (4096-dimensional embeddings), we use the T5-11b encoder (1024-dimensional embeddings) during training due to space considerations.

6. There is no guarantee that training Imagen for an extended period will necessarily improve FID/CLIP scores. To achieve the best results, we suggest evaluating several checkpoints from the late stages of convergence (see the checkpoint-evaluation sketch below).
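Regarding remark 2, pre-caching text embeddings means the T5 encoder never has to be held in memory during diffusion training. Below is a minimal sketch of offline pre-caching using the Hugging Face transformers library; the checkpoint name t5-11b matches the table above, but the sequence length and output file layout are illustrative assumptions, not the exact NeMo preprocessing pipeline:

```python
# Minimal sketch: pre-cache T5 text embeddings so the encoder is not
# loaded during diffusion training. Sequence length and file layout are
# illustrative, not NeMo's actual preprocessing format.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-11b")
encoder = T5EncoderModel.from_pretrained(
    "t5-11b", torch_dtype=torch.bfloat16
).cuda().eval()

captions = ["a photograph of an astronaut riding a horse"]  # one per image

with torch.no_grad():
    batch = tokenizer(
        captions, padding="max_length", max_length=128,
        truncation=True, return_tensors="pt",
    ).to("cuda")
    # last_hidden_state has shape [batch, seq_len, 1024] for t5-11b.
    embeddings = encoder(
        input_ids=batch.input_ids, attention_mask=batch.attention_mask
    ).last_hidden_state

# Store embeddings and masks alongside the images for training-time lookup.
torch.save(
    {"emb": embeddings.cpu(), "mask": batch.attention_mask.cpu()},
    "sample_0000_t5.pt",
)
```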
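Regarding remark 4, the EDM option changes how per-sample noise levels are drawn during training. The sketch below shows the log-normal sigma sampling and loss weighting from the EDM paper with the paper's default constants; NeMo's actual implementation and hyperparameter choices may differ:

```python
# Sketch of EDM noise-level sampling and loss weighting
# (Karras et al., 2022). Constants are the paper's defaults.
import torch

P_MEAN, P_STD = -1.2, 1.2   # parameters of the log-normal sigma distribution
SIGMA_DATA = 0.5            # assumed standard deviation of the data

def sample_edm_sigmas(batch_size: int) -> torch.Tensor:
    """Draw per-sample noise levels: ln(sigma) ~ N(P_MEAN, P_STD^2)."""
    return torch.exp(P_MEAN + P_STD * torch.randn(batch_size))

def edm_loss_weight(sigma: torch.Tensor) -> torch.Tensor:
    """EDM loss weighting lambda(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2."""
    return (sigma ** 2 + SIGMA_DATA ** 2) / (sigma * SIGMA_DATA) ** 2

sigmas = sample_edm_sigmas(4)
print(sigmas, edm_loss_weight(sigmas))
```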
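Finally, for remark 6, one simple way to compare late-stage checkpoints is to compute FID between a fixed reference set and samples drawn from each checkpoint. Here is a sketch using torchmetrics; the random tensors and checkpoint steps are placeholders for real reference images and generated samples:

```python
# Illustrative sweep over late-stage checkpoints to compare FID scores.
# Random tensors stand in for real reference images and per-checkpoint samples.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_score(real: torch.Tensor, fake: torch.Tensor) -> float:
    """FID between two batches of uint8 images shaped [N, 3, H, W]."""
    metric = FrechetInceptionDistance(feature=2048)
    metric.update(real, real=True)
    metric.update(fake, real=False)
    return metric.compute().item()

real = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
for step in (800_000, 900_000, 1_000_000):  # hypothetical late checkpoints
    fake = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
    print(f"step {step}: FID = {fid_score(real, fake):.2f}")
```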

Please note that the scripts provided by NVIDIA are optional to use, and they may download models based on public data that could contain copyrighted material. It is advisable to consult your legal department before using them.

| Model | Link | Download by Scripts |
|---|---|---|
| T5-11b | T5-11b link | Yes |
| T5-xxl | T5-xxl link | Yes |