Training with Predefined Configurations

We have curated configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, which is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the conf/training/stable_diffusion directory. You can modify these parameters to adjust the hyperparameters for your specific training runs and tailor the model’s performance and training efficiency to your needs.

The training process for Stable Diffusion typically involves multiple stages in which different resolutions and datasets are deliberately alternated to achieve superior image quality. We provide two training configurations here: one for pretraining at a resolution of 256x256 and another for resuming from the pretraining weights and continuing to improve the model’s performance. Note that to keep improving image quality, each stage must load the UNet weights from the previous stage and should ideally switch to a different dataset to increase diversity. We have verified convergence up to SD v1.5 by switching between multiple subsets of our multimodal blend*. Reproducing SD v1.5 using the datasets recommended in the Huggingface model cards is straightforward with our implementation.

Note

Our multimodal dataset originates from Common Crawl with custom filtering and contains 670M image-caption pairs.

| Stage | Resolution | Unet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset size | Dataset Filtering | Total Training Samples Seen |
|-------|------------|---------------------|-------------------------|--------------------|-------------------------------|-----------|-----------|------------------------|-------------------|-----------------------------|
| Pretraining | 256 | 859 | openai/clip-vit-large-patch14 | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v1.1 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 39.5M | Resolution >= 1024x1024 | 409M |
| SD v1.2 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.23B |
| SD v1.5 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |

For SD v2.0 base, the text conditioning model is replaced with OpenCLIP-ViT/H. The training stages are similar to the original configuration: pretraining at 256x256 resolution followed by fine-tuning at 512x512 resolution. You can use the datasets recommended in the Huggingface model cards to reproduce the results of SD v2.0 base.

| Stage | Resolution | Unet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset size | Dataset Filtering | Total Training Samples Seen |
|-------|------------|---------------------|-------------------------|--------------------|-------------------------------|-----------|-----------|------------------------|-------------------|-----------------------------|
| SD v2.0 Pretraining | 256 | 865 | OpenCLIP-ViT/H | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v2.0 Base | 512 | 865 | OpenCLIP-ViT/H | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |

To enable the training stage with Stable Diffusion, configure the following:

  1. In the defaults section, update the training field to point to the desired Stable Diffusion configuration file. For example, to start pretraining from scratch, set the training field to stable_diffusion/860m_res_256_pretrain.yaml.

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      - training: stable_diffusion/860m_res_256_pretrain.yaml
      ...

  2. In the stages field, make sure the training stage is included. For example,

    stages:
      - data_preparation
      - training
      ...

Remark
  1. To continue training the Stable Diffusion model from the pretraining results, we reset the training process and load only the UNet weights. To do this, take the last checkpoint from the previous training and pass it to training.model.unet_config.from_pretrained. Because model parameters are named differently in NeMo checkpoints, set training.model.unet_config.from_NeMo=True when loading a checkpoint trained by NeMo. You can also load the UNet weights from a Huggingface checkpoint; in that case, set training.model.unet_config.from_NeMo=False.

  2. To improve the quality of generated images, it is recommended to utilize pretrained checkpoints for AutoencoderKL and CLIP. We have compiled a list of recommended sources for these checkpoints; note, however, that the AutoencoderKL checkpoint cannot be downloaded via the provided script. Instead, you must download it locally and ensure that the correct path is specified in the configuration file before proceeding.
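As a concrete illustration of the UNet-resume mechanism described above, a resume-stage override might look like the following sketch. The checkpoint path is a placeholder, not a real file; the nesting mirrors the training.model.unet_config.* keys named in the remark.

```yaml
# Sketch only: the checkpoint path below is a placeholder.
training:
  model:
    unet_config:
      # Last checkpoint from the previous (pretraining) stage
      from_pretrained: /results/860m_res_256_pretrain/checkpoints/last.ckpt
      # True because this checkpoint was trained by NeMo; use False for a
      # Huggingface checkpoint, whose parameters are named differently.
      from_NeMo: True
```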

Please be advised that the scripts NVIDIA provides are optional to use and will download models based on public data that may contain copyrighted material. Consult your legal department before using these scripts.

The following are the pretrained checkpoints for SD v1

| Model | Link | Download by script |
|-------|------|--------------------|
| AutoencoderKL | autoencoder | No |
| CLIP | clip-L-14 | Yes |
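Because the AutoencoderKL checkpoint must be downloaded manually, the configuration then needs to point at the local copy. A minimal sketch, assuming the VAE is referenced through a first_stage_config section with a from_pretrained key (these field names are assumptions for illustration; check your configuration file for the exact keys):

```yaml
# Sketch only: path and key names are illustrative assumptions.
training:
  model:
    first_stage_config:
      # Local path to the manually downloaded AutoencoderKL checkpoint
      from_pretrained: /path/to/downloaded/autoencoderkl/checkpoint
```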

The following are the pretrained checkpoints for SD v2.0

| Model | Link | Download by script |
|-------|------|--------------------|
| AutoencoderKL | autoencoderKL | No |
| OpenCLIP | clip-H-14 | Yes |

In the latest update, we have introduced support for using the CLIP encoder provided by NeMo. To learn how to convert weights to NeMo CLIP checkpoints, please refer to Section 6.9 in the documentation. If you prefer to restore the previous behavior and use the HF CLIP encoder, you can find instructions in the comments within the Stable Diffusion configuration files.
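For orientation, switching to a NeMo CLIP encoder amounts to editing the text-conditioning section of the training config. The sketch below assumes a cond_stage_config section with a restore_from_path key; these names are illustrative assumptions, and the comments in the shipped configuration files give the authoritative keys.

```yaml
# Illustrative assumption only: consult the comments in the Stable Diffusion
# configuration files for the exact keys and encoder settings.
training:
  model:
    cond_stage_config:
      # Path to a CLIP checkpoint converted to NeMo format (see Section 6.9)
      restore_from_path: /path/to/nemo_clip_checkpoint.nemo
```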

Note

If you use a NeMo CLIP checkpoint as the CLIP encoder, the checkpoint must remain at the path specified in the configuration file when you load a Stable Diffusion checkpoint.

There is no guarantee that training Stable Diffusion for an extended period will necessarily improve FID/CLIP scores. To achieve the best results, we suggest evaluating several checkpoints from the late stages of convergence.

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.