Training with Predefined Configurations
We have curated configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, whose nodes are each equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the `conf/training/stable_diffusion` directory. You can access and modify the parameters to adjust the hyperparameters for your specific training runs, tailoring the model's performance and training efficiency to your needs.
The training process for Stable Diffusion typically involves multiple stages in which different resolutions and datasets are deliberately alternated to achieve superior image quality. We provide two training configurations: one for pretraining at a resolution of 256x256 and another for resuming from the pretraining weights and continuing to improve the model's performance. Note that to keep improving image quality, each stage must load the UNet weights from the previous stage and should ideally switch to another dataset to increase diversity. We have verified convergence up to SD v1.5 by switching between multiple subsets of our multimodal blend*. Reproducing SD v1.5 using the datasets recommended in the Huggingface model cards is straightforward with our implementation.
Note
Our multimodal dataset originates from Common Crawl with custom filtering and contains 670M image-caption pairs.
| Stage | Resolution | Unet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| Pretraining | 256 | 859 | openai/clip-vit-large-patch14 | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v1.1 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 39.5M | Resolution >= 1024x1024 | 409M |
| SD v1.2 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.23B |
| SD v1.5 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |
For SD v2.0 base, the text conditioning model is replaced with OpenCLIP-ViT/H. The training stages are similar to the original configuration: pretraining at 256x256 resolution followed by fine-tuning at 512x512 resolution. The datasets recommended in the Huggingface model cards can be used to reproduce the results of SD v2.0 base.
| Stage | Resolution | Unet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| SD v2.0 Pretraining | 256 | 865 | OpenCLIP-ViT/H | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v2.0 Base | 512 | 865 | OpenCLIP-ViT/H | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |
To enable the training stage with Stable Diffusion, make sure:

1. In the `defaults` section, the `training` field points to the desired Stable Diffusion configuration file. For example, to start pretraining from scratch, set the training field to `stable_diffusion/860m_res_256_pretrain.yaml`:

```yaml
defaults:
  - _self_
  - cluster: bcm
  - data_preparation: multimodal/download_multimodal
  - training: stable_diffusion/860m_res_256_pretrain.yaml
  ...
```

2. The `stages` field includes the training stage:

```yaml
stages:
  - data_preparation
  - training
  ...
```
Remark
To continue training the Stable Diffusion model from the pretraining results, we reset the training process and load only the UNet weights. You can do this by taking the last checkpoint from the previous training stage and passing it to `training.model.unet_config.from_pretrained`. Because model parameter names differ between frameworks, set `training.model.unet_config.from_NeMo=True` when the checkpoint was trained with NeMo. If you are instead loading the UNet weights from a Huggingface checkpoint, set `training.model.unet_config.from_NeMo=False`.

To improve the quality of generated images, we recommend using pretrained checkpoints for AutoencoderKL and CLIP. We have compiled a list of recommended sources for these checkpoints; note, however, that the AutoencoderKL checkpoint cannot be downloaded via the provided script. Instead, you must download it locally and ensure that the correct path is specified in the configuration file before proceeding.
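For instance, a resume-from-pretraining override in the training configuration might look like the following sketch (the checkpoint path is hypothetical; `from_pretrained` and `from_NeMo` are the fields described above):

```yaml
model:
  unet_config:
    # Hypothetical path to the last checkpoint of the previous training stage
    from_pretrained: /results/860m_res_256_pretrain/checkpoints/last.ckpt
    # True because the checkpoint above was trained with NeMo;
    # use False when loading UNet weights from a Huggingface checkpoint
    from_NeMo: True
```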
Please be advised that the scripts NVIDIA provides are optional to use and will download models based on public data that may contain copyrighted material. Consult your legal department before using these scripts.
The following are the pretrained checkpoints for SD v1:

| Model | Link | Download by script |
|---|---|---|
| AutoencoderKL | | No |
| CLIP | | Yes |
The following are the pretrained checkpoints for SD v2.0:

| Model | Link | Download by script |
|---|---|---|
| AutoencoderKL | | No |
| OpenCLIP | | Yes |
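Because the AutoencoderKL checkpoint must be downloaded manually, the configuration file has to point at the local copy. A minimal sketch, assuming the model section exposes a `first_stage_config` with a `from_pretrained` field (the field names and path are assumptions; verify them against your configuration file):

```yaml
model:
  first_stage_config:
    # Hypothetical local path to the manually downloaded AutoencoderKL checkpoint
    from_pretrained: /path/to/autoencoderkl/checkpoint.ckpt
```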
In the latest update, we have introduced support for using the CLIP encoder provided by NeMo. To learn how to convert weights to NeMo CLIP checkpoints, please refer to Section 6.9 of the documentation. If you prefer to restore the previous behavior and use the HF CLIP encoder, you can find instructions in the comments within the Stable Diffusion configuration files.
Note
If you use a NeMo CLIP checkpoint as the CLIP encoder, the checkpoint must remain at the path specified in the configuration file whenever you load a Stable Diffusion checkpoint.
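As an illustration, the text-encoder section of the configuration might reference the converted checkpoint like this (a sketch only; `cond_stage_config`, `restore_from_path`, and the path are assumptions, and the actual field names are documented in the configuration file comments):

```yaml
model:
  cond_stage_config:
    # Hypothetical path to a converted NeMo CLIP checkpoint; it must stay at
    # this location whenever the Stable Diffusion checkpoint is loaded
    restore_from_path: /checkpoints/openai-clip-vit-large-patch14.nemo
```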
There is no guarantee that training Stable Diffusion for an extended period will improve FID/CLIP scores. To achieve the best results, we suggest evaluating several checkpoints from the late stages of convergence.