Training with Predefined Configurations

We have curated configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, which is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the conf/training/stable_diffusion directory. You can modify these parameters to adjust the hyperparameters for your specific training runs and tailor the model’s performance and training efficiency to your needs.

The training process for Stable Diffusion typically involves multiple stages in which different resolutions and datasets are deliberately alternated to achieve superior image quality. We provide two training configurations here: one for pretraining at a resolution of 256x256 and another for resuming from the pretraining weights and continuing to improve the model’s performance. Note that to keep improving image quality, each stage must load the UNet weights from the previous stage and should ideally switch to a different dataset to increase diversity. We have verified convergence up to SD v1.5 by switching between multiple subsets of our multimodal blend*. Reproducing SD v1.5 using the datasets recommended in the Huggingface model cards is straightforward with our implementation.

Note

Our multimodal dataset originates from Common Crawl with custom filtering and contains 670M image-caption pairs.

| Stage | Resolution | Unet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset size | Dataset Filtering | Total Training Samples Seen |
|-------|------------|---------------------|-------------------------|--------------------|-------------------------------|-----------|-----------|------------------------|-------------------|-----------------------------|
| Pretraining | 256 | 859 | openai/clip-vit-large-patch14 | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v1.1 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 39.5M | Resolution >= 1024x1024 | 409M |
| SD v1.2 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.23B |
| SD v1.5 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |

For SD v2.0 base, the text conditioning model is replaced with OpenCLIP-ViT/H. The training stages are similar to the original configuration: pretraining at 256x256 resolution followed by fine-tuning at 512x512 resolution. You can use the datasets recommended in the Huggingface model cards to reproduce the results of SD v2.0 base.

| Stage | Resolution | Unet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset size | Dataset Filtering | Total Training Samples Seen |
|-------|------------|---------------------|-------------------------|--------------------|-------------------------------|-----------|-----------|------------------------|-------------------|-----------------------------|
| SD v2.0 Pretraining | 256 | 865 | OpenCLIP-ViT/H | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v2.0 Base | 512 | 865 | OpenCLIP-ViT/H | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |

To enable the training stage with Stable Diffusion, configure the following:

  1. In the defaults section, update the training field to point to the desired Stable Diffusion configuration file. For example, to start pretraining from scratch, set the training field to stable_diffusion/860m_res_256_pretrain.yaml.

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      - training: stable_diffusion/860m_res_256_pretrain.yaml
      ...

  2. In the stages field, make sure the training stage is included. For example,

    stages:
      - data_preparation
      - training
      ...

Remark
  1. To continue training the Stable Diffusion model from the pretraining results, we reset the training process and load only the UNet weights. To do this, take the last checkpoint from the previous training and pass it to training.model.unet_config.from_pretrained. Because model parameters are named differently in NeMo checkpoints, set training.model.unet_config.from_NeMo=True when loading a checkpoint trained by NeMo. You can also load the UNet weights from a Huggingface checkpoint; in that case, set training.model.unet_config.from_NeMo=False.

  2. To improve the quality of generated images, it is recommended to utilize pretrained checkpoints for AutoencoderKL and CLIP. We have compiled a list of recommended sources for these checkpoints; note, however, that the AutoencoderKL checkpoint cannot be downloaded via the provided script. Instead, you must download it locally and ensure that the correct path is specified in the configuration file before proceeding.
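As a concrete illustration of the UNet-resume mechanism described above, a resume-stage override might look like the following sketch. The checkpoint path is a placeholder, not a real file; the nesting mirrors the training.model.unet_config.* keys named in the remark.

```yaml
# Sketch only: the checkpoint path below is a placeholder.
training:
  model:
    unet_config:
      # Last checkpoint from the previous (pretraining) stage
      from_pretrained: /results/860m_res_256_pretrain/checkpoints/last.ckpt
      # True because this checkpoint was trained by NeMo; use False for a
      # Huggingface checkpoint, whose parameters are named differently.
      from_NeMo: True
```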

Please be advised that the scripts NVIDIA provides are optional to use and will download models based on public data that may contain copyrighted material. Consult your legal department before using these scripts.

The following are the pretrained checkpoints for SD v1

| Model | Link | Download by script |
|-------|------|--------------------|
| AutoencoderKL | autoencoder | No |
| CLIP | clip-L-14 | Yes |
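Because the AutoencoderKL checkpoint must be downloaded manually, the configuration then needs to point at the local copy. A minimal sketch, assuming the VAE is referenced through a first_stage_config section with a from_pretrained key (these field names are assumptions for illustration; check your configuration file for the exact keys):

```yaml
# Sketch only: path and key names are illustrative assumptions.
training:
  model:
    first_stage_config:
      # Local path to the manually downloaded AutoencoderKL checkpoint
      from_pretrained: /path/to/downloaded/autoencoderkl/checkpoint
```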

The following are the pretrained checkpoints for SD v2.0

| Model | Link | Download by script |
|-------|------|--------------------|
| AutoencoderKL | autoencoderKL | No |
| OpenCLIP | clip-H-14 | Yes |

In the latest update, we have introduced support for using the CLIP encoder provided by NeMo. To learn how to convert weights to NeMo CLIP checkpoints, please refer to Section 6.9 in the documentation. If you prefer to restore the previous behavior and use the HF CLIP encoder, you can find instructions in the comments within the Stable Diffusion configuration files.
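For orientation, switching to a NeMo CLIP encoder amounts to editing the text-conditioning section of the training config. The sketch below assumes a cond_stage_config section with a restore_from_path key; these names are illustrative assumptions, and the comments in the shipped configuration files give the authoritative keys.

```yaml
# Illustrative assumption only: consult the comments in the Stable Diffusion
# configuration files for the exact keys and encoder settings.
training:
  model:
    cond_stage_config:
      # Path to a CLIP checkpoint converted to NeMo format (see Section 6.9)
      restore_from_path: /path/to/nemo_clip_checkpoint.nemo
```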

Note

If you use a NeMo CLIP checkpoint as the CLIP encoder, the checkpoint must remain at the path specified in the configuration file when you load a Stable Diffusion checkpoint.

There is no guarantee that training Stable Diffusion for an extended period will necessarily improve FID/CLIP scores. To achieve the best results, we suggest evaluating several checkpoints from the late stages of convergence.

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.