Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Training with Predefined Configurations

We have curated configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, which is equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the conf/training/stable_diffusion directory. You can modify these parameters to adjust the hyperparameters for your own training runs and tailor the model’s performance and training efficiency to your requirements.
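For illustration, the excerpt below shows the kind of fields such a configuration typically exposes. The key names and values here are assumptions based on common NeMo training configs, not the authoritative schema, so check the actual file under conf/training/stable_diffusion before editing.

    # Illustrative excerpt only -- key names and values are assumptions;
    # consult the shipped config for the authoritative schema.
    trainer:
      precision: 16            # FP16, matching the tables below
      max_steps: 82500         # controls the total training samples seen
    model:
      micro_batch_size: 128    # batch size per GPU
      global_batch_size: 8192  # accumulated global batch size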

The training process for Stable Diffusion typically involves multiple stages in which different resolutions and datasets are deliberately alternated to achieve superior image quality. We provide two training configurations here: one for pretraining at a resolution of 256x256 and another for resuming from the pretraining weights and continuing to improve the model’s performance. To keep improving image quality, it is important to load the UNet weights from the previous stage at each stage. Switching to a different dataset between stages also helps improve diversity. We have verified convergence up to SD v1.5 by switching between multiple subsets of our multimodal blend. Reproducing SD v1.5 using the datasets recommended in the Hugging Face model cards is straightforward with our implementation.
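For example, a later-stage configuration can point the UNet at the weights produced by the previous stage. The snippet below is a minimal sketch; the key names (unet_config, from_pretrained, from_NeMo) are assumptions and should be verified against the shipped configuration files.

    model:
      unet_config:
        # Assumed keys: load the UNet weights saved by the previous training stage
        from_pretrained: /path/to/previous_stage_unet.ckpt
        from_NeMo: True   # set according to the checkpoint format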

Note

Our multimodal dataset, derived from Common Crawl with custom filtering, contains 670 million image-caption pairs.

| Stage | Resolution | UNet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| Pretraining | 256 | 859 | openai/clip-vit-large-patch14 | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v1.1 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 39.5M | Resolution >= 1024x1024 | 409M |
| SD v1.2 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.23B |
| SD v1.5 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |

For SD v2.0 base, the text conditioning model is replaced with OpenCLIP-ViT/H. The training stages are similar to the original configuration: pretraining at 256x256 resolution, followed by fine-tuning at 512x512 resolution. The datasets recommended in the Hugging Face model cards can be used to reproduce the results of SD v2.0 base.
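A hedged sketch of how this text-encoder swap might look in the model configuration is shown below. The _target_ class path and its arguments are assumptions for illustration only; take the exact values from the SD v2.0 configuration shipped with NeMo.

    model:
      cond_stage_config:
        # Illustrative only -- class path and arguments are assumptions.
        _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenOpenCLIPEmbedder
        arch: ViT-H-14
        version: laion2b_s32b_b79k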

| Stage | Resolution | UNet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| SD v2.0 Pretraining | 256 | 865 | OpenCLIP-ViT/H | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v2.0 Base | 512 | 865 | OpenCLIP-ViT/H | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |

The SDXL base model now generates images at a resolution of 1024x1024. It features a UNet architecture 3x larger than its predecessor and incorporates an additional text encoder, OpenCLIP ViT-bigG/14, which significantly increases the model’s size.

| Stage | Resolution | UNet model size (B) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| Stage_1 | 256 | 3.5 | openai/clip-vit-large-patch14 and laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 64 | 2048 | BF16 | O1 | 676M | None | 1.23B |
| Stage_2 | 512 | 3.5 | openai/clip-vit-large-patch14 and laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 16 | 2048 | BF16 | O1 | 218M | Resolution >= 512x512 | 409M |
| Stage_3 | 1024 | 3.5 | openai/clip-vit-large-patch14 and laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 4 | 256 | BF16 | O1 | 39.5M | Resolution >= 1024x1024 | / |

To enable the training stage with Stable Diffusion, complete the steps in this section. Training at this stage proceeds iteratively until satisfactory image quality is achieved.

Enable the Stable Diffusion Training Stage

  1. In the defaults section, update the training field to point to the desired Stable Diffusion configuration file. For example, to start pretraining from scratch, set the training field to stable_diffusion/860m_res_256_pretrain.yaml (a sketch of the later fine-tuning variant follows this list).

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      - training: stable_diffusion/860m_res_256_pretrain.yaml
      ...
    
  2. In the stages field, make sure the training stage is included. For example:

    stages:
      - data_preparation
      - training
      ...
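
For the later fine-tuning stage that resumes from the pretraining weights, only the training entry changes. The file name below is hypothetical and should be replaced with the corresponding 512-resolution configuration actually present under conf/training/stable_diffusion.

    defaults:
      - _self_
      - cluster: bcm
      - data_preparation: multimodal/download_multimodal
      # hypothetical file name for the 512-resolution fine-tuning config
      - training: stable_diffusion/860m_res_512.yaml
      ...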
    

Please be advised that the scripts NVIDIA provides are optional to use and will download models based on public data, which may contain copyrighted material. Consult your legal department before using these scripts.

The following are the pretrained checkpoints for SD v1:

| Model | Link | Download by Script |
|---|---|---|
| AutoencoderKL | link autoencoder | No |
| CLIP | link clip-L-14 | Yes |

The following are the pretrained checkpoints for SD v2.0:

| Model | Link | Download by Script |
|---|---|---|
| AutoencoderKL | link autoencoderKL | No |
| OpenCLIP | link clip-H-14 | Yes |

The following are the pretrained checkpoints for SDXL:

| Model | Link | Download by Script |
|---|---|---|
| AutoencoderKL | link autoencoderKL | No |
| CLIP | link clip-L-14 | Yes |
| OpenCLIP | link clip-G-14 | Yes |

In the latest update, we have introduced support for using the CLIP encoder provided by NeMo. To learn how to convert weights to NeMo CLIP checkpoints, please refer to Section 6.9 in the documentation. If you prefer to restore the previous behavior and use the Hugging Face CLIP encoder, you can find instructions in the comments within the Stable Diffusion configuration files.

Note

If you use a NeMo CLIP checkpoint as the CLIP encoder, the checkpoint must remain at the path specified in the configuration file whenever you load a Stable Diffusion checkpoint.
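As a sketch of what this dependency looks like, the snippet below shows a text-encoder section that references a NeMo CLIP checkpoint by path. The key names (cond_stage_config, restore_from_path) are assumptions for illustration and may differ in the actual configuration files.

    model:
      cond_stage_config:
        # Assumed keys -- the referenced .nemo file must stay at this path
        # whenever the Stable Diffusion checkpoint is loaded.
        restore_from_path: /checkpoints/clip-vit-large-patch14.nemo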

Training Stable Diffusion for an extended period does not guarantee improved FID/CLIP scores. For best results, we suggest evaluating several checkpoints from the late stages of convergence.