Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Training with Predefined Configurations
We have curated configurations with suggested hyperparameters specifically for the NVIDIA DGX SuperPOD, whose nodes are each equipped with 8 NVIDIA A100 80GB GPUs. The configurations for the curated models can be found in the conf/training/stable_diffusion directory. You can modify these hyperparameters for your specific training runs, tailoring the model's performance and training efficiency to your needs.
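For reference, a few of the fields most commonly adjusted in these files are sketched below. The key names follow the typical NeMo launcher training-config layout, but the exact structure and defaults vary between files, so treat this as an illustrative sketch rather than the shipped configuration.

```yaml
# Illustrative excerpt of a conf/training/stable_diffusion/*.yaml file.
# Key names follow the typical NeMo launcher layout; values here are examples only.
trainer:
  num_nodes: 8            # number of nodes used for the run
  devices: 8              # GPUs per node (DGX A100)
  precision: 16           # FP16 mixed precision, matching the tables below
model:
  micro_batch_size: 128   # batch size per GPU (pretraining stage)
  global_batch_size: 8192 # accumulated global batch size
  optim:
    lr: 1e-4              # example learning rate; tune for your own run
```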
The training process for Stable Diffusion typically involves multiple stages that deliberately alternate between resolutions and datasets to achieve superior image quality. We provide two training configurations: one for pretraining at a resolution of 256x256 and another for resuming from the pretraining weights and continuing to improve the model's performance. To keep improving image quality, it is important to load the UNet weights from the previous stage at each new stage; switching to another dataset between stages also helps improve diversity. We have verified convergence up to SD v1.5 by switching between multiple subsets of our multimodal blend. Reproducing SD v1.5 using the datasets recommended in the Hugging Face model cards is straightforward with our implementation.
Note
Our multimodal dataset, derived from Common Crawl with custom filtering, contains 670 million image-caption pairs.
| Stage | Resolution | UNet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| Pretraining | 256 | 859 | openai/clip-vit-large-patch14 | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v1.1 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 39.5M | Resolution >= 1024x1024 | 409M |
| SD v1.2 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.23B |
| SD v1.5 | 512 | 859 | openai/clip-vit-large-patch14 | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |
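As noted above, each stage after pretraining should initialize its UNet from the weights produced by the previous stage. A minimal sketch of what this looks like in the next stage's training config is shown below; the `unet_config.from_pretrained` and `from_NeMo` keys follow the layout used by NeMo Stable Diffusion configs, and the checkpoint path is hypothetical, so verify both against your configuration file.

```yaml
# Sketch: starting SD v1.1 (512x512) from the 256x256 pretraining weights.
# The checkpoint path is hypothetical; point it at the UNet weights saved by your
# previous stage and verify the exact key names against your configuration file.
model:
  unet_config:
    from_pretrained: /results/860m_res_256_pretrain/checkpoints/unet.ckpt  # hypothetical path
    from_NeMo: True   # weights come from a NeMo run rather than a Hugging Face checkpoint
```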
For SD v2.0 base, the text conditioning model is replaced with OpenCLIP-ViT/H. The training stages are similar to the original configuration: pretraining at 256x256 resolution, followed by fine-tuning at 512x512 resolution. The datasets recommended in the Hugging Face model cards can be used to reproduce the results of SD v2.0 base.
| Stage | Resolution | UNet model size (M) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| SD v2.0 Pretraining | 256 | 865 | OpenCLIP-ViT/H | 128 | 8192 | FP16 | O1 | 676M | None | 680M |
| SD v2.0 Base | 512 | 865 | OpenCLIP-ViT/H | 32 | 8192 | FP16 | O1 | 218M | Resolution >= 512x512 | 1.32B |
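In configuration terms, the SD v2.0 encoder swap amounts to replacing the `cond_stage_config` block with an OpenCLIP-ViT/H embedder. The sketch below follows the upstream Stable Diffusion v2 convention; the `_target_` class path and argument names are assumptions, so confirm them against the shipped SD v2 configuration file.

```yaml
# Sketch: OpenCLIP-ViT/H text conditioning for SD v2.0.
# The _target_ path and argument names are assumptions based on the upstream
# SD v2 convention -- check the shipped config before using them.
model:
  cond_stage_config:
    _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenOpenCLIPEmbedder
    arch: ViT-H-14
    version: laion2b_s32b_b79k
    freeze: True
    layer: penultimate   # SD v2 conditions on the penultimate transformer layer
```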
The SDXL base model now generates images at a resolution of 1024x1024. It features a UNet architecture 3x larger than its predecessor and incorporates an additional text encoder, OpenCLIP ViT-bigG/14, which significantly increases the model’s size.
| Stage | Resolution | UNet model size (B) | Text conditioning model | Batch Size per GPU | Accumulated Global Batch Size | Precision | AMP Level | Effective Dataset Size | Dataset Filtering | Total Training Samples Seen |
|---|---|---|---|---|---|---|---|---|---|---|
| Stage_1 | 256 | 3.5 | openai/clip-vit-large-patch14 and laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 64 | 2048 | BF16 | O1 | 676M | None | 1.23B |
| Stage_2 | 512 | 3.5 | openai/clip-vit-large-patch14 and laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 16 | 2048 | BF16 | O1 | 218M | Resolution >= 512x512 | 409M |
| Stage_3 | 1024 | 3.5 | openai/clip-vit-large-patch14 and laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 4 | 256 | BF16 | O1 | 39.5M | Resolution >= 1024x1024 | / |
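In all three tables, the "Accumulated Global Batch Size" is the product of the batch size per GPU, the total number of GPUs, and the number of gradient accumulation steps. The sketch below works through the SDXL rows with an illustrative node count (the tables do not prescribe one):

```yaml
# global_batch_size = micro_batch_size x num_nodes x devices x gradient_accumulation
# Stage_1: 2048 = 64 per GPU x 4 nodes x 8 GPUs x 1 accumulation step   (illustrative node count)
# Stage_3:  256 =  4 per GPU x 4 nodes x 8 GPUs x 2 accumulation steps  (illustrative node count)
trainer:
  num_nodes: 4
  devices: 8
model:
  micro_batch_size: 64
  global_batch_size: 2048
```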
To enable the training stage for Stable Diffusion, complete the steps in this section. Training at this stage proceeds iteratively until satisfactory image quality is achieved.
Enable the Stable Diffusion Training Stage
1. In the `defaults` section, update the `training` field to point to the desired Stable Diffusion configuration file. For example, if you want to start the pretraining from scratch, set the `training` field to `stable_diffusion/860m_res_256_pretrain.yaml`:

   ```yaml
   defaults:
     - _self_
     - cluster: bcm
     - data_preparation: multimodal/download_multimodal
     - training: stable_diffusion/860m_res_256_pretrain.yaml
   ...
   ```
2. In the `stages` field, make sure the training stage is included. For example:

   ```yaml
   stages:
     - data_preparation
     - training
   ...
   ```
Please be advised that the scripts NVIDIA provides are optional to use and will download models based on public data that may contain copyrighted material. Consult your legal department before using these scripts.
The following are the pretrained checkpoints for SD v1:
| Model | Link | Download by Script |
|---|---|---|
| AutoencoderKL | | No |
| CLIP | | Yes |
The following are the pretrained checkpoints for SD v2.0:
| Model | Link | Download by Script |
|---|---|---|
| AutoencoderKL | | No |
| OpenCLIP | | Yes |
The following are the pretrained checkpoints for SDXL:
| Model | Link | Download by Script |
|---|---|---|
| AutoencoderKL | | No |
| CLIP | | Yes |
| OpenCLIP | | Yes |
In the latest update, we have introduced support for using the CLIP encoder provided by NeMo. To learn how to convert weights to NeMo CLIP checkpoints, please refer to Section 6.9 in the documentation. If you prefer to restore the previous behavior and use the Hugging Face CLIP encoder, you can find instructions in the comments within the Stable Diffusion configuration files.
Note
If you use a NeMo CLIP checkpoint as the CLIP encoder, the checkpoint must remain at the path specified in the configuration file whenever you load a Stable Diffusion checkpoint.
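For example, the conditioning block of a config that uses a converted NeMo CLIP checkpoint might look like the sketch below. The `restore_from_path` field name and the checkpoint path are assumptions; the comments in the shipped Stable Diffusion configs give the exact keys for your release.

```yaml
# Sketch: pointing the text encoder at a converted NeMo CLIP checkpoint.
# Field name and path are illustrative; keep the .nemo file at this path whenever
# the Stable Diffusion checkpoint that references it is loaded.
model:
  cond_stage_config:
    restore_from_path: /checkpoints/clip-vit-large-patch14.nemo  # hypothetical path
```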
There is no guarantee that training Stable Diffusion for an extended period will result in improved FID/CLIP scores. For best results, we suggest evaluating multiple checkpoints from the late stages of convergence.