Performance

We evaluate the Stable Diffusion model with FID-CLIP curves and compare it against other open-source checkpoints trained on a similar number of samples.

FID (Fréchet Inception Distance) is a standard metric for evaluating the quality of generated images. It measures the distance between the distribution of real images and the distribution of generated images, using features extracted by a pre-trained Inception model.
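Concretely, FID compares the mean and covariance of the Inception features of the two image sets: FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2(Sigma_r Sigma_g)^(1/2)), where (mu_r, Sigma_r) and (mu_g, Sigma_g) are the feature mean and covariance of the real and generated images, respectively. A lower FID means the generated distribution is closer to the real one.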

The ViT-L/14 version of the CLIP model was used to assess the relevance between text prompts and the generated images.

The evaluation was conducted at several classifier-free guidance scales: 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, and 8.0. For each scale, we generated 30,000 images from prompts randomly selected from the COCO2014 validation set, using 50 PLMS sampling steps, and evaluated the results at a resolution of 256x256.
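As a rough sketch of this protocol (not the exact NeMo evaluation code), the snippet below computes one (CLIP score, FID) point per guidance scale using torchmetrics; the generate_images callable is a hypothetical stand-in for the Stable Diffusion sampler.

```python
# Sketch of the FID-CLIP evaluation loop. `generate_images` is a placeholder
# for the Stable Diffusion sampler, not a NeMo API.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def evaluate_checkpoint(real_images, prompts, generate_images,
                        guidance_scales=(1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)):
    """Return one (CLIP score, FID) point per classifier-free guidance scale.

    real_images: uint8 tensor [N, 3, 256, 256] in [0, 255] (COCO validation images).
    prompts:     list of N caption strings from the COCO2014 validation set.
    """
    curve = []
    for scale in guidance_scales:
        fid = FrechetInceptionDistance(feature=2048)
        clip = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14")

        fid.update(real_images, real=True)
        # Placeholder call: 50 PLMS steps at the given guidance scale, 256x256 output,
        # returning a uint8 tensor [N, 3, 256, 256].
        fake_images = generate_images(prompts, steps=50, guidance_scale=scale)
        fid.update(fake_images, real=False)
        clip.update(fake_images, prompts)

        curve.append((clip.compute().item(), fid.compute().item()))
    return curve
```

Plotting CLIP score on one axis against FID on the other, one point per guidance scale, yields the FID-CLIP curves shown below.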

Our training recipe follows the one outlined in the Stable Diffusion model cards posted on Hugging Face, with certain modifications.

Our multimodal dataset originates from Common Crawl with custom filtering.

Below, we present the results obtained from our own checkpoint after Stable Diffusion training, which can be compared against the open-source Stable Diffusion 1.5.

[Figure: FID-CLIP curves — Stable_Diffusion_FID-CLIP.png]

For Stable Diffusion 2.0 base, we evaluated with the same configuration. The results are presented below; our own checkpoint can be compared against the open-source Stable Diffusion 2.0 base.

[Figure: FID-CLIP curves for SD 2.0 base — Stable_Diffusion_2.0_FID-CLIP.png]

We measured the training throughput of Stable Diffusion models on varying numbers of DGX A100 and DGX H100 nodes and achieved near-linear scaling on both platforms.

We compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines running the same configuration. This is an apples-to-apples comparison: the two machine types are evaluated under equivalent conditions and configurations.

The tables and charts below show the performance results for SD v2.

  • NVIDIA DGX SuperPODs (64 x 8 x A100 80GB for Stable Diffusion Res=512 model)

Nodes                               1         2         4         8         16        32        64
Samples per Second                  268.31    540.14    1081.31   2138.23   4208.80   8144.76   15917.61
Perfect Linear Scaling (Samples)    268.31    536.63    1073.26   2146.53   4293.05   8586.10   17172.20
Speedup                             1x        2.01x     4.03x     7.97x     15.69x    30.36x    59.32x
[Figure: Stable Diffusion (Res=512) NeMo Megatron throughput on A100 — Stable_Diffusion_(Res=512)_NeMo_Megatron_Throughput_(A100).svg]
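The "Perfect Linear Scaling" and "Speedup" rows are simple functions of the measured single-node throughput; a minimal sketch that reproduces them from the A100 numbers above:

```python
# Recompute the scaling rows of the A100 table from the measured throughput.
nodes = [1, 2, 4, 8, 16, 32, 64]
samples_per_sec = [268.31, 540.14, 1081.31, 2138.23, 4208.80, 8144.76, 15917.61]

for n, sps in zip(nodes, samples_per_sec):
    perfect = samples_per_sec[0] * n     # single-node throughput scaled by node count
    speedup = sps / samples_per_sec[0]   # measured speedup over one node
    efficiency = sps / perfect           # fraction of perfect linear scaling achieved
    print(f"{n:>2} nodes: perfect={perfect:9.2f}  speedup={speedup:6.2f}x  "
          f"efficiency={efficiency:6.1%}")
```

At 64 A100 nodes this gives a 59.32x speedup, about 93% of perfect linear scaling.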

  • NVIDIA DGX SuperPODs (64 x 8 x H100 80GB for Stable Diffusion Res=512 model)

Nodes                               1         2         4         8         16        32        64
Samples per Second                  511.90    997.86    2019.72   3856.87   7177.15   13326.83  25952.80
Perfect Linear Scaling (Samples)    511.90    1023.80   2047.59   4095.18   8190.36   16380.72  32761.45
Speedup                             1x        1.95x     3.95x     7.53x     14.02x    26.03x    50.7x
[Figure: Stable Diffusion (Res=512) NeMo Megatron throughput on H100 — Stable_Diffusion_(Res=512)_NeMo_Megatron_Throughput_(H100).svg]

  • DGX A100 vs. DGX H100: A Comparative Analysis of Stable Diffusion Training

Model                        Nodes   Global Batch   Micro Batch   Precision   Global Batch/Sec (A100)   Global Batch/Sec (H100)   Speedup (x)
Stable Diffusion (Res=512)   4       1024           32            amp fp16    1.056                     1.972                     1.9
[Figure: Stable Diffusion training throughput comparison, A100 vs. H100 — Stable_Diffusion_Training_Throughput_Comparison.svg]

Latency timing starts directly before the text encoding (CLIP) and stops directly after the output image decoding (VAE). For the framework numbers, we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TRT, we export the various models with FP16 acceleration and use the optimized TRT engine setup present in the deployment directory, so both sets of numbers are obtained in the same environment.
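A minimal sketch of that timing window, assuming a pipeline object with separate text-encode, sampling, and decode stages (hypothetical method names, not the actual deployment API):

```python
# Sketch of the latency measurement: the timer starts immediately before
# CLIP text encoding and stops immediately after VAE image decoding.
# `pipe` and its stage methods are hypothetical stand-ins.
import time
import torch

def measure_latency(pipe, prompt, num_images_per_prompt=1, steps=50):
    torch.cuda.synchronize()              # drain any pending GPU work first
    start = time.perf_counter()

    with torch.autocast("cuda", dtype=torch.float16):    # framework path: Torch AMP FP16
        text_emb = pipe.encode_text(prompt)              # CLIP text encoder
        latents = pipe.sample(text_emb, steps=steps,     # 50 PLMS denoising steps
                              num_images_per_prompt=num_images_per_prompt)
        images = pipe.decode(latents)                    # VAE decoder

    torch.cuda.synchronize()              # wait for the decode to actually finish
    return time.perf_counter() - start    # seconds per batch
```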

The following is SD v1.5 inference performance.

GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: Synonymous with num_images_per_prompt

Model                        Batch Size   Sampler   Inference Steps   TRT FP16 Latency (s)   FW FP16 (AMP) Latency (s)   TRT vs FW Speedup (x)
Stable Diffusion (Res=512)   1            PLMS      50                0.9                    3.3                         3.7
Stable Diffusion (Res=512)   2            PLMS      50                1.7                    5.2                         3.1
Stable Diffusion (Res=512)   4            PLMS      50                2.9                    9.2                         3.2

The following is SD v2.0 inference performance.

GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: Synonymous with num_images_per_prompt

Model                        Batch Size   Sampler   Inference Steps   TRT FP16 Latency (s)   FW FP16 (AMP) Latency (s)   TRT vs FW Speedup (x)
Stable Diffusion (Res=512)   1            PLMS      50                0.9                    3.2                         3.5
Stable Diffusion (Res=512)   2            PLMS      50                1.6                    5.0                         3.2
Stable Diffusion (Res=512)   4            PLMS      50                2.7                    8.5                         3.1
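As a point of reference for what "Batch Size = num_images_per_prompt" means, the call below uses the Hugging Face diffusers API, which exposes the same parameter; diffusers is an illustrative example only, not the NeMo/TRT path measured above.

```python
# Illustration of batch size == num_images_per_prompt using diffusers
# (example API only; the latency numbers above come from the NeMo/TRT setup).
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)  # PLMS-style sampler

images = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    num_images_per_prompt=4,   # batch size 4, matching the last table row
).images
```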