Performance

Training Quality Results

We evaluate our Stable Diffusion model with FID-CLIP curves, comparing it against other open-source checkpoints trained on the same number of samples.

FID (Fréchet Inception Distance) is a metric used to evaluate the quality of generated images in machine learning. It measures the distance between the real image distribution and the distribution of generated images using the features extracted by a pre-trained Inception model.
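Concretely, FID is the Fréchet distance between two Gaussians fitted to the feature statistics of the real and generated sets. The following is a minimal sketch of that computation, using toy random features in place of Inception activations:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2*sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    # sqrtm can pick up tiny imaginary parts from numerical error
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Toy feature sets; in practice these would be Inception-v3 activations
# of 30,000 real and 30,000 generated images.
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))
fake = rng.normal(loc=0.5, size=(1000, 8))

mu_r, sig_r = real.mean(0), np.cov(real, rowvar=False)
mu_f, sig_f = fake.mean(0), np.cov(fake, rowvar=False)
fid = frechet_distance(mu_r, sig_r, mu_f, sig_f)
print(f"toy FID: {fid:.3f}")
```

Lower FID indicates that the generated distribution is closer to the real one; identical statistics give a distance of zero.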

The ViT-L/14 version of the CLIP model was used to assess the relevance between text prompts and generated images.

The evaluation was conducted at several classifier-free guidance scales: 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, and 8.0. For each scale, we generated 30,000 images from randomly selected prompts in the COCO2014 validation dataset, using 50 PLMS steps, and evaluated the results at a resolution of 256x256.
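The guidance scales above control how strongly the conditional prediction is weighted at each denoising step. As a minimal sketch (the standard classifier-free guidance combination, not NeMo's specific implementation), the guided noise prediction is formed from the unconditional and conditional predictions as follows:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: push the unconditional prediction
    # toward the conditional one, scaled by the guidance weight.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy predictions standing in for U-Net outputs.
eps_u = np.zeros(4)
eps_c = np.ones(4)

print(cfg_combine(eps_u, eps_c, 1.0))  # scale 1.0 reproduces the conditional prediction
print(cfg_combine(eps_u, eps_c, 7.0))  # larger scales extrapolate further toward the prompt
```

Higher scales trade diversity for prompt adherence, which is why the FID-CLIP curve is traced by sweeping the scale.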

We referred to the training recipe outlined in the Stable Diffusion model cards posted on Hugging Face, with certain modifications.

Our multimodal dataset originates from Common Crawl, with custom filtering applied.

Below, we present results from our own checkpoint after Stable Diffusion training, which can be compared to those of the open-source Stable Diffusion 1.5.

Stable Diffusion FID-CLIP Results

For Stable Diffusion 2.0 base, we used the same evaluation configuration. The results are presented below; our own checkpoint can be compared to the open-source Stable Diffusion 2.0 base.

Stable Diffusion 2.0 FID-CLIP Results

Training Performance Results

We measured the throughput of training Stable Diffusion models on different numbers of DGX A100 nodes and DGX H100 nodes, and we achieved near-linear scaling on both platforms.

We compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines using the same configuration. This is an apples-to-apples comparison, ensuring that we evaluate the relative performance of the two machine types under equivalent conditions.

The tables and charts below show the performance results for SD v2.

  • NVIDIA DGX SuperPODs (64 x 8 x A100 80GB for Stable Diffusion Res=512 model)

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Samples per Second | 268.31 | 540.14 | 1081.31 | 2138.23 | 4208.80 | 8144.76 | 15917.61 |
| Perfect Linear Scaling (Samples) | 268.31 | 536.63 | 1073.26 | 2146.53 | 4293.05 | 8586.10 | 17172.20 |
| Speedup | 1x | 2.01x | 4.03x | 7.97x | 15.69x | 30.36x | 59.32x |

../../../_images/Stable_Diffusion_(Res=512)_NeMo_Megatron_Throughput_(A100).svg
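
The speedup and perfect-linear-scaling rows above follow directly from the measured throughput. As a quick sketch, they can be recomputed from the single-node baseline (the same arithmetic applies to the H100 table below):

```python
# Throughput figures from the A100 table above (samples/sec per node count).
nodes = [1, 2, 4, 8, 16, 32, 64]
samples_per_sec = [268.31, 540.14, 1081.31, 2138.23, 4208.80, 8144.76, 15917.61]

base = samples_per_sec[0]
for n, s in zip(nodes, samples_per_sec):
    speedup = s / base        # measured speedup over 1 node
    ideal = base * n          # perfect linear scaling
    efficiency = s / ideal    # fraction of ideal throughput achieved
    print(f"{n:3d} nodes: {speedup:6.2f}x speedup, {efficiency:6.1%} efficiency")
```

At 64 A100 nodes the measured 59.32x speedup corresponds to roughly 93% scaling efficiency, consistent with the near-linear scaling claim.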
  • NVIDIA DGX SuperPODs (64 x 8 x H100 80GB for Stable Diffusion Res=512 model)

| Nodes | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Samples per Second | 511.90 | 997.86 | 2019.72 | 3856.87 | 7177.15 | 13326.83 | 25952.80 |
| Perfect Linear Scaling (Samples) | 511.90 | 1023.80 | 2047.59 | 4095.18 | 8190.36 | 16380.72 | 32761.45 |
| Speedup | 1x | 1.95x | 3.95x | 7.53x | 14.02x | 26.03x | 50.7x |

../../../_images/Stable_Diffusion_(Res=512)_NeMo_Megatron_Throughput_(H100).svg
  • DGX A100 vs. DGX H100: A Comparative Analysis of Stable Diffusion Training

| Model | Nodes | Global Batch | Micro Batch | Precision | Global Batch/Sec (A100) | Global Batch/Sec (H100) | Speedup (x) |
|---|---|---|---|---|---|---|---|
| Stable Diffusion (Res=512) | 4 | 1024 | 32 | amp fp16 | 1.056 | 1.972 | 1.9 |

../../../_images/Stable_Diffusion_Training_Throughput_Comparison.svg
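The global-batches-per-second figures in the comparison table convert directly to samples per second, and the speedup column is just their ratio. A quick sketch of the arithmetic:

```python
# Values from the A100 vs. H100 comparison table above.
global_batch = 1024
a100_batches_per_sec = 1.056
h100_batches_per_sec = 1.972

a100_samples = a100_batches_per_sec * global_batch  # samples/sec on 4 A100 nodes
h100_samples = h100_batches_per_sec * global_batch  # samples/sec on 4 H100 nodes
speedup = h100_batches_per_sec / a100_batches_per_sec

print(f"A100: {a100_samples:.0f} samples/s, H100: {h100_samples:.0f} samples/s, "
      f"speedup {speedup:.1f}x")
```

Note that the A100 figure (about 1081 samples/sec) matches the 4-node entry in the A100 throughput table above, confirming the two tables report the same runs.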

Inference Performance Results

Latency timing starts directly before text encoding (CLIP) and stops directly after output image decoding (VAE). For the framework baseline, we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TRT, we export the models with FP16 acceleration and use the optimized TRT engine setup present in the deployment directory, so the numbers are obtained in the same environment as the framework.
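The measurement window described above can be sketched as follows. The stage functions here are hypothetical stand-ins for the real CLIP encoder, denoising loop, and VAE decoder; on GPU you would also need to synchronize (e.g. torch.cuda.synchronize()) before reading the clock:

```python
import time

# Hypothetical stand-in stages; the real pipeline runs CLIP text encoding,
# iterative denoising, and VAE image decoding here.
def encode_text(prompt):
    return [0.0] * 8

def denoise(latent, steps):
    time.sleep(0.01)  # simulate 50 sampler steps
    return latent

def decode_image(latent):
    return latent

start = time.perf_counter()            # directly before text encoding
latent = encode_text("a photo of a cat")
latent = denoise(latent, steps=50)
image = decode_image(latent)
latency = time.perf_counter() - start  # directly after image decoding
print(f"end-to-end latency: {latency:.3f} s")
```

Keeping the timing window identical for the TRT and framework runs is what makes the speedup ratios in the tables below comparable.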

GPU: NVIDIA DGX A100 (1x A100 80 GB). Batch size is synonymous with num_images_per_prompt.

| Model | Batch Size | Sampler | Inference Steps | TRT FP16 Latency (s) | FW FP16 (AMP) Latency (s) | TRT vs FW Speedup (x) |
|---|---|---|---|---|---|---|
| Stable Diffusion (Res=512) | 1 | PLMS | 50 | 0.9 | 3.3 | 3.7 |
| Stable Diffusion (Res=512) | 2 | PLMS | 50 | 1.7 | 5.2 | 3.1 |
| Stable Diffusion (Res=512) | 4 | PLMS | 50 | 2.9 | 9.2 | 3.2 |

The following shows SD v2.0 inference performance. GPU: NVIDIA DGX A100 (1x A100 80 GB). Batch size is synonymous with num_images_per_prompt.

| Model | Batch Size | Sampler | Inference Steps | TRT FP16 Latency (s) | FW FP16 (AMP) Latency (s) | TRT vs FW Speedup (x) |
|---|---|---|---|---|---|---|
| Stable Diffusion (Res=512) | 1 | PLMS | 50 | 0.9 | 3.2 | 3.5 |
| Stable Diffusion (Res=512) | 2 | PLMS | 50 | 1.6 | 5.0 | 3.2 |
| Stable Diffusion (Res=512) | 4 | PLMS | 50 | 2.7 | 8.5 | 3.1 |