Performance

We evaluate the Stable Diffusion model with FID-CLIP curves and compare it against other open-source checkpoints trained on a similar number of samples.

FID (Fréchet Inception Distance) is a standard metric for evaluating the quality of generated images. It measures the distance between the distribution of real images and the distribution of generated images, using features extracted by a pre-trained Inception model.
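Concretely, FID compares the mean and covariance of the Inception features of the two image sets: FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2(Sigma_r Sigma_g)^(1/2)), where (mu_r, Sigma_r) and (mu_g, Sigma_g) are the feature mean and covariance of the real and generated images, respectively. A lower FID means the generated distribution is closer to the real one.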

The ViT-L/14 version of the CLIP model was used to assess the relevance between text prompts and the generated images.

The evaluation was conducted at several classifier-free guidance scales: 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, and 8.0. For each scale, we generated 30,000 images from prompts randomly selected from the COCO2014 validation set, using 50 PLMS sampling steps, and evaluated the results at a resolution of 256x256.
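As a rough sketch of this protocol (not the exact NeMo evaluation code), the snippet below computes one (CLIP score, FID) point per guidance scale using torchmetrics; the generate_images callable is a hypothetical stand-in for the Stable Diffusion sampler.

```python
# Sketch of the FID-CLIP evaluation loop. `generate_images` is a placeholder
# for the Stable Diffusion sampler, not a NeMo API.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def evaluate_checkpoint(real_images, prompts, generate_images,
                        guidance_scales=(1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)):
    """Return one (CLIP score, FID) point per classifier-free guidance scale.

    real_images: uint8 tensor [N, 3, 256, 256] in [0, 255] (COCO validation images).
    prompts:     list of N caption strings from the COCO2014 validation set.
    """
    curve = []
    for scale in guidance_scales:
        fid = FrechetInceptionDistance(feature=2048)
        clip = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14")

        fid.update(real_images, real=True)
        # Placeholder call: 50 PLMS steps at the given guidance scale, 256x256 output,
        # returning a uint8 tensor [N, 3, 256, 256].
        fake_images = generate_images(prompts, steps=50, guidance_scale=scale)
        fid.update(fake_images, real=False)
        clip.update(fake_images, prompts)

        curve.append((clip.compute().item(), fid.compute().item()))
    return curve
```

Plotting CLIP score on one axis against FID on the other, one point per guidance scale, yields the FID-CLIP curves shown below.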

Our training recipe follows the one outlined in the Stable Diffusion model cards posted on Hugging Face, with certain modifications.

Our multimodal dataset originates from Common Crawl with custom filtering.

Below, we present the results obtained from our own checkpoint after Stable Diffusion training, which can be compared against the open-source Stable Diffusion 1.5.

[Figure: FID-CLIP curves — Stable_Diffusion_FID-CLIP.png]

For Stable Diffusion 2.0 base, we evaluated with the same configuration. The results are presented below; our own checkpoint can be compared against the open-source Stable Diffusion 2.0 base.

[Figure: FID-CLIP curves for SD 2.0 base — Stable_Diffusion_2.0_FID-CLIP.png]

We measured the training throughput of Stable Diffusion models on varying numbers of DGX A100 and DGX H100 nodes and achieved near-linear scaling on both platforms.

We compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines running the same configuration. This is an apples-to-apples comparison: the two machine types are evaluated under equivalent conditions and configurations.

The tables and charts below show the performance results for SD v2.

  • NVIDIA DGX SuperPODs (64 x 8 x A100 80GB for Stable Diffusion Res=512 model)

Nodes                               1         2         4         8         16        32        64
Samples per Second                  268.31    540.14    1081.31   2138.23   4208.80   8144.76   15917.61
Perfect Linear Scaling (Samples)    268.31    536.63    1073.26   2146.53   4293.05   8586.10   17172.20
Speedup                             1x        2.01x     4.03x     7.97x     15.69x    30.36x    59.32x
[Figure: Stable Diffusion (Res=512) NeMo Megatron throughput on A100 — Stable_Diffusion_(Res=512)_NeMo_Megatron_Throughput_(A100).svg]
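The "Perfect Linear Scaling" and "Speedup" rows are simple functions of the measured single-node throughput; a minimal sketch that reproduces them from the A100 numbers above:

```python
# Recompute the scaling rows of the A100 table from the measured throughput.
nodes = [1, 2, 4, 8, 16, 32, 64]
samples_per_sec = [268.31, 540.14, 1081.31, 2138.23, 4208.80, 8144.76, 15917.61]

for n, sps in zip(nodes, samples_per_sec):
    perfect = samples_per_sec[0] * n     # single-node throughput scaled by node count
    speedup = sps / samples_per_sec[0]   # measured speedup over one node
    efficiency = sps / perfect           # fraction of perfect linear scaling achieved
    print(f"{n:>2} nodes: perfect={perfect:9.2f}  speedup={speedup:6.2f}x  "
          f"efficiency={efficiency:6.1%}")
```

At 64 A100 nodes this gives a 59.32x speedup, about 93% of perfect linear scaling.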

  • NVIDIA DGX SuperPODs (64 x 8 x H100 80GB for Stable Diffusion Res=512 model)

Nodes                               1         2         4         8         16        32        64
Samples per Second                  511.90    997.86    2019.72   3856.87   7177.15   13326.83  25952.80
Perfect Linear Scaling (Samples)    511.90    1023.80   2047.59   4095.18   8190.36   16380.72  32761.45
Speedup                             1x        1.95x     3.95x     7.53x     14.02x    26.03x    50.7x
[Figure: Stable Diffusion (Res=512) NeMo Megatron throughput on H100 — Stable_Diffusion_(Res=512)_NeMo_Megatron_Throughput_(H100).svg]

  • DGX A100 vs. DGX H100: A Comparative Analysis of Stable Diffusion Training

Model                        Nodes   Global Batch   Micro Batch   Precision   Global Batch/Sec (A100)   Global Batch/Sec (H100)   Speedup (x)
Stable Diffusion (Res=512)   4       1024           32            amp fp16    1.056                     1.972                     1.9
[Figure: Stable Diffusion training throughput comparison, A100 vs. H100 — Stable_Diffusion_Training_Throughput_Comparison.svg]

Latency timing starts directly before the text encoding (CLIP) and stops directly after the output image decoding (VAE). For the framework numbers, we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TRT, we export the various models with FP16 acceleration and use the optimized TRT engine setup present in the deployment directory, so both sets of numbers are obtained in the same environment.
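A minimal sketch of that timing window, assuming a pipeline object with separate text-encode, sampling, and decode stages (hypothetical method names, not the actual deployment API):

```python
# Sketch of the latency measurement: the timer starts immediately before
# CLIP text encoding and stops immediately after VAE image decoding.
# `pipe` and its stage methods are hypothetical stand-ins.
import time
import torch

def measure_latency(pipe, prompt, num_images_per_prompt=1, steps=50):
    torch.cuda.synchronize()              # drain any pending GPU work first
    start = time.perf_counter()

    with torch.autocast("cuda", dtype=torch.float16):    # framework path: Torch AMP FP16
        text_emb = pipe.encode_text(prompt)              # CLIP text encoder
        latents = pipe.sample(text_emb, steps=steps,     # 50 PLMS denoising steps
                              num_images_per_prompt=num_images_per_prompt)
        images = pipe.decode(latents)                    # VAE decoder

    torch.cuda.synchronize()              # wait for the decode to actually finish
    return time.perf_counter() - start    # seconds per batch
```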

The following is SD v1.5 inference performance.

GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: Synonymous with num_images_per_prompt

Model                        Batch Size   Sampler   Inference Steps   TRT FP16 Latency (s)   FW FP16 (AMP) Latency (s)   TRT vs FW Speedup (x)
Stable Diffusion (Res=512)   1            PLMS      50                0.9                    3.3                         3.7
Stable Diffusion (Res=512)   2            PLMS      50                1.7                    5.2                         3.1
Stable Diffusion (Res=512)   4            PLMS      50                2.9                    9.2                         3.2

The following is SD v2.0 inference performance.

GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: Synonymous with num_images_per_prompt

Model                        Batch Size   Sampler   Inference Steps   TRT FP16 Latency (s)   FW FP16 (AMP) Latency (s)   TRT vs FW Speedup (x)
Stable Diffusion (Res=512)   1            PLMS      50                0.9                    3.2                         3.5
Stable Diffusion (Res=512)   2            PLMS      50                1.6                    5.0                         3.2
Stable Diffusion (Res=512)   4            PLMS      50                2.7                    8.5                         3.1
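As a point of reference for what "Batch Size = num_images_per_prompt" means, the call below uses the Hugging Face diffusers API, which exposes the same parameter; diffusers is an illustrative example only, not the NeMo/TRT path measured above.

```python
# Illustration of batch size == num_images_per_prompt using diffusers
# (example API only; the latency numbers above come from the NeMo/TRT setup).
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)  # PLMS-style sampler

images = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50,
    num_images_per_prompt=4,   # batch size 4, matching the last table row
).images
```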