Performance

Training Accuracy Results

We evaluate the Imagen model with an FID-CLIP curve and compare it to other open-source checkpoints at the same scale of consumed samples.

FID (Fréchet Inception Distance) is a metric used to evaluate the quality of generated images in machine learning. It measures the distance between the real image distribution and the distribution of generated images using the features extracted by a pre-trained Inception model.
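
For reference, the standard closed form of FID is shown below. It rests on the usual assumption that the Inception features of each image set follow a multivariate Gaussian. The symbols are introduced here for illustration and do not appear elsewhere in this document: $\mu_r, \Sigma_r$ are the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ are those of the generated-image features. Lower values indicate that the two distributions are closer.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$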

The ViT-L/14 version of the CLIP model was used to assess the relevance between the prompts and the generated images.

The evaluation was conducted at several classifier-free guidance scales: 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, and 6.0. For each scale, we generated 30,000 images from randomly selected prompts in the COCO2014 validation dataset, using 30 EDM steps on the base64 model and 20 EDM steps on the sr256 model, and evaluated the results at a resolution of 256x256.
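
As a rough illustration of what a guidance scale controls, the sketch below applies the commonly used classifier-free guidance blend to a pair of toy prediction vectors. The function name `apply_cfg` and the toy values are hypothetical and are not taken from the training or evaluation code. Only the list of scales comes from the evaluation described above.

```python
from typing import List

# Guidance scales swept in the evaluation described above.
GUIDANCE_SCALES = [1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0]


def apply_cfg(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Blend predictions as uncond + scale * (cond - uncond).

    A scale of 1.0 returns the conditional prediction unchanged;
    larger scales push the result further toward the prompt.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Toy stand-ins for the model's conditional and unconditional outputs.
    cond = [0.5, -1.0]
    uncond = [0.25, 0.0]
    for scale in GUIDANCE_SCALES:
        print(scale, apply_cfg(cond, uncond, scale))
```

Sweeping the scale this way is what produces one (FID, CLIP score) point per scale on the curve.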

We followed the training recipe outlined in the Imagen paper, with certain modifications.

Please note that our curve cannot be directly compared to the plots presented in the paper for several reasons:

Dataset Discrepancy: Our dataset differs from the one used by the Imagen research team, and it is also smaller in size.

Model Variation: To ensure convergence, we chose to train a smaller variant of the model (500M) instead of the proposed 2B variant.

Encoder Difference: The paper uses a T5-XXL encoder with 4096 dimensions, whereas we used a T5-11B encoder with 1024 dimensions during training. We made this choice because of the disk space required to store the precached T5 embeddings.

Additionally, the FID score we obtained is slightly higher (worse) than our Stable Diffusion results. This is because we trained Imagen on only a subset of the dataset, as precaching T5 embeddings consumes a large amount of disk space.

Our multimodal dataset originates from Common Crawl, with custom filtering applied.

Below, we present the outcomes obtained from our own checkpoint following [Section 6.3.7](#637-imagen-training).

Training Performance Results

We measured the throughput of training and fine-tuning Imagen models on different numbers of DGX A100 nodes and DGX H100 nodes, and we achieved near-linear scaling on both platforms.

We compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines using the same configuration. This is an apples-to-apples comparison, so the relative performance of the two machine types is evaluated under equivalent conditions.

The tables and charts below show the performance results.

Pretraining Performance:

NVIDIA DGX SuperPODs (16 x 8 x A100 80GB for Imagen Base 2B model)

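Model	Metric	1 Node	2 Nodes	4 Nodes	8 Nodes	16 Nodes
ImagenBase (2B, res=64)	Samples per Second	264.16	527.65	987.99	2028.73	3985.98
ImagenBase (2B, res=64)	Perfect Linear Scaling (Samples per Second)	264.16	528.32	1056.64	2113.29	4226.59
ImagenBase (2B, res=64)	Speedup (Scaling Efficiency)	1x	2x (99.87%)	3.74x (93.5%)	7.68x (96%)	15.09x (94.31%)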
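
The speedup and percentage values in these scaling tables can be reproduced from the measured throughput alone. The sketch below assumes that speedup is throughput relative to the single-node run and that the percentage is measured throughput divided by perfect linear scaling. The helper name `scaling_report` is hypothetical. The input numbers are the A100 2B measurements from the table above.

```python
from typing import Dict


def scaling_report(throughput: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Compute ideal throughput, speedup, and scaling efficiency per node count.

    All values are relative to the smallest node count in `throughput`.
    """
    base_nodes = min(throughput)
    base_rate = throughput[base_nodes]
    report = {}
    for nodes in sorted(throughput):
        # Perfect linear scaling: throughput grows in proportion to node count.
        ideal = base_rate * nodes / base_nodes
        report[nodes] = {
            "ideal": ideal,
            "speedup": throughput[nodes] / base_rate,
            "efficiency_pct": 100.0 * throughput[nodes] / ideal,
        }
    return report


# Samples per second for ImagenBase (2B, res=64) on DGX A100 nodes.
A100_2B = {1: 264.16, 2: 527.65, 4: 987.99, 8: 2028.73, 16: 3985.98}

if __name__ == "__main__":
    for nodes, row in scaling_report(A100_2B).items():
        print(
            f"{nodes:>2} nodes: {row['speedup']:.2f}x, "
            f"{row['efficiency_pct']:.2f}% of linear"
        )
```

The same function can be applied to the other scaling tables in this section by swapping in their samples-per-second rows.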

NVIDIA DGX SuperPODs (64 x 8 x A100 80GB for Imagen Base 500M model)

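Model	Metric	1 Node	2 Nodes	4 Nodes	8 Nodes	16 Nodes
ImagenBase (500M, res=64)	Samples per Second	902.69	1773.35	3473.26	6692.86	13145.11
ImagenBase (500M, res=64)	Perfect Linear Scaling (Samples per Second)	902.69	1805.38	3610.77	7221.54	14443.08
ImagenBase (500M, res=64)	Speedup (Scaling Efficiency)	1x	1.96x (98.23%)	3.85x (96.19%)	7.41x (92.68%)	14.56x (91.01%)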

NVIDIA DGX SuperPODs (16 x 8 x H100 80GB for Imagen Base 2B model)

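Model	Metric	1 Node	2 Nodes	4 Nodes	8 Nodes	16 Nodes
ImagenBase (2B, res=64)	Samples per Second	495.52	964.16	1922.88	3748.48	7484.16
ImagenBase (2B, res=64)	Perfect Linear Scaling (Samples per Second)	495.52	991.04	1982.08	3964.16	7928.32
ImagenBase (2B, res=64)	Speedup (Scaling Efficiency)	1x	1.95x (97.29%)	3.88x (97.01%)	7.56x (94.56%)	15.1x (94.40%)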

DGX A100 vs. DGX H100: A Comparative Analysis of Imagen Training

Model	Nodes	Global Batch Size	Micro Batch Size	Precision	Global Batch / Sec (A100)	Global Batch / Sec (H100)	Speedup (x)
ImagenBase (500M, Res=64)	4	2048	64	bf16 (O1)	1.489	3.162	2.1
ImagenBase (2B, Res=64)	4	512	16	bf16 (O1)	1.661	3.631	2.2
ImagenBase (400M, Res=256)	4	512	16	bf16 (O1)	1.609	3.166	2.0
ImagenBase (600M, Res=256)	4	2048	64	bf16 (O1)	1.264	2.668	2.1
ImagenBase (600M, Res=1024)	4	2048	64	bf16 (O1)	1.319	2.837	2.2
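
The Speedup column follows directly from the two throughput columns. The sketch below recomputes it for each row and also converts global batches per second into samples per second, on the assumption that one global batch contains exactly "Global Batch Size" samples. The names `h100_speedup`, `samples_per_sec`, and `ROWS` are hypothetical. The numbers are copied from the table above.

```python
from typing import List, Tuple


def h100_speedup(a100_batches_per_sec: float, h100_batches_per_sec: float) -> float:
    """Ratio of H100 throughput to A100 throughput for the same configuration."""
    return h100_batches_per_sec / a100_batches_per_sec


def samples_per_sec(global_batch_size: int, batches_per_sec: float) -> float:
    """Convert global batches per second into samples per second."""
    return global_batch_size * batches_per_sec


# (model, global batch size, A100 global batches/sec, H100 global batches/sec)
ROWS: List[Tuple[str, int, float, float]] = [
    ("ImagenBase (500M, Res=64)", 2048, 1.489, 3.162),
    ("ImagenBase (2B, Res=64)", 512, 1.661, 3.631),
    ("ImagenBase (400M, Res=256)", 512, 1.609, 3.166),
    ("ImagenBase (600M, Res=256)", 2048, 1.264, 2.668),
    ("ImagenBase (600M, Res=1024)", 2048, 1.319, 2.837),
]

if __name__ == "__main__":
    for model, batch_size, a100, h100 in ROWS:
        print(
            f"{model}: {h100_speedup(a100, h100):.1f}x, "
            f"A100 {samples_per_sec(batch_size, a100):.0f} samples/s, "
            f"H100 {samples_per_sec(batch_size, h100):.0f} samples/s"
        )
```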