Performance

NVIDIA DGX SuperPOD (4 x 8 x A100 80GB for ViT B/16 Model)

We pretrained a ViT B/16 model on the ImageNet 1K dataset and fine-tuned it on the same dataset at a higher resolution, following the recipe outlined in the ViT paper. We achieved a Top-1 accuracy of 79.47%, which is 1.56 percentage points higher than the 77.91% reported in the paper. Below are the highlights of the training and fine-tuning recipe we used; a minimal optimizer sketch follows the list:

  • Model: ViT B/16

  • Dataset: ImageNet 1K

  • Pretraining:
    • Epochs: 300

    • Batch Size: 4096

    • Training Resolution: 224

    • Optimizer: Adam (0.9, 0.999)

    • Base Learning Rate: 3.00E-03

    • Learning Rate Decay: Cosine

    • Weight Decay: 0.3

    • Dropout: 0.1

  • Fine-tuning:
    • Steps: 20,000

    • Batch Size: 512

    • Fine-tuning Resolution: 512

    • Optimizer: SGD (0.9)

    • Base Learning Rate: 0.003 - 0.06

    • Learning Rate Decay: Cosine

    • Weight Decay: 0
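
Within the NeMo framework these hyperparameters are set through the training configs; purely as an illustration, the sketch below wires the pretraining settings into a plain PyTorch optimizer and cosine schedule. The placeholder model and the derived steps-per-epoch value are assumptions, not part of the recipe above.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder model: in the real recipe this is the ViT B/16 built by the
# framework, with dropout 0.1 inside the transformer blocks.
model = torch.nn.Linear(768, 1000)

# Pretraining settings from the list above:
# Adam(0.9, 0.999), base LR 3e-3, weight decay 0.3, cosine LR decay over 300 epochs.
optimizer = Adam(model.parameters(), lr=3e-3, betas=(0.9, 0.999), weight_decay=0.3)

# ImageNet-1K has ~1.28M training images; with a global batch of 4096 that is
# ~312 steps per epoch (a derived value, not listed in the recipe).
steps_per_epoch = 1_281_167 // 4096
scheduler = CosineAnnealingLR(optimizer, T_max=300 * steps_per_epoch)

for _ in range(300 * steps_per_epoch):
    # forward pass, loss, and backward pass go here
    optimizer.step()
    scheduler.step()
```

For the fine-tuning stage, the same pattern applies with SGD (momentum 0.9), zero weight decay, and a 20,000-step cosine schedule, with the base learning rate swept over 0.003 to 0.06.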

We measured the throughput of training Vision Transformer models on different numbers of DGX A100 nodes and DGX H100 nodes, and we achieved near-linear scaling on both platforms.

We also compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines running the same configuration. This is an apples-to-apples comparison: both machine types are evaluated under equivalent conditions and configurations.

The tables and charts below show the performance results.

  • NVIDIA DGX SuperPODs (16 x 8 x A100 80GB for ViT g/14 model)

Nodes                                     | 1    | 2     | 4     | 8     | 16
Samples per Second                        | 693  | 1347  | 2642  | 5454  | 10644
ViT g/14 Perfect Linear Scaling (Samples) | 693  | 1386  | 2773  | 5546  | 11092
Speedup                                   | 1x   | 1.94x | 3.81x | 7.87x | 15.35x

[Chart: ViT g/14 NeMo Megatron Throughput (A100)]
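
The Perfect Linear Scaling and Speedup rows are simple functions of the measured throughput. As a quick check, this sketch reproduces them from the single-node A100 rate; the small deviations from the table values are due to rounding of the reported single-node rate.

```python
# Measured ViT g/14 throughput (samples/sec) on 1, 2, 4, 8, and 16 A100 nodes,
# taken from the table above.
nodes = [1, 2, 4, 8, 16]
measured = [693, 1347, 2642, 5454, 10644]

for n, rate in zip(nodes, measured):
    perfect = measured[0] * n        # perfect linear scaling from the 1-node rate
    speedup = rate / measured[0]     # actual speedup over 1 node
    efficiency = rate / perfect      # scaling efficiency
    print(f"{n:>2} nodes: perfect={perfect:>6}  speedup={speedup:.2f}x  efficiency={efficiency:.1%}")
```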

  • NVIDIA DGX SuperPODs (16 x 8 x H100 80GB for ViT g/14 model)

Nodes                                     | 1    | 2     | 4     | 8     | 16
Samples per Second                        | 1428 | 2846  | 5737  | 11222 | 21561
ViT g/14 Perfect Linear Scaling (Samples) | 1428 | 2856  | 5713  | 11425 | 22851
Speedup                                   | 1x   | 1.99x | 4.02x | 7.86x | 15.1x

[Chart: ViT g/14 NeMo Megatron Throughput (H100)]

  • DGX A100 vs. DGX H100: A Comparative Analysis of Vision Transformer Training

Model       | Nodes | Global Batch Size | Micro Batch Size | Precision | Global Batch / Sec (A100) | Global Batch / Sec (H100) | Speedup (x)
ViT B/16    | 2     | 4096              | 256              | bf16 (O2) | 2.54                      | 5.10                      | 2.2
ViT L/16    | 2     | 4096              | 256              | bf16 (O2) | 1.25                      | 2.71                      | 2.1
ViT H/14    | 4     | 4096              | 128              | bf16 (O2) | 1.06                      | 2.19                      | 2.1
ViT g/14    | 4     | 4096              | 64                | bf16 (O2) | 0.71                      | 1.42                      | 2.2
ViT bigG/14 | 4     | 4096              | 32                | bf16 (O2) | 0.43                      | 0.89                      | 2.1

[Chart: Vision Transformer Training Throughput Comparison (2308)]
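
The comparison table reports throughput in global batches per second; multiplying by the global batch size converts this to images per second, the unit used in the scaling tables above. A minimal sketch of that conversion, using the rows from the table:

```python
# Rows from the comparison table above:
# (model, global batch size, A100 global batches/sec, H100 global batches/sec)
rows = [
    ("ViT B/16",    4096, 2.54, 5.10),
    ("ViT L/16",    4096, 1.25, 2.71),
    ("ViT H/14",    4096, 1.06, 2.19),
    ("ViT g/14",    4096, 0.71, 1.42),
    ("ViT bigG/14", 4096, 0.43, 0.89),
]

for model, gbs, a100, h100 in rows:
    # images/sec = global batches/sec * global batch size
    print(f"{model:<12}  A100: {a100 * gbs:8.0f} img/s   H100: {h100 * gbs:8.0f} img/s")
```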

Latency is measured from an image on the CPU to the model output. For the framework (FW) numbers, we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TensorRT (TRT), we export each model with FP16 acceleration and use the optimized TRT engine setup in the deployment directory, so the numbers are collected in the same environment as the framework.

GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: number of images in a batch

Model    | Batch Size | TRT FP16 Latency (s) | FW FP16 (AMP) Latency (s) | TRT vs. FW Speedup (x)
ViT B/16 | 1          | 0.006                | 0.014                     | 2.3
ViT B/16 | 2          | 0.008                | 0.015                     | 1.9
ViT B/16 | 4          | 0.011                | 0.015                     | 1.4
ViT B/16 | 8          | 0.018                | 0.017                     | 1.0
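
The FW latencies follow the methodology described above: the image starts on the CPU, the clock stops at the model output, and FP16 comes from Torch AMP. The sketch below illustrates that timing loop; torchvision's ViT-B/16 at its default 224x224 input, the warmup count, and the iteration count are stand-ins rather than the deployed NeMo checkpoint or the exact measurement harness.

```python
import time
import torch
from torchvision.models import vit_b_16

# Stand-in model; the reported numbers use the trained NeMo checkpoint instead.
model = vit_b_16().eval().cuda()

def fw_fp16_latency(batch_size, iters=100, warmup=10):
    """Average forward-pass latency: image starts on the CPU, clock stops at the output."""
    image = torch.rand(batch_size, 3, 224, 224)           # batch held on the CPU
    with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
        for _ in range(warmup):                           # warm up kernels before timing
            model(image.cuda())
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(image.cuda())                           # H2D copy + FP16 forward
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for bs in (1, 2, 4, 8):
    print(f"batch {bs}: {fw_fp16_latency(bs):.3f} s")
```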