Performance

NVIDIA DGX SuperPOD (4 x 8 x A100 80GB for ViT B/16 Model)

We pretrained a ViT B/16 model on the ImageNet 1K dataset and fine-tuned it on the same dataset at a higher resolution, following the recipe outlined in the ViT paper. We achieved a Top-1 accuracy of 79.47%, which is 1.56% higher than the 77.91% reported in the paper. The highlights of the training and fine-tuning recipe we used are:

  • Model: ViT B/16

  • Dataset: ImageNet 1K

  • Pretraining:
    • Epochs: 300

    • Batch Size: 4096

    • Training Resolution: 224

    • Optimizer: Adam (0.9, 0.999)

    • Base Learning Rate: 3.00E-03

    • Learning Rate Decay: Cosine

    • Weight Decay: 0.3

    • Dropout: 0.1

  • Fine-tuning:
    • Steps: 20,000

    • Batch Size: 512

    • Fine-tuning Resolution: 512

    • Optimizer: SGD (0.9)

    • Base Learning Rate: 0.003 - 0.06

    • Learning Rate Decay: Cosine

    • Weight Decay: 0
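
Both phases of the recipe above use cosine learning-rate decay. As a minimal pure-Python sketch of that schedule, using the pretraining numbers listed above (300 epochs, global batch size 4096, base LR 3e-3; warmup, which ViT pretraining typically also uses, is omitted here for brevity):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Pretraining numbers from the recipe above: 300 epochs of ImageNet-1K
# (1,281,167 training images) at a global batch size of 4096, base LR 3e-3.
steps_per_epoch = 1281167 // 4096   # ~312 optimizer steps per epoch
total_steps = 300 * steps_per_epoch

lr_first = cosine_lr(0, total_steps, 3e-3)            # full base LR at the start
lr_last = cosine_lr(total_steps, total_steps, 3e-3)   # decays toward 0 at the end
```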

We measured the throughput of training Vision Transformer models on different numbers of DGX A100 nodes and DGX H100 nodes, and we achieved near-linear scaling on both platforms.

We also compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines with the same configuration. This is an apples-to-apples comparison: both machine types are evaluated under equivalent conditions and configurations.

The tables and charts below show the performance results.

  • NVIDIA DGX SuperPODs (16 x 8 x A100 80GB for ViT g/14 model)

Nodes                               1      2      4      8      16
Samples per Second                  693    1347   2642   5454   10644
Perfect Linear Scaling (Samples)    693    1386   2773   5546   11092
Speedup                             1x     1.94x  3.81x  7.87x  15.35x

[Figure: ViT g/14 NeMo Megatron Throughput (A100)]
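
The speedup and perfect-scaling rows are derived directly from the single-node throughput. A quick sketch of that arithmetic (note that the published perfect-scaling figures appear to use an unrounded single-node number, so exact integer multiples of 693 differ from them by a few samples):

```python
nodes = [1, 2, 4, 8, 16]
samples_per_sec = [693, 1347, 2642, 5454, 10644]  # measured A100 throughput from the table

base = samples_per_sec[0]
# Perfect linear scaling computed from the rounded single-node figure; the
# published row (693, 1386, 2773, 5546, 11092) evidently used an unrounded base.
perfect = [base * n for n in nodes]
speedup = [round(s / base, 2) for s in samples_per_sec]
efficiency = [s / (base * n) for s, n in zip(samples_per_sec, nodes)]  # fraction of linear
```

At 16 nodes the measured throughput is still about 96% of perfect linear scaling, which is what "near-linear" refers to above.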

  • NVIDIA DGX SuperPODs (16 x 8 x H100 80GB for ViT g/14 model)

Nodes                               1      2      4      8      16
Samples per Second                  1428   2846   5737   11222  21561
Perfect Linear Scaling (Samples)    1428   2856   5713   11425  22851
Speedup                             1x     1.99x  4.02x  7.86x  15.1x

[Figure: ViT g/14 NeMo Megatron Throughput (H100)]

  • DGX A100 vs. DGX H100: A Comparative Analysis of Vision Transformer Training

Model        Nodes  Global Batch Size  Micro Batch Size  Precision   Global Batch / Sec (A100)  Global Batch / Sec (H100)  Speedup (x)
ViT B/16     2      4096               256               bf16 (O2)   2.54                       5.10                       2.2
ViT L/16     2      4096               256               bf16 (O2)   1.25                       2.71                       2.1
ViT H/14     4      4096               128               bf16 (O2)   1.06                       2.19                       2.1
ViT g/14     4      4096               64                bf16 (O2)   0.71                       1.42                       2.2
ViT bigG/14  4      4096               32                bf16 (O2)   0.43                       0.89                       2.1

[Figure: Vision Transformer Training Throughput Comparison (2308)]
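
Since throughput here is reported in global batches per second, multiplying by the global batch size of 4096 converts it to images per second. A quick check against the ViT B/16 row:

```python
global_batch_size = 4096

# ViT B/16 row from the table above (2 nodes, bf16 O2).
a100_batches_per_sec = 2.54
h100_batches_per_sec = 5.10

# Images per second = global batches per second * images per global batch.
a100_images_per_sec = a100_batches_per_sec * global_batch_size  # ~10404 images/s
h100_images_per_sec = h100_batches_per_sec * global_batch_size  # ~20890 images/s
```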

Latency is measured from the moment an image is available on the CPU until the model produces its output. For the framework (FW) numbers, we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TensorRT (TRT), we export the models with FP16 acceleration and use the optimized TRT engine setup present in the deployment directory, so both sets of numbers are collected in the same environment.

GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: Number of Images in a Batch

Model     Batch Size  TRT FP16 Latency (s)  FW FP16 (AMP) Latency (s)  TRT vs FW Speedup (x)
ViT B/16  1           0.006                 0.014                      2.3
ViT B/16  2           0.008                 0.015                      1.9
ViT B/16  4           0.011                 0.015                      1.4
ViT B/16  8           0.018                 0.017                      1.0
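
The framework latency measurement described above (timer starts with the image on the CPU, stops at the output) can be sketched with Torch AMP roughly as follows. The tiny stand-in model and shapes are placeholders, not the actual ViT, and a single measurement like this is noisy; in practice one would warm up the model and average over many iterations.

```python
import time
import torch

# Tiny placeholder standing in for a ViT; any torch.nn.Module is measured the same way.
model = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1),  # collapse the spatial dimensions
    torch.nn.Flatten(),
    torch.nn.Linear(3, 1000),       # stand-in classifier head (1000 ImageNet classes)
)
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
# CUDA autocast uses float16, as in the FW FP16 (AMP) column; CPU autocast needs bfloat16.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
model = model.to(device)

image = torch.rand(1, 3, 224, 224)  # batch of 1 image, starting on the CPU

with torch.no_grad():
    start = time.perf_counter()
    x = image.to(device)  # the CPU -> device copy is part of the measured latency
    with torch.autocast(device_type=device, dtype=amp_dtype):
        out = model(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU before stopping the timer
    latency = time.perf_counter() - start
```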
© Copyright 2023-2024, NVIDIA. Last updated on Feb 22, 2024.