Performance

Training Accuracy Results

NVIDIA DGX SuperPOD (4 x 8 x A100 80GB for ViT B/16 Model)

We pretrained a ViT B/16 model on the ImageNet 1K dataset and fine-tuned it on the same dataset at a higher resolution, following the recipe outlined in the ViT paper. This run reached a Top-1 accuracy of 79.47%, which is 1.56 percentage points higher than the 77.91% reported in the paper. Below are the highlights of the training and fine-tuning recipe we used; a PyTorch sketch of the pretraining optimizer settings follows the list:

  • Model: ViT B/16

  • Dataset: ImageNet 1K

  • Pretraining:
    • Epochs: 300

    • Batch Size: 4096

    • Training Resolution: 224

    • Optimizer: Adam (0.9, 0.999)

    • Base Learning Rate: 3.00E-03

    • Learning Rate Decay: Cosine

    • Weight Decay: 0.3

    • Dropout: 0.1

  • Fine-tuning:
    • Steps: 20,000

    • Batch Size: 512

    • Fine-tuning Resolution: 512

    • Optimizer: SGD (0.9)

    • Base Learning Rate: 0.003 - 0.06

    • Learning Rate Decay: Cosine

    • Weight Decay: 0
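
For reference, here is a minimal PyTorch sketch of the pretraining optimizer and learning-rate schedule listed above. The `model` stand-in, the ImageNet-1K image count, and the warmup-free cosine schedule are illustrative assumptions, not the exact NeMo configuration.

```python
# Minimal sketch of the pretraining optimizer/schedule from the recipe above.
# Assumptions (not the exact NeMo config): `model` is a stand-in module,
# ImageNet-1K has ~1,281,167 training images, and warmup is omitted.
import torch

model = torch.nn.Linear(768, 1000)        # stand-in for the ViT B/16 backbone + head
epochs = 300                              # pretraining epochs
steps_per_epoch = 1_281_167 // 4096       # ImageNet-1K images / global batch size 4096

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-3,                              # base learning rate 3.00E-03
    betas=(0.9, 0.999),                   # Adam (0.9, 0.999)
    weight_decay=0.3,                     # weight decay 0.3
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=epochs * steps_per_epoch,       # cosine decay over the full pretraining run
)
```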

Training Performance Results

We measured the throughput of training Vision Transformer models on different numbers of DGX A100 nodes and DGX H100 nodes, and we achieved near-linear scaling on both platforms.

We compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines running the same configuration. This is an apples-to-apples comparison: both machine types are evaluated under equivalent conditions and configurations.

The tables and charts below show the performance results.

  • NVIDIA DGX SuperPODs (16 x 8 x A100 80GB for ViT g/14 model)

  Nodes                             1     2      4      8      16
  --------------------------------  ----  -----  -----  -----  ------
  Samples per Second                693   1347   2642   5454   10644
  Perfect Linear Scaling (Samples)  693   1386   2773   5546   11092
  Speedup                           1x    1.94x  3.81x  7.87x  15.35x

  ViT g/14 NeMo Throughput (A100)

  • NVIDIA DGX SuperPODs (16 x 8 x H100 80GB for ViT g/14 model)

  Nodes                             1     2      4      8      16
  --------------------------------  ----  -----  -----  -----  ------
  Samples per Second                1428  2846   5737   11222  21561
  Perfect Linear Scaling (Samples)  1428  2856   5713   11425  22851
  Speedup                           1x    1.99x  4.02x  7.86x  15.1x

  ViT g/14 NeMo Megatron Throughput (H100)

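The speedup rows in the two scaling tables above follow directly from the measured throughputs. The sketch below recomputes them, plus a scaling-efficiency figure, from the samples-per-second values; small differences from the tables are rounding.

```python
# Recompute speedup and scaling efficiency from the measured throughputs
# in the A100 and H100 tables above (values copied from the tables).
a100 = {1: 693, 2: 1347, 4: 2642, 8: 5454, 16: 10644}    # samples/sec per node count
h100 = {1: 1428, 2: 2846, 4: 5737, 8: 11222, 16: 21561}

def scaling_report(samples_per_sec):
    base = samples_per_sec[1]                            # single-node throughput
    for nodes, throughput in sorted(samples_per_sec.items()):
        speedup = throughput / base                      # vs. the 1-node run
        perfect = base * nodes                           # perfect linear scaling
        efficiency = throughput / perfect                # fraction of linear scaling achieved
        print(f"{nodes:>2} nodes: {speedup:5.2f}x speedup, {efficiency:6.1%} efficiency")

scaling_report(a100)
scaling_report(h100)
```
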
  • DGX A100 vs. DGX H100: A Comparative Analysis of Vision Transformer Training

  Model        Nodes  Global Batch Size  Micro Batch Size  Precision  Global Batch / Sec (A100)  Global Batch / Sec (H100)  Speedup (x)
  -----------  -----  -----------------  ----------------  ---------  -------------------------  -------------------------  -----------
  ViT B/16     2      4096               256               bf16 (O2)  2.54                       5.10                       2.2
  ViT L/16     2      4096               256               bf16 (O2)  1.25                       2.71                       2.1
  ViT H/14     4      4096               128               bf16 (O2)  1.06                       2.19                       2.1
  ViT g/14     4      4096               64                bf16 (O2)  0.71                       1.42                       2.2
  ViT bigG/14  4      4096               32                bf16 (O2)  0.43                       0.89                       2.1

Vision Transformer Training Throughput Comparison
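
The comparison table reports throughput in global batches per second. Multiplying by the global batch size, which is 4096 for every row, converts it to images per second, which can be easier to compare across models. The sketch below is just that arithmetic, using values copied from the table.

```python
# Convert "Global Batch / Sec" into images per second (global batch size is 4096
# for every row in the comparison table above). Values copied from the table.
GLOBAL_BATCH_SIZE = 4096

global_batches_per_sec = {
    # model: (A100, H100)
    "ViT B/16":    (2.54, 5.10),
    "ViT L/16":    (1.25, 2.71),
    "ViT H/14":    (1.06, 2.19),
    "ViT g/14":    (0.71, 1.42),
    "ViT bigG/14": (0.43, 0.89),
}

for model, (a100, h100) in global_batches_per_sec.items():
    print(f"{model:<12} A100: {a100 * GLOBAL_BATCH_SIZE:7.0f} img/s   "
          f"H100: {h100 * GLOBAL_BATCH_SIZE:7.0f} img/s")
```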

Inference Performance Results

Latency is measured from the moment an image batch is available on the CPU until the model output is produced. For the framework numbers, we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TensorRT (TRT), we export the models with FP16 acceleration and use the optimized TRT engine setup in the deployment directory, so both sets of numbers are collected in the same environment.
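
A minimal sketch of the framework-side measurement described above, assuming a stand-in model and input (not the actual ViT deployment code): the timer starts with the batch on the CPU and stops once the output is ready on the GPU.

```python
# Sketch of the framework (PyTorch AMP FP16) latency measurement described above.
# `model` and the flattened input are illustrative stand-ins, not the real ViT path.
import time
import torch

model = torch.nn.Linear(3 * 224 * 224, 1000).cuda().eval()   # stand-in classifier
batch = torch.randn(1, 3 * 224 * 224)                        # batch starts on the CPU

with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    start = time.perf_counter()
    output = model(batch.cuda())       # includes the host-to-device copy
    torch.cuda.synchronize()           # wait until the output is actually computed
    latency = time.perf_counter() - start

print(f"FW FP16 (AMP) latency: {latency:.3f} s")
```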

GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: number of images in a batch

Model     Batch Size  TRT FP16 Latency (s)  FW FP16 (AMP) Latency (s)  TRT vs FW Speedup (x)
--------  ----------  --------------------  -------------------------  ---------------------
ViT B/16  1           0.006                 0.014                      2.3
ViT B/16  2           0.008                 0.015                      1.9
ViT B/16  4           0.011                 0.015                      1.4
ViT B/16  8           0.018                 0.017                      1.0