Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Performance
Training Accuracy Results
NVIDIA DGX SuperPOD (4 x 8 x A100 80GB for ViT B/16 Model)
We pretrained a ViT B/16 model on the ImageNet 1K dataset and fine-tuned it on the same dataset at a higher resolution, following the recipe outlined in the ViT paper. As a result, we achieved a Top-1 accuracy of 79.47%, which is 1.56 percentage points higher than the 77.91% reported in the paper. Below are the highlights of the training and fine-tuning recipe we used; a sketch of how the pretraining hyperparameters map onto an optimizer and schedule follows the list:
- Model: ViT B/16
- Dataset: ImageNet 1K
- Pretraining:
  - Epochs: 300
  - Batch Size: 4096
  - Training Resolution: 224
  - Optimizer: Adam (0.9, 0.999)
  - Base Learning Rate: 3.00E-03
  - Learning Rate Decay: Cosine
  - Weight Decay: 0.3
  - Dropout: 0.1
- Fine-tuning:
  - Steps: 20,000
  - Batch Size: 512
  - Fine-tuning Resolution: 512
  - Optimizer: SGD (0.9)
  - Base Learning Rate: 0.003 - 0.06
  - Learning Rate Decay: Cosine
  - Weight Decay: 0
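For reference, here is a minimal PyTorch sketch of how the pretraining hyperparameters above could be expressed as an optimizer and learning-rate schedule. This is an illustration only, not the NeMo recipe itself; the placeholder model and the derived step count are assumptions.

```python
import torch

# Placeholder module standing in for the ViT B/16 backbone (assumption).
model = torch.nn.Linear(768, 1000)

# 300 epochs over the ~1.28M ImageNet-1K training images at batch size 4096.
total_steps = 300 * (1_281_167 // 4096)

# Adam (0.9, 0.999), base LR 3e-3, weight decay 0.3 -- the pretraining
# hyperparameters listed above.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-3,
    betas=(0.9, 0.999),
    weight_decay=0.3,
)

# Cosine learning-rate decay over the full pretraining run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
```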
Training Performance Results
We measured the throughput of training Vision Transformer models on different numbers of DGX A100 nodes and DGX H100 nodes, and we achieved near-linear scaling on both platforms.
We also compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines using the same configuration. This is an apples-to-apples comparison: the relative performance of the two machine types is evaluated under equivalent conditions and configurations.
The tables below show the performance results.
NVIDIA DGX SuperPODs (16 x 8 x A100 80GB for ViT g/14 model)
| Model | Metric | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes | 16 Nodes |
|---|---|---|---|---|---|---|
| ViT g/14 | Samples per Second | 693 | 1347 | 2642 | 5454 | 10644 |
| ViT g/14 | Perfect Linear Scaling (Samples) | 693 | 1386 | 2773 | 5546 | 11092 |
| ViT g/14 | Speedup | 1x | 1.94x | 3.81x | 7.87x | 15.35x |
NVIDIA DGX SuperPODs (16 x 8 x H100 80GB for ViT g/14 model)
| Model | Metric | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes | 16 Nodes |
|---|---|---|---|---|---|---|
| ViT g/14 | Samples per Second | 1428 | 2846 | 5737 | 11222 | 21561 |
| ViT g/14 | Perfect Linear Scaling (Samples) | 1428 | 2856 | 5713 | 11425 | 22851 |
| ViT g/14 | Speedup | 1x | 1.99x | 4.02x | 7.86x | 15.1x |
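The Perfect Linear Scaling and Speedup rows above follow directly from the measured throughput: perfect scaling multiplies the single-node samples per second by the node count, and speedup divides the measured throughput by the single-node value. Below is a minimal sketch of that arithmetic using the A100 ViT g/14 measurements; small differences from the published rows come from rounding of the single-node baseline.

```python
# Reproduce the "Perfect Linear Scaling" and "Speedup" rows from the measured
# samples/sec (A100, ViT g/14). Values differ slightly from the table above
# because the published rows use an unrounded single-node baseline.
nodes = [1, 2, 4, 8, 16]
measured_sps = [693, 1347, 2642, 5454, 10644]

baseline = measured_sps[0]
for n, sps in zip(nodes, measured_sps):
    perfect = baseline * n        # ideal linear scaling from one node
    speedup = sps / baseline      # actual speedup over one node
    print(f"{n:>2} nodes: {sps:>6} samples/s, "
          f"perfect {perfect:>6}, speedup {speedup:.2f}x")
```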
DGX A100 vs. DGX H100: A Comparative Analysis of Vision Transformer Training
| Model | Nodes | Global Batch Size | Micro Batch Size | Precision | Global Batch / Sec (A100) | Global Batch / Sec (H100) | Speedup (x) |
|---|---|---|---|---|---|---|---|
| ViT B/16 | 2 | 4096 | 256 | bf16 (O2) | 2.54 | 5.10 | 2.2 |
| ViT L/16 | 2 | 4096 | 256 | bf16 (O2) | 1.25 | 2.71 | 2.1 |
| ViT H/14 | 4 | 4096 | 128 | bf16 (O2) | 1.06 | 2.19 | 2.1 |
| ViT g/14 | 4 | 4096 | 64 | bf16 (O2) | 0.71 | 1.42 | 2.2 |
| ViT bigG/14 | 4 | 4096 | 32 | bf16 (O2) | 0.43 | 0.89 | 2.1 |
Inference Performance Results
Latency is measured from the moment an image is available on the CPU until the model output is produced. For the framework numbers, we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TensorRT (TRT), we export the models with FP16 acceleration and use the optimized TRT engine setup in the deployment directory, so the measurements are taken in the same environment as the framework.
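As an illustration of the framework measurement path described above, latency can be timed from a CPU-resident image to the model output as in the sketch below. This is a minimal sketch, not the NeMo deployment scripts; the torchvision ViT-B/16 model and the warm-up iteration count are stand-in assumptions.

```python
import time
import torch
import torchvision

# Stand-in ViT-B/16; the published numbers use the NeMo-trained checkpoint.
model = torchvision.models.vit_b_16().eval().cuda()
image = torch.randn(1, 3, 224, 224)  # batch of 1, starting on the CPU

with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    for _ in range(10):               # warm-up, not timed
        model(image.cuda())
    torch.cuda.synchronize()

    start = time.perf_counter()
    output = model(image.cuda())      # host-to-device copy + forward pass
    torch.cuda.synchronize()          # stop only once the output is ready
    latency = time.perf_counter() - start

print(f"FW FP16 (AMP) latency: {latency:.3f} s")
```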
GPU: NVIDIA DGX A100 (1x A100 80 GB)
Batch Size: number of images in a batch
| Model | Batch Size | TRT FP16 Latency (s) | FW FP16 (AMP) Latency (s) | TRT vs FW Speedup (x) |
|---|---|---|---|---|
| ViT B/16 | 1 | 0.006 | 0.014 | 2.3 |
| ViT B/16 | 2 | 0.008 | 0.015 | 1.9 |
| ViT B/16 | 4 | 0.011 | 0.015 | 1.4 |
| ViT B/16 | 8 | 0.018 | 0.017 | 1.0 |