Performance
Training Accuracy Results
Training Accuracy: NVIDIA DGX SuperPOD (8 x 8 x A100 80GB for CLIP B/32 Model)
We followed the training recipe from the OpenCLIP blog to verify our training pipeline. Our results are displayed in the table below:
| Framework | Dataset | Model Name | Batch Size | Samples Seen | ImageNet Top-1 |
|---|---|---|---|---|---|
| OpenCLIP | LAION 400M | B/32 | 32k | 12B | 62.90% |
| NeMo | Our Multimodal Blend* | B/32 | 32k | 12B | 60.13% |
Note: Our multimodal dataset originates from Common Crawl with custom filtering and contains 670M image-caption pairs.
We believe the final accuracy difference is due to the dataset, as LAION 400M is filtered using CLIP scores. To ensure our implementation is consistent with OpenCLIP, we trained OpenCLIP on our dataset and found that the loss curve and validation accuracy were nearly identical to NeMo’s CLIP.
Training Performance Results
We measured the throughput of training CLIP models on different numbers of DGX A100 nodes and DGX H100 nodes, and we achieved near-linear scaling on both platforms.
We compare the out-of-the-box performance of DGX H100 machines against DGX A100 machines with the same configuration. This is an apples-to-apples comparison, ensuring that we evaluate the relative performance of the two machine types under equivalent conditions and configurations.
The tables below show the performance results.
NVIDIA DGX SuperPODs (16 x 8 x A100 80GB for CLIP g/14 model)
| CLIP g/14 (Nodes) | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Samples per Second | 559 | 1115 | 2190 | 4407 | 8633 |
| Perfect Linear Scaling (Samples) | 559 | 1119 | 2237 | 4475 | 8950 |
| Speedup | 1x | 1.99x | 3.92x | 7.88x | 15.43x |
NVIDIA DGX SuperPODs (16 x 8 x H100 80GB for CLIP g/14 model)
| CLIP g/14 (Nodes) | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Samples per Second | 935 | 1795 | 3502 | 6771 | 13829 |
| Perfect Linear Scaling (Samples) | 935 | 1869 | 3739 | 7478 | 14955 |
| Speedup | 1x | 1.92x | 3.75x | 7.24x | 14.8x |
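As a quick sanity check on the figures above, the sketch below recomputes the speedup and scaling efficiency from the measured throughput. The lists are simply the DGX A100 table columns copied by hand, not values queried from NeMo or the cluster; small differences from the reported perfect-scaling column come from the single-node rate being rounded to 559.

```python
# Recompute scaling metrics from the DGX A100 CLIP g/14 table above.
nodes = [1, 2, 4, 8, 16]
samples_per_sec = [559, 1115, 2190, 4407, 8633]  # measured throughput per node count

baseline = samples_per_sec[0]
for n, s in zip(nodes, samples_per_sec):
    perfect = baseline * n      # perfect linear scaling from the 1-node rate
    speedup = s / baseline      # ≈ the "Speedup" row (1x ... 15.43x)
    efficiency = s / perfect    # fraction of perfect linear scaling achieved
    print(f"{n:>2} nodes: {speedup:.2f}x speedup, {efficiency:.1%} scaling efficiency")
```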
DGX A100 vs. DGX H100: A Comparative Analysis of CLIP Training
| Model | Nodes | Global Batch Size | Micro Batch Size | Precision | Global Batch / Sec (A100) | Global Batch / Sec (H100) | Speedup (x) |
|---|---|---|---|---|---|---|---|
| CLIP B/32 | 4 | 16000 | 500 | bf16 (O2) | 2.12 | 5.26 | 2.5 |
| CLIP H/14 | 4 | 3584 | 112 | bf16 (O2) | 0.88 | 1.92 | 2.2 |
| CLIP g/14 | 4 | 2560 | 80 | bf16 (O2) | 0.86 | 2.25 | 2.6 |
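Throughput in this table is reported as global batches per second, so the effective samples-per-second rate is the global batch size times that figure. A minimal sketch of the arithmetic, using the CLIP B/32 row as an example (numbers copied from the table above):

```python
# CLIP B/32 row from the comparison table above (copied by hand, not measured here).
global_batch_size = 16000
a100_batches_per_sec = 2.12
h100_batches_per_sec = 5.26

a100_samples_per_sec = global_batch_size * a100_batches_per_sec  # ≈ 33,920 samples/s
h100_samples_per_sec = global_batch_size * h100_batches_per_sec  # ≈ 84,160 samples/s
speedup = h100_batches_per_sec / a100_batches_per_sec            # ≈ 2.48, reported as 2.5
```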
Inference Performance Results
Latency is measured starting with an image on the CPU and a text input (of length 64) and stopping when the output is produced. For the framework we use Torch Automatic Mixed Precision (AMP) for FP16 computation. For TRT, we export the various models with FP16 acceleration. We use the optimized TRT engine setup present in the deployment directory to obtain the numbers in the same environment as the framework.
GPU: NVIDIA DGX A100 (1x A100 80 GB). Batch size: number of images in a batch.
| Model | Batch Size | TRT FP16 Latency (s) | FW FP16 (AMP) Latency (s) | TRT vs FW Speedup (x) |
|---|---|---|---|---|
| CLIP B/32 | 1 | 0.014 | 0.032 | 2.3 |
| | 2 | 0.014 | 0.033 | 2.4 |
| | 4 | 0.014 | 0.028 | 2.0 |
| | 8 | 0.015 | 0.028 | 1.9 |
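For reference, below is a minimal sketch of how the framework-side (Torch AMP FP16) latency described above could be timed. `model`, `image`, and `text_tokens` are placeholder inputs, and this is illustrative rather than the exact benchmarking harness used to produce the table.

```python
import time
import torch

def measure_latency(model, image, text_tokens, warmup=10, iters=100):
    """Average FP16 (AMP) latency: starts with inputs on CPU, stops on output."""
    model = model.cuda().eval()
    times = []
    with torch.no_grad():
        for i in range(warmup + iters):
            start = time.perf_counter()
            img = image.cuda()           # image begins on the CPU, as described above
            txt = text_tokens.cuda()     # text input of length 64
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                _ = model(img, txt)      # placeholder call; the actual CLIP forward signature may differ
            torch.cuda.synchronize()     # stop timing only once the output is ready
            if i >= warmup:
                times.append(time.perf_counter() - start)
    return sum(times) / len(times)
```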