Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Performance
Training Accuracy Results
Training Accuracy: NVIDIA DGX SuperPOD (8 x 8 x A100 80GB for CLIP B/32 Model)
We followed the training recipe from the OpenCLIP blog to verify our training pipeline. Our results are displayed in the table below:
| Framework | Dataset | Model Name | Batch Size | Samples Seen | ImageNet Top-1 |
|---|---|---|---|---|---|
| OpenCLIP | LAION 400M | B/32 | 32k | 12B | 62.90% |
| NeMo | Our Multimodal Blend* | B/32 | 32k | 12B | 60.13% |
Important
Our multimodal dataset originated from Common Crawl with custom filtering and contains 670M image-caption pairs. We believe the final accuracy difference is due to the dataset, as LAION 400M is filtered with CLIP scores. To ensure our implementation is consistent with OpenCLIP, we trained OpenCLIP with our dataset and found that the loss curve and validation accuracy were nearly identical to those of NeMo CLIP.
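To put the numbers in the table and the note above into perspective, the short sketch below works out the approximate training budget they imply. It is illustrative arithmetic only: it assumes "32k" denotes a global batch size of 32,768 and "12B" denotes 12 × 10⁹ samples seen, and the variable names are not taken from any NeMo or OpenCLIP configuration.

```python
# Back-of-the-envelope accounting for the CLIP B/32 run in the table above.
# Assumed readings of the table: "32k" -> 32,768 samples per optimizer step,
# "12B" -> 12e9 total samples seen; 670M pairs is the blend size from the note.

global_batch_size = 32_768        # image-text pairs per optimizer step (assumed)
samples_seen = 12_000_000_000     # total pairs processed during training
dataset_size = 670_000_000        # pairs in the multimodal blend

optimizer_steps = samples_seen // global_batch_size
passes_over_blend = samples_seen / dataset_size

print(f"optimizer steps:        {optimizer_steps:,}")      # ~366,211
print(f"passes over the blend:  {passes_over_blend:.1f}")  # ~17.9
```

Under these assumptions, both runs in the table correspond to roughly 366K optimizer steps, and the NeMo run revisits each pair in the 670M blend about 18 times.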