
Performance#

Training Accuracy Results#

Training Accuracy: NVIDIA DGX SuperPOD (8 x 8 x A100 80GB for CLIP B/32 Model)

We followed the training recipe from the OpenCLIP blog to verify our training pipeline. Our results are shown in the table below:

| Framework | Dataset               | Model Name | Batch Size | Samples Seen | ImageNet Top-1 |
|-----------|------------------------|------------|------------|--------------|----------------|
| OpenCLIP  | LAION 400M             | B/32       | 32k        | 12B          | 62.90%         |
| NeMo      | Our Multimodal Blend*  | B/32       | 32k        | 12B          | 60.13%         |

Note

Our multimodal dataset originated from Common Crawl with custom filtering and contains 670M image-caption pairs.
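
For context, the batch-size and samples-seen columns determine the length of the run. Below is a minimal arithmetic sketch, assuming the "32k" global batch size means 32,768 samples per optimizer step (the exact value is an assumption, not stated in the table):

```python
# Arithmetic implied by the table above.
# Assumption: the "32k" batch size is a global batch of 32,768 samples per step.
samples_seen = 12_000_000_000   # 12B samples, from the "Samples Seen" column
global_batch_size = 32_768      # assumed meaning of "32k"

steps = samples_seen / global_batch_size
print(f"optimizer steps: {steps:,.0f}")                            # ~366,211
print(f"passes over LAION 400M: {samples_seen / 400e6:.0f}")       # ~30
print(f"passes over the 670M blend: {samples_seen / 670e6:.0f}")   # ~18
```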

Important

We believe the final accuracy difference is due to the dataset: LAION 400M is filtered with CLIP scores, whereas our multimodal blend uses the custom filtering described above. To confirm that our implementation is consistent with OpenCLIP, we trained OpenCLIP on our dataset and found that the loss curve and validation accuracy were nearly identical to NeMo's CLIP.
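
The ImageNet Top-1 numbers above are zero-shot classification accuracies. For reference, the sketch below shows how such an evaluation is commonly run with the open_clip library; it is not the code either framework used to produce the numbers above. The pretrained tag, the dataset path, the single prompt template, and the `imagenet_classnames` list are illustrative assumptions (published CLIP numbers are typically computed with a larger prompt ensemble):

```python
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageNet

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pretrained B/32 checkpoint (illustrative tag; any ViT-B-32 weights work).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

# imagenet_classnames: the 1,000 ImageNet class names, assumed to be provided
# in the same order as the dataset's label indices.
prompts = [f"a photo of a {name}" for name in imagenet_classnames]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Zero-shot classification: the nearest text embedding wins.
val_set = ImageNet(root="/path/to/imagenet", split="val", transform=preprocess)
loader = DataLoader(val_set, batch_size=256, num_workers=8)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"zero-shot ImageNet top-1: {correct / total:.2%}")
```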