Vision-Language Foundation
Humans naturally process information through multiple senses, such as sight and sound. Similarly, multimodal learning aims to build models that handle several data types, including images, text, and audio. Models that combine vision and language, such as OpenAI’s CLIP, are increasingly common. They excel at tasks like aligning image and text features, image captioning, and visual question answering, and their ability to generalize without task-specific training makes them useful in many practical settings. Please refer to the NeMo Framework User Guide for Multimodal Models for detailed support information.
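To make the idea of aligning image and text features concrete, the sketch below shows a CLIP-style symmetric contrastive objective in plain PyTorch. It is illustrative only, not NeMo's implementation: the function name `clip_contrastive_loss`, the temperature value, and the random 512-dimensional stand-in features are assumptions chosen for the example.

```python
# Minimal sketch of a CLIP-style contrastive objective that aligns image and
# text features in a shared embedding space (illustrative, not NeMo's API).
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product becomes cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix; diagonal entries correspond to matching pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matching image/text pairs together and push mismatched pairs apart,
    # averaging the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Stand-in encoder outputs for a batch of 8 image-caption pairs.
image_features = torch.randn(8, 512)
text_features = torch.randn(8, 512)
print(clip_contrastive_loss(image_features, text_features))
```

Because the loss only requires paired embeddings, the same recipe scales to web-sized image-text datasets, which is what gives CLIP-like models their zero-shot generalization.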