Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Multimodal Language Models

Extending Large Language Models (LLMs) into multimodal domains by integrating additional structures such as visual encoders has become a focal point of recent research, especially because it can cost significantly less than training universal multimodal models from scratch. Please refer to the NeMo Framework User Guide for Multimodal Models for detailed support information.

Speech-augmented Large Language Models (SpeechLLM)

SpeechLLM extends Large Language Models (LLMs) with the ability to understand speech and audio inputs. Detailed examples can be found in the SpeechLLM example.