Multimodal Language Models

Extending large language models (LLMs) to multimodal domains by integrating additional components, such as visual encoders, has become a focus of recent research, largely because it can cost significantly less than training a universal multimodal model from scratch. Refer to the NeMo Framework User Guide for Multimodal Models for detailed support information.

Speech-augmented Large Language Models (SpeechLLM)

SpeechLLM extends large language models (LLMs) with the ability to understand speech and audio inputs. See the SpeechLLM example for details.