Multimodal Language Models

Extending Large Language Models (LLMs) into multimodal domains by integrating additional components, such as visual encoders, has become a focal point of recent research, largely because it can cost significantly less than training universal multimodal models from scratch. Refer to the NeMo Framework User Guide for Multimodal Models for detailed support information.

Multimodal Language Model Datasets
Common Configuration Files
Checkpoints
NeVA
