Multimodal Language Models

The endeavor to extend Language Models (LLMs) into multimodal domains by integrating additional structures like visual encoders has become a focal point of recent research, especially given its potential to significantly lower the cost compared to training multimodal universal models from scratch. Please refer to NeMo Framework User Guide for Multimodal Models for detailed support information.

Previous Resources and Documentation
Next Multimodal Language Model Datasets
© Copyright 2023-2024, NVIDIA. Last updated on Apr 12, 2024.