Vision Language Models (VLMs)#
Introduction#
Vision Language Models (VLMs) integrate vision and language processing capabilities, enabling models to understand images and generate text descriptions, answer visual questions, and perform multimodal reasoning.
NeMo AutoModel LLM APIs can be easily extended to support VLM tasks. While most of the training setup is the same as for LLMs, some additional steps are required to prepare the data and model for VLM training.
Run VLMs with NeMo AutoModel#
To run VLMs with NeMo AutoModel, use NeMo container version 25.11.00 or later. If the model you want to fine-tune requires a newer version of Transformers, you may need to upgrade NeMo AutoModel from source:
pip3 install --upgrade git+https://github.com/NVIDIA-NeMo/AutoModel.git
For other installation options, see our Installation Guide.
Supported Models#
NeMo AutoModel supports models in the Hugging Face Image-Text-to-Text category, loaded through the AutoModelForImageTextToText class.
| Owner | Model | Architectures |
|---|---|---|
| Moonshot AI | | |
| Qwen / Alibaba Cloud | | |
| Qwen / Alibaba Cloud | | |
| Qwen / Alibaba Cloud | | |
| NVIDIA | | |
| Mistral AI | | |
| Mistral AI | | |
| InternLM / Shanghai AI Lab | | |
| Meta | | |
| HuggingFace | | |
| LLaVA | | |
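Models in this category follow the standard Hugging Face image-text-to-text interface. The sketch below shows the chat-message structure that `processor.apply_chat_template()` expects; the model name, image URL, and the commented-out inference lines are illustrative placeholders, not endorsements of a specific checkpoint.

```python
# The chat message is plain data: each user turn mixes "image" and "text"
# content parts. This structure is what AutoProcessor.apply_chat_template()
# consumes for image-text-to-text models.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/cat.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Typical inference flow (placeholder checkpoint name, shown for context):
# from transformers import AutoModelForImageTextToText, AutoProcessor
# processor = AutoProcessor.from_pretrained("org/some-vlm")
# model = AutoModelForImageTextToText.from_pretrained("org/some-vlm")
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt",
# )
# generated = model.generate(**inputs, max_new_tokens=64)
# print(processor.batch_decode(generated, skip_special_tokens=True)[0])

print(messages[0]["content"][1]["text"])
```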
Fine-Tuning#
All supported models can be fine-tuned with either full-parameter supervised fine-tuning (SFT) or parameter-efficient fine-tuning (PEFT) with LoRA. See the Gemma 3 Fine-Tuning Guide for a complete walkthrough covering dataset preparation, configuration, and multi-GPU training.
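To see why LoRA is parameter-efficient, here is a minimal pure-Python sketch of the core idea (not NeMo AutoModel code): the frozen weight W is augmented with a low-rank update scaled by alpha/r, and because B is zero-initialized, the adapted layer starts out identical to the base layer.

```python
# Minimal LoRA forward pass: y = W @ x + (alpha / r) * B @ (A @ x).
# Only A and B (the small low-rank factors) would be trained; W stays frozen.

def matmul(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def lora_forward(W, A, B, x, alpha=16, r=1):
    base = matmul(W, x)              # frozen path: W @ x
    delta = matmul(B, matmul(A, x))  # low-rank path: B @ (A @ x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (identity, for clarity)
A = [[0.1, 0.2]]              # r x d_in  down-projection (r = 1)
B = [[0.0], [0.0]]            # d_out x r up-projection, zero-initialized

x = [3.0, 4.0]
print(lora_forward(W, A, B, x))  # → [3.0, 4.0]: equals W @ x at init
```

Because B starts at zero, fine-tuning begins exactly at the pretrained model's behavior and only drifts as the small A/B factors are updated.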
Tip
In these guides, we use the quintend/rdr-items and naver-clova-ix/cord-v2 datasets for demonstration purposes. Update the dataset section of the recipe YAML to point to your own data. See VLM datasets and the dataset overview.
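As a rough illustration of the kind of edit involved, a dataset section might look like the fragment below. The key names here are assumptions based on typical recipe layouts, not the exact NeMo AutoModel schema; copy the dataset section from a shipped recipe and change only the values.

```yaml
# Illustrative only -- key names are assumptions; check a shipped recipe YAML.
dataset:
  path: naver-clova-ix/cord-v2   # swap in your own HF dataset ID or local path
  split: train
```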