Vision Language Models (VLMs)

View as Markdown

Introduction

Vision Language Models (VLMs) integrate vision and language processing capabilities, enabling models to understand images and generate text descriptions, answer visual questions, and perform multimodal reasoning.

NeMo AutoModel LLM APIs can be easily extended to support VLM tasks. While most of the training setup is the same as for LLMs, some additional steps are required to prepare the data and model for VLM training.

Run VLMs with NeMo AutoModel

To run VLMs with NeMo AutoModel, use NeMo container version 26.04.00 or later. If the model you want to fine-tune requires a newer version of Transformers, you may need to upgrade:

$pip3 install --upgrade git+git@github.com:NVIDIA-NeMo/AutoModel.git

For other installation options, see our Installation Guide.

Supported Models

NeMo AutoModel supports AutoModelForImageTextToText in the Image-Text-to-Text category.

OwnerModelArchitectures
Moonshot AIKimi-VLKimiVLForConditionalGeneration
GoogleGemma 3 VL / Gemma 3nGemma3ForConditionalGeneration
GoogleGemma 4Gemma4ForConditionalGeneration
Qwen / Alibaba CloudQwen2.5-VLQwen2VLForConditionalGeneration, Qwen2_5VLForConditionalGeneration
Qwen / Alibaba CloudQwen3-VL / Qwen3-VL-MoEQwen3VLForConditionalGeneration
Qwen / Alibaba CloudQwen3.5-VLQwen3_5VLForConditionalGeneration, Qwen3_5MoeVLForConditionalGeneration
NVIDIANemotron-ParseNemotronParseForConditionalGeneration
Mistral AIMinistral3 VLMistral3ForConditionalGeneration
Mistral AIMistral-Small-4MistralForConditionalGeneration
Mistral AIMistral Medium 3.5Mistral3ForConditionalGeneration (FP8)
InternLM / Shanghai AI LabInternVLInternVLForConditionalGeneration
MetaLlama 4Llama4ForConditionalGeneration
HuggingFaceSmolVLMSmolVLMForConditionalGeneration
LLaVALLaVALlavaForConditionalGeneration, LlavaNextForConditionalGeneration, LlavaNextVideoForConditionalGeneration, LlavaOnevisionForConditionalGeneration
lmms-labLLaVA-OneVision 1.5LlavaOneVisionForConditionalGeneration

Fine-Tuning

All supported models can be fine-tuned using either full SFT or PEFT (LoRA) approaches. See the Gemma 3 Fine-Tuning Guide for a complete walkthrough covering dataset preparation, configuration, and multi-GPU training.

In these guides, we use the quintend/rdr-items and naver-clova-ix/cord-v2 datasets for demonstration purposes. Update the recipe YAML dataset section to use your own data. See VLM datasets and dataset overview.