Vision Language Models (VLMs)#

Introduction#

Vision Language Models (VLMs) integrate vision and language processing capabilities, enabling models to understand images and generate text descriptions, answer visual questions, and perform multimodal reasoning.

NeMo AutoModel LLM APIs can be easily extended to support VLM tasks. While most of the training setup is the same as for LLMs, some additional steps are required to prepare the data and model for VLM training.

Run VLMs with NeMo AutoModel#

To run VLMs with NeMo AutoModel, use NeMo container version 25.11.00 or later. If the model you want to fine-tune requires a newer version of Transformers, you may need to upgrade:

pip3 install --upgrade git+https://github.com/NVIDIA-NeMo/AutoModel.git

For other installation options, see our Installation Guide.
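Before launching a run, you can confirm that the installed Transformers release satisfies the model card's requirement. The helper below is a minimal stdlib-only sketch; the minimum version `4.49.0` is an illustrative placeholder, not a requirement stated by NeMo AutoModel — substitute the version from your model's card.

```python
from importlib import metadata

def meets_minimum(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically (pre-release tags ignored)."""
    def parts(v: str):
        return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    return parts(installed) >= parts(required)

# "4.49.0" is a placeholder; use the minimum version from the model card.
try:
    installed = metadata.version("transformers")
    print(f"transformers {installed}: ok={meets_minimum(installed, '4.49.0')}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```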

Supported Models#

NeMo AutoModel supports models in the Hugging Face Image-Text-to-Text category, loaded through the AutoModelForImageTextToText class.

| Owner | Model | Architectures |
|---|---|---|
| Moonshot AI | Kimi-VL | KimiVLForConditionalGeneration |
| Google | Gemma 3 VL / Gemma 3n | Gemma3ForConditionalGeneration |
| Qwen / Alibaba Cloud | Qwen2.5-VL | Qwen2VLForConditionalGeneration, Qwen2_5VLForConditionalGeneration |
| Qwen / Alibaba Cloud | Qwen3-VL / Qwen3-VL-MoE | Qwen3VLForConditionalGeneration |
| Qwen / Alibaba Cloud | Qwen3.5-VL | Qwen3_5VLForConditionalGeneration, Qwen3_5MoeVLForConditionalGeneration |
| NVIDIA | Nemotron-Parse | NemotronParseForConditionalGeneration |
| Mistral AI | Ministral3 VL | Mistral3ForConditionalGeneration |
| Mistral AI | Mistral-Small-4 | MistralForConditionalGeneration |
| InternLM / Shanghai AI Lab | InternVL | InternVLForConditionalGeneration |
| Meta | Llama 4 | Llama4ForConditionalGeneration |
| HuggingFace | SmolVLM | SmolVLMForConditionalGeneration |
| LLaVA | LLaVA | LlavaForConditionalGeneration, LlavaNextForConditionalGeneration, LlavaNextVideoForConditionalGeneration, LlavaOnevisionForConditionalGeneration |
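All of these models load through the standard Hugging Face `AutoModelForImageTextToText` and `AutoProcessor` classes. The snippet below sketches the multimodal chat-message structure that the processors' `apply_chat_template` method consumes; the model ID, image URL, and prompt text are placeholders, not values prescribed by NeMo AutoModel.

```python
# Chat-style message layout consumed by Hugging Face AutoProcessor.apply_chat_template
# for image-text-to-text models. The image URL and prompt are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/sample.png"},
            {"type": "text", "text": "What items are shown in this image?"},
        ],
    }
]

# Loading then looks like (requires the `transformers` package and network access):
#   from transformers import AutoModelForImageTextToText, AutoProcessor
#   model = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
#   processor = AutoProcessor.from_pretrained("google/gemma-3-4b-it")
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True,
#       return_dict=True, return_tensors="pt",
#   )
```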

Fine-Tuning#

All supported models can be fine-tuned using either full SFT or PEFT (LoRA) approaches. See the Gemma 3 Fine-Tuning Guide for a complete walkthrough covering dataset preparation, configuration, and multi-GPU training.
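A fine-tuning run is driven by a recipe YAML. The fragment below is only an illustrative sketch of what a PEFT section might look like — the field names here are assumptions, not the exact AutoModel schema; start from a recipe file shipped with the repository and edit it rather than writing one from scratch.

```yaml
# Illustrative recipe fragment (field names are assumptions, not the exact schema).
model:
  pretrained_model_name_or_path: google/gemma-3-4b-it   # any supported VLM checkpoint

peft:
  peft_scheme: lora   # omit or disable for full SFT
  lora_rank: 16
  lora_alpha: 32
```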

Tip

In these guides, we use the quintend/rdr-items and naver-clova-ix/cord-v2 datasets for demonstration purposes. To train on your own data, update the dataset section of the recipe YAML. See VLM datasets and the dataset overview.
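Whichever dataset the recipe points at, each record ultimately has to be mapped into a chat-message format the processor understands. A minimal, hypothetical mapping for an image-captioning record — the `image` and `text` field names and the prompt are assumptions about your data, not a fixed schema:

```python
def to_messages(example: dict) -> dict:
    """Map a raw record with hypothetical `image`/`text` fields to chat messages."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": example["image"]},
                    {"type": "text", "text": "Describe this image."},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": example["text"]}],
            },
        ]
    }

sample = {"image": "item_001.png", "text": "A red backpack on a white background."}
converted = to_messages(sample)
print(converted["messages"][1]["content"][0]["text"])
```

With the Hugging Face `datasets` library, a function like this is typically applied with `dataset.map(to_messages)`.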