Vision Language Models (VLMs)

Introduction

Vision Language Models (VLMs) integrate vision and language processing capabilities, enabling models to understand images and generate text descriptions, answer visual questions, and perform multimodal reasoning.

NeMo AutoModel LLM APIs can be easily extended to support VLM tasks. While most of the training setup is the same as for LLMs, some additional steps are required to prepare the data and model for VLM training.

Run VLMs with NeMo AutoModel

To run VLMs with NeMo AutoModel, use NeMo container version 26.04.00 or later. If the model you want to fine-tune requires a newer version of Transformers, you may need to upgrade:

$ pip3 install --upgrade git+git@github.com:NVIDIA-NeMo/AutoModel.git

For other installation options, see our Installation Guide.

Supported Models

NeMo AutoModel supports AutoModelForImageTextToText in the Image-Text-to-Text category.

Owner	Model	Architectures
Moonshot AI	Kimi-VL	`KimiVLForConditionalGeneration`
Google	Gemma 3 VL / Gemma 3n	`Gemma3ForConditionalGeneration`
Google	Gemma 4	`Gemma4ForConditionalGeneration`
Qwen / Alibaba Cloud	Qwen2.5-VL	`Qwen2VLForConditionalGeneration`, `Qwen2_5VLForConditionalGeneration`
Qwen / Alibaba Cloud	Qwen3-VL / Qwen3-VL-MoE	`Qwen3VLForConditionalGeneration`
Qwen / Alibaba Cloud	Qwen3.5-VL	`Qwen3_5VLForConditionalGeneration`, `Qwen3_5MoeVLForConditionalGeneration`
NVIDIA	Nemotron-Parse	`NemotronParseForConditionalGeneration`
Mistral AI	Ministral3 VL	`Mistral3ForConditionalGeneration`
Mistral AI	Mistral-Small-4	`MistralForConditionalGeneration`
Mistral AI	Mistral Medium 3.5	`Mistral3ForConditionalGeneration` (FP8)
InternLM / Shanghai AI Lab	InternVL	`InternVLForConditionalGeneration`
Meta	Llama 4	`Llama4ForConditionalGeneration`
HuggingFace	SmolVLM	`SmolVLMForConditionalGeneration`
LLaVA	LLaVA	`LlavaForConditionalGeneration`, `LlavaNextForConditionalGeneration`, `LlavaNextVideoForConditionalGeneration`, `LlavaOnevisionForConditionalGeneration`
lmms-lab	LLaVA-OneVision 1.5	`LlavaOneVisionForConditionalGeneration`

Fine-Tuning

All supported models can be fine-tuned using either full SFT or PEFT (LoRA) approaches. See the Gemma 3 Fine-Tuning Guide for a complete walkthrough covering dataset preparation, configuration, and multi-GPU training.

In these guides, we use the quintend/rdr-items and naver-clova-ix/cord-v2 datasets for demonstration purposes. Update the recipe YAML dataset section to use your own data. See VLM datasets and dataset overview.