Vision Language Models (VLMs)#

Introduction#

Vision Language Models (VLMs) are advanced models that integrate vision and language processing capabilities. They are trained on extensive datasets containing both interleaved images and text data, allowing them to generate text descriptions of images and answer questions related to images.

NeMo AutoModel LLM APIs can be easily extended to support VLM tasks. While most of the training setup is the same, some additional steps are required to prepare the data and model for VLM training.

Run VLMs with NeMo AutoModel#

To run VLMs with NeMo AutoModel, use NeMo container version 25.11.00 or later. If the model you want to finetune requires a newer version of Transformers, you may need to upgrade to the latest NeMo AutoModel using:

   pip3 install --upgrade git+git@github.com:NVIDIA-NeMo/AutoModel.git

For other installation options (e.g., uv) please see our Installation Guide.

Supported Models#

NeMo AutoModel supports AutoModelForImageTextToText in the Image-Text-to-Text category. Specifically, the following VLM models from Hugging Face have been tested and support both Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) with LoRA:

Model	Dataset	FSDP2	PEFT	Example YAML
Kimi-VL-A3B-Instruct & Kimi-K25-VL	cord-v2, MedPix-VQA	Supported	Supported	kimi2vl_cordv2.yaml, kimi25vl_medpix.yaml
Gemma 3-4B & 27B	cord-v2, MedPix-VQA	Supported	Supported	gemma3_vl_4b_cord_v2.yaml, gemma3_vl_4b_cord_v2_peft.yaml, gemma3_vl_4b_cord_v2_megatron_fsdp.yaml, gemma3_vl_4b_medpix.yaml, gemma3_vl_4b_medpix_peft.yaml
Gemma 3n	MedPix-VQA	Supported	Supported	gemma3n_vl_4b_medpix.yaml, gemma3n_vl_4b_medpix_peft.yaml
Nemotron-Parse-v1.1	cord-v2	Supported	Supported	nemotron_parse_v1_1.yaml
Qwen2.5-VL-3B-Instruct	rdr-items	Supported	Supported	qwen2_5_vl_3b_rdr.yaml
Qwen3-VL-{4B,8B}-Instruct	rdr-items	Supported	Supported	qwen3_vl_4b_instruct_rdr.yaml, qwen3_vl_8b_instruct_rdr.yaml
Qwen3-VL-MoE	MedPix-VQA	Supported	Supported	qwen3_vl_moe_30b_te_deepep.yaml, qwen3_vl_moe_235b.yaml
Qwen3-Omni-30BA3B	MedPix-VQA	Supported	Supported	qwen3_omni_moe_30b_te_deepep.yaml
InternVL3.5-4B	MedPix-VQA	Supported	Supported	internvl_3_5_4b.yaml
Ministral3-{3B,8B,14B}	MedPix-VQA	Supported	Supported	ministral3_3b_medpix.yaml, ministral3_8b_medpix.yaml, ministral3_14b_medpix.yaml
Qwen3.5-MoE	MedPix-VQA	Supported	Supported	qwen3_5_moe_medpix.yaml, qwen3_5_35b.yaml
Qwen3.5-{4B,9B}	MedPix-VQA	Supported	Supported	qwen3_5_4b.yaml, qwen3_5_9b.yaml
Mistral-Small-4-119B	MedPix-VQA	Supported	Supported	mistral4_medpix.yaml
Phi-4-multimodal-instruct	commonvoice_17_tr_fixed	Supported	Supported	phi4_mm_cv17.yaml

For detailed instructions on fine-tuning these models using both SFT and PEFT approaches, please refer to the Gemma 3 and Gemma 3n Fine-Tuning Guide. The guide covers dataset preparation, configuration, and running both full fine-tuning and LoRA-based parameter efficient fine-tuning.

Additional VLMs with FSDP2 Support#

The following VLM architectures have built-in FSDP2 parallelization support in NeMo AutoModel. They can be loaded via NeMoAutoModelForImageTextToText.from_pretrained and used with FSDP2 for distributed fine-tuning, though they do not yet have dedicated example configs.

Architecture	Example HF Models
`Qwen2VLForConditionalGeneration`	`Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-2B-Instruct`
`SmolVLMForConditionalGeneration`	`HuggingFaceTB/SmolVLM-Instruct`, `HuggingFaceTB/SmolVLM-256M-Instruct`
`LlavaForConditionalGeneration`	`llava-hf/llava-1.5-7b-hf`, `llava-hf/llava-1.5-13b-hf`
`LlavaNextForConditionalGeneration`	`llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-34b-hf`
`LlavaNextVideoForConditionalGeneration`	`llava-hf/LLaVA-NeXT-Video-7B-hf`
`LlavaOnevisionForConditionalGeneration`	`llava-hf/llava-onevision-qwen2-7b-ov-hf`
`Llama4ForConditionalGeneration`	`meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`

Dataset Examples#

Tip

In these guides, we use the quintend/rdr-items and naver-clova-ix/cord-v2 datasets for demonstration purposes, but you can use your own data.

To do so, update the recipe YAML dataset section (for example dataset._target_, path_or_dataset, and split) and ensure your dataloader.collate_fn matches the model/dataset. See VLM datasets and dataset overview.

RDR Items Dataset#

The rdr items dataset quintend/rdr-items is a small dataset containing 48 images with descriptions. This dataset serves as an example of how to prepare image-text data for VLM fine-tuning. For complete instructions on dataset preprocessing and the collate functions used, see the Gemma Fine-Tuning Guide.

CORD-v2 Dataset#

The cord-v2 dataset naver-clova-ix/cord-v2 contains receipts with descriptions in JSON format. This demonstrates handling structured data in VLMs. The Gemma Fine-Tuning Guide provides detailed examples of custom preprocessing and collate functions for similar datasets.

Train VLM Models#

All supported models can be fine-tuned using either full SFT or PEFT approaches. The Gemma Fine-Tuning Guide provides complete instructions for:

Configuring YAML-based training.
Running single-GPU and multi-GPU training.
Setting up PEFT with LoRA.
Model checkpointing and W&B integration.