Vision Language Models (VLMs)#

Introduction#

Vision Language Models (VLMs) are advanced models that integrate vision and language processing capabilities. They are trained on extensive datasets containing both interleaved images and text data, allowing them to generate text descriptions of images and answer questions related to images.

NeMo AutoModel LLM APIs can be easily extended to support VLM tasks. While most of the training setup is the same, some additional steps are required to prepare the data and model for VLM training.

Run VLMs with NeMo AutoModel#

To run VLMs with NeMo AutoModel, use NeMo container version 25.11.00 or later. If the model you want to fine-tune requires a newer version of Transformers, you may need to upgrade to the latest NeMo AutoModel using:

   pip3 install --upgrade git+git@github.com:NVIDIA-NeMo/AutoModel.git

For other installation options (e.g., uv) please see our Installation Guide.

Supported Models#

NeMo AutoModel supports AutoModelForImageTextToText in the Image-Text-to-Text category. Specifically, the following VLM models from Hugging Face have been tested and support both Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) with LoRA:

Model

Dataset

FSDP2

PEFT

Example YAML

Kimi-VL-A3B-Instruct & Kimi-K25-VL

cord-v2, MedPix-VQA

Supported

Supported

kimi2vl_cordv2.yaml, kimi25vl_medpix.yaml

Gemma 3-4B & 27B

cord-v2, MedPix-VQA

Supported

Supported

gemma3_vl_4b_cord_v2.yaml, gemma3_vl_4b_cord_v2_peft.yaml, gemma3_vl_4b_cord_v2_megatron_fsdp.yaml, gemma3_vl_4b_medpix.yaml, gemma3_vl_4b_medpix_peft.yaml

Gemma 3n

MedPix-VQA

Supported

Supported

gemma3n_vl_4b_medpix.yaml, gemma3n_vl_4b_medpix_peft.yaml

Nemotron-Parse-v1.1

cord-v2

Supported

Supported

nemotron_parse_v1_1.yaml

Qwen2.5-VL-3B-Instruct

rdr-items

Supported

Supported

qwen2_5_vl_3b_rdr.yaml

Qwen3-VL-{4B,8B}-Instruct

rdr-items

Supported

Supported

qwen3_vl_4b_instruct_rdr.yaml, qwen3_vl_8b_instruct_rdr.yaml

Qwen3-VL-MoE

MedPix-VQA

Supported

Supported

qwen3_vl_moe_30b_te_deepep.yaml, qwen3_vl_moe_235b.yaml

Qwen3-Omni-30BA3B

MedPix-VQA

Supported

Supported

qwen3_omni_moe_30b_te_deepep.yaml

InternVL3.5-4B

MedPix-VQA

Supported

Supported

internvl_3_5_4b.yaml

Ministral3-{3B,8B,14B}

MedPix-VQA

Supported

Supported

ministral3_3b_medpix.yaml, ministral3_8b_medpix.yaml, ministral3_14b_medpix.yaml

Qwen3.5-MoE

MedPix-VQA

Supported

Supported

qwen3_5_moe_medpix.yaml

Phi-4-multimodal-instruct

commonvoice_17_tr_fixed

Supported

Supported

phi4_mm_cv17.yaml

For detailed instructions on fine-tuning these models using both SFT and PEFT approaches, please refer to the Gemma 3 and Gemma 3n Fine-Tuning Guide. The guide covers dataset preparation, configuration, and running both full fine-tuning and LoRA-based parameter efficient fine-tuning.

Additional VLMs with FSDP2 Support#

The following VLM architectures have built-in FSDP2 parallelization support in NeMo AutoModel. They can be loaded via NeMoAutoModelForImageTextToText.from_pretrained and used with FSDP2 for distributed fine-tuning, though they do not yet have dedicated example configs.

Architecture

Example HF Models

Qwen2VLForConditionalGeneration

Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2-VL-2B-Instruct

SmolVLMForConditionalGeneration

HuggingFaceTB/SmolVLM-Instruct, HuggingFaceTB/SmolVLM-256M-Instruct

LlavaForConditionalGeneration

llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf

LlavaNextForConditionalGeneration

llava-hf/llava-v1.6-mistral-7b-hf, llava-hf/llava-v1.6-34b-hf

LlavaNextVideoForConditionalGeneration

llava-hf/LLaVA-NeXT-Video-7B-hf

LlavaOnevisionForConditionalGeneration

llava-hf/llava-onevision-qwen2-7b-ov-hf

Llama4ForConditionalGeneration

meta-llama/Llama-4-Scout-17B-16E-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct

Dataset Examples#

Tip

In these guides, we use the quintend/rdr-items and naver-clova-ix/cord-v2 datasets for demonstration purposes, but you can use your own data.

To do so, update the recipe YAML dataset section (for example dataset._target_, path_or_dataset, and split) and ensure your dataloader.collate_fn matches the model/dataset. See VLM datasets and dataset overview.

RDR Items Dataset#

The rdr items dataset quintend/rdr-items is a small dataset containing 48 images with descriptions. This dataset serves as an example of how to prepare image-text data for VLM fine-tuning. For complete instructions on dataset preprocessing and the collate functions used, see the Gemma Fine-Tuning Guide.

CORD-v2 Dataset#

The cord-v2 dataset naver-clova-ix/cord-v2 contains receipts with descriptions in JSON format. This demonstrates handling structured data in VLMs. The Gemma Fine-Tuning Guide provides detailed examples of custom preprocessing and collate functions for similar datasets.

Train VLM Models#

All supported models can be fine-tuned using either full SFT or PEFT approaches. The Gemma Fine-Tuning Guide provides complete instructions for:

  • Configuring YAML-based training.

  • Running single-GPU and multi-GPU training.

  • Setting up PEFT with LoRA.

  • Model checkpointing and W&B integration.