Qwen2.5-VL#
Qwen2.5-VL is a series of vision-language models developed by Alibaba Cloud that enable multimodal understanding across text, images, and videos. The models support various vision-language tasks including image understanding, visual question answering, and multimodal reasoning.
NeMo Megatron Bridge supports finetuning Qwen2.5-VL models (3B, 7B, 32B, and 72B variants) on single-image and multi-image datasets. The finetuned model can be converted back to the 🤗 Hugging Face format for downstream evaluation.
Tip
We use the following environment variables throughout this page:
- HF_MODEL_PATH=Qwen/Qwen2.5-VL-3B-Instruct (it can also be set to Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, or Qwen/Qwen2.5-VL-72B-Instruct)
- MEGATRON_MODEL_PATH=/models/Qwen2.5-VL-3B-Instruct (feel free to set your own path)
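For example, set them in your shell before running the commands on this page (the Megatron path is just a sample location; adjust as needed):
export HF_MODEL_PATH=Qwen/Qwen2.5-VL-3B-Instruct
export MEGATRON_MODEL_PATH=/models/Qwen2.5-VL-3B-Instruct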
Unless explicitly stated, any Megatron model path in the commands below should NOT contain the iteration number iter_xxxxxx. For more details, please refer to the checkpointing documentation.
Conversion with 🤗 Hugging Face#
Import HF → Megatron#
To import the HF model to your desired $MEGATRON_MODEL_PATH, run the following command.
python examples/conversion/convert_checkpoints.py import \
--hf-model $HF_MODEL_PATH \
--megatron-path $MEGATRON_MODEL_PATH
Export Megatron → HF#
You can export a trained model with the following command.
python examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL_PATH \
--megatron-path <trained megatron model path> \
--hf-path <output hf model path>
Run In-Framework Inference on Converted Checkpoint#
You can run a quick sanity check on the converted checkpoint with the following command.
python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $MEGATRON_MODEL_PATH \
--image_path <example image path> \
--prompt "Describe this image." \
--max_new_tokens 100
Note:
- --megatron_model_path is optional. If not specified, the script converts the model and then runs a forward pass; if specified, the script just loads the Megatron model.
- --max_new_tokens controls the number of tokens to generate.
- You can also use image URLs: --image_path="https://example.com/image.jpg"
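Putting it together, a sanity check against a remote image could look like the following (the URL is a placeholder; any accessible image URL works):
python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $MEGATRON_MODEL_PATH \
--image_path "https://example.com/image.jpg" \
--prompt "Describe this image." \
--max_new_tokens 100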
Finetuning Recipes#
Before training, ensure the following environment variables are set:
- SAVE_DIR: specifies a checkpoint and log saving directory, used in the commands below.
- HF_TOKEN: to download models from the HF Hub (if required).
- HF_HOME: (optional) to avoid re-downloading models and datasets every time.
- WANDB_API_KEY: (optional) to enable WandB logging.
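For example, in a bash shell (all values below are placeholders; substitute your own):
export SAVE_DIR=<checkpoint and log directory>
export HF_TOKEN=<your HF token>
export HF_HOME=<optional cache directory>
export WANDB_API_KEY=<optional WandB API key>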
Full Finetuning#
Example usage for full parameter finetuning:
torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--dataset-type hf \
dataset.maker_name=make_cord_v2_dataset \
train.global_batch_size=<batch size> \
train.train_iters=<number of iterations> \
logger.wandb_project=<optional wandb project name> \
logger.wandb_save_dir=$SAVE_DIR \
checkpoint.save=$SAVE_DIR/<experiment name>
Note:
The --recipe parameter selects the model size configuration. Available options:
- qwen25_vl_3b_finetune_config - for the 3B model
- qwen25_vl_7b_finetune_config - for the 7B model
- qwen25_vl_32b_finetune_config - for the 32B model
- qwen25_vl_72b_finetune_config - for the 72B model
The config file examples/recipes/qwen_vl/conf/qwen25_vl_pretrain_override_example.yaml contains a list of arguments that can be overridden on the command line. For example, you can set train.global_batch_size=<batch size> in the command.
The dataset format should be JSONL in conversation format (see the dataset section below).
After training, you can run inference with hf_to_megatron_generate_vlm.py by supplying the trained Megatron checkpoint. You can also export the trained checkpoint to the Hugging Face format.
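A minimal sketch of that workflow, assuming the finetuned checkpoint was written to checkpoint.save=$SAVE_DIR/<experiment name> (paths are placeholders):
python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $SAVE_DIR/<experiment name> \
--image_path <example image path> \
--prompt "Describe this image." \
--max_new_tokens 100
python examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL_PATH \
--megatron-path $SAVE_DIR/<experiment name> \
--hf-path <output hf model path>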
Parameter-Efficient Finetuning (PEFT)#
Parameter-efficient finetuning (PEFT) using LoRA or DoRA is supported. You can use the --peft_scheme argument to enable PEFT training:
torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--peft_scheme lora \
--dataset-type hf \
dataset.maker_name=make_cord_v2_dataset \
train.global_batch_size=<batch size> \
checkpoint.save=$SAVE_DIR/<experiment name>
PEFT options:
- --peft_scheme: Set to lora for LoRA (Low-Rank Adaptation) or dora for DoRA (Weight-Decomposed Low-Rank Adaptation). Set to None or omit for full finetuning.
You can also combine PEFT with freeze options to control which components are trainable:
- model.freeze_language_model: Set to True to freeze the language model
- model.freeze_vision_model: Set to True to freeze the vision encoder
- model.freeze_vision_projection: Set to True to freeze the vision projection layer
Example with LoRA and freeze options:
torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--peft_scheme lora \
model.freeze_language_model=True \
model.freeze_vision_model=False \
model.freeze_vision_projection=False \
checkpoint.save=$SAVE_DIR/<experiment name>
Example Datasets#
Megatron Bridge provides several example vision-language datasets that can be used to finetune Qwen2.5-VL:
| Dataset | Maker Name | Description |
|---|---|---|
| CORD-v2 | make_cord_v2_dataset | OCR receipts: single-image-text dataset for receipt understanding; outputs XML-like annotated text. |
| Medical VQA |  | Medical VQA: single-image question-answer dataset covering clinical medical images and free-form answers. |
| RAVEN | make_raven_dataset | Visual reasoning: multi-image visual reasoning dataset for analogical reasoning across different visual layouts. |
To change the dataset, override dataset.maker_name with the corresponding maker name, for example dataset.maker_name=make_raven_dataset, as shown below.
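For instance, to finetune on the multi-image visual reasoning dataset, reuse the full-finetuning command above with only the maker name changed (batch size and iteration count remain placeholders):
torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--dataset-type hf \
dataset.maker_name=make_raven_dataset \
train.global_batch_size=<batch size> \
train.train_iters=<number of iterations> \
checkpoint.save=$SAVE_DIR/<experiment name>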
Hugging Face Model Cards#
- Qwen2.5-VL-3B: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
- Qwen2.5-VL-7B: https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
- Qwen2.5-VL-32B: https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct
- Qwen2.5-VL-72B: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct