Qwen2.5-VL#
Qwen2.5-VL is a series of vision-language models developed by Alibaba Cloud that enable multimodal understanding across text, images, and videos. The models support various vision-language tasks including image understanding, visual question answering, and multimodal reasoning.
NeMo Megatron Bridge supports finetuning Qwen2.5-VL models (3B, 7B, 32B, and 72B variants) on single-image and multi-image datasets. The finetuned model can be converted back to the 🤗 Hugging Face format for downstream evaluation.
Tip
We use the following environment variables throughout this page:
- HF_MODEL_PATH=Qwen/Qwen2.5-VL-3B-Instruct (it can also be set to Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, or Qwen/Qwen2.5-VL-72B-Instruct)
- MEGATRON_MODEL_PATH=/models/Qwen2.5-VL-3B-Instruct (feel free to set your own path)
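For example, set them in your shell before running the commands on this page (the Megatron path is just a sample location; adjust as needed):
export HF_MODEL_PATH=Qwen/Qwen2.5-VL-3B-Instruct
export MEGATRON_MODEL_PATH=/models/Qwen2.5-VL-3B-Instruct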
Unless explicitly stated, any Megatron model path in the commands below should NOT contain the iteration number iter_xxxxxx. For more details, please refer to the checkpointing documentation.
Conversion with 🤗 Hugging Face#
Import HF → Megatron#
To import the HF model to your desired $MEGATRON_MODEL_PATH, run the following command.
python examples/conversion/convert_checkpoints.py import \
--hf-model $HF_MODEL_PATH \
--megatron-path $MEGATRON_MODEL_PATH
Export Megatron → HF#
You can export a trained model with the following command.
python examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL_PATH \
--megatron-path <trained megatron model path> \
--hf-path <output hf model path>
Run In-Framework Inference on Converted Checkpoint#
You can run a quick sanity check on the converted checkpoint with the following command.
python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $MEGATRON_MODEL_PATH \
--image_path <example image path> \
--prompt "Describe this image." \
--max_new_tokens 100
Note:
- --megatron_model_path is optional. If not specified, the script converts the model and then runs a forward pass; if specified, the script just loads the Megatron model.
- --max_new_tokens controls the number of tokens to generate.
- You can also use image URLs: --image_path="https://example.com/image.jpg"
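Putting it together, a sanity check against a remote image could look like the following (the URL is a placeholder; any accessible image URL works):
python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $MEGATRON_MODEL_PATH \
--image_path "https://example.com/image.jpg" \
--prompt "Describe this image." \
--max_new_tokens 100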
Finetuning Recipes#
Before training, ensure the following environment variables are set:
- SAVE_DIR: specifies a checkpoint and log saving directory, used in the commands below.
- HF_TOKEN: to download models from the HF Hub (if required).
- HF_HOME: (optional) to avoid re-downloading models and datasets every time.
- WANDB_API_KEY: (optional) to enable WandB logging.
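For example, in a bash shell (all values below are placeholders; substitute your own):
export SAVE_DIR=<checkpoint and log directory>
export HF_TOKEN=<your HF token>
export HF_HOME=<optional cache directory>
export WANDB_API_KEY=<optional WandB API key>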
Full Finetuning#
Example usage for full parameter finetuning:
torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--dataset-type hf \
dataset.maker_name=make_cord_v2_dataset \
train.global_batch_size=<batch size> \
train.train_iters=<number of iterations> \
logger.wandb_project=<optional wandb project name> \
logger.wandb_save_dir=$SAVE_DIR \
checkpoint.save=$SAVE_DIR/<experiment name>
Note:
The --recipe parameter selects the model size configuration. Available options:
- qwen25_vl_3b_finetune_config - for the 3B model
- qwen25_vl_7b_finetune_config - for the 7B model
- qwen25_vl_32b_finetune_config - for the 32B model
- qwen25_vl_72b_finetune_config - for the 72B model
The config file examples/recipes/qwen_vl/conf/qwen25_vl_pretrain_override_example.yaml contains a list of arguments that can be overridden on the command line. For example, you can set train.global_batch_size=<batch size> in the command.
The dataset format should be JSONL in conversation format (see the dataset section below).
After training, you can run inference with hf_to_megatron_generate_vlm.py by supplying the trained Megatron checkpoint. You can also export the trained checkpoint to the Hugging Face format.
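A minimal sketch of that workflow, assuming the finetuned checkpoint was written to checkpoint.save=$SAVE_DIR/<experiment name> (paths are placeholders):
python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $SAVE_DIR/<experiment name> \
--image_path <example image path> \
--prompt "Describe this image." \
--max_new_tokens 100
python examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL_PATH \
--megatron-path $SAVE_DIR/<experiment name> \
--hf-path <output hf model path>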
Parameter-Efficient Finetuning (PEFT)#
Parameter-efficient finetuning (PEFT) using LoRA or DoRA is supported. You can use the --peft_scheme argument to enable PEFT training:
torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--peft_scheme lora \
--dataset-type hf \
dataset.maker_name=make_cord_v2_dataset \
train.global_batch_size=<batch size> \
checkpoint.save=$SAVE_DIR/<experiment name>
PEFT options:
- --peft_scheme: Set to lora for LoRA (Low-Rank Adaptation) or dora for DoRA (Weight-Decomposed Low-Rank Adaptation). Set to None or omit for full finetuning.
You can also combine PEFT with freeze options to control which components are trainable:
- model.freeze_language_model: Set to True to freeze the language model
- model.freeze_vision_model: Set to True to freeze the vision encoder
- model.freeze_vision_projection: Set to True to freeze the vision projection layer
Example with LoRA and freeze options:
torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--peft_scheme lora \
model.freeze_language_model=True \
model.freeze_vision_model=False \
model.freeze_vision_projection=False \
checkpoint.save=$SAVE_DIR/<experiment name>
Example Datasets#
Megatron Bridge provides several example vision-language datasets that can be used to finetune Qwen2.5-VL:
| Dataset | Maker Name | Description |
|---|---|---|
| CORD-v2 | make_cord_v2_dataset | OCR receipts: single-image-text dataset for receipt understanding; outputs XML-like annotated text. |
| Medical VQA |  | Medical VQA: single-image question-answer dataset covering clinical medical images and free-form answers. |
| RAVEN | make_raven_dataset | Visual reasoning: multi-image visual reasoning dataset for analogical reasoning across different visual layouts. |
To change the dataset, override dataset.maker_name with the corresponding maker name, for example dataset.maker_name=make_raven_dataset, as shown below.
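For instance, to finetune on the multi-image visual reasoning dataset, reuse the full-finetuning command above with only the maker name changed (batch size and iteration count remain placeholders):
torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--dataset-type hf \
dataset.maker_name=make_raven_dataset \
train.global_batch_size=<batch size> \
train.train_iters=<number of iterations> \
checkpoint.save=$SAVE_DIR/<experiment name>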
Hugging Face Model Cards#
- Qwen2.5-VL-3B: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
- Qwen2.5-VL-7B: https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
- Qwen2.5-VL-32B: https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct
- Qwen2.5-VL-72B: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct