Qwen2-VL#
Qwen2-VL is a series of vision language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL provides state-of-the-art understanding of images across a range of resolutions and aspect ratios, understanding of videos longer than 20 minutes, multilingual support, and agent capabilities that can operate devices such as mobile phones and robots.
Import from Hugging Face to NeMo 2.0#
To import the Hugging Face (HF) model and convert it to NeMo 2.0 format, run the following command. This step only needs to be performed once:
from nemo.collections.llm import import_ckpt
from nemo.collections import vlm

if __name__ == '__main__':
    # Specify the Hugging Face model ID
    hf_model_id = "Qwen/Qwen2-VL-2B-Instruct"

    # Import the model and convert to NeMo 2.0 format
    import_ckpt(
        model=vlm.Qwen2VLModel(vlm.Qwen2VLConfig2B()),  # Model configuration
        source=f"hf://{hf_model_id}",  # Hugging Face model source
    )
The command above saves the converted file in the NeMo cache folder, located at ~/.cache/nemo.
If needed, you can change the default cache directory by setting the NEMO_CACHE_DIR environment variable before running the script.
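For example, a minimal sketch of redirecting the cache from within Python before any NeMo imports run (the path shown is only an illustration; exporting NEMO_CACHE_DIR in your shell works equally well):

import os

# Redirect the NeMo cache to a custom location (illustrative path).
# Set this at the very top of the script, before any NeMo imports, so the
# variable is visible when the cache location is resolved.
os.environ["NEMO_CACHE_DIR"] = "/workspace/nemo_cache"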
NeMo 2.0 Qwen2-VL Scripts#
The scripts for working with Qwen2-VL models within the NeMo Framework are located in scripts/vlm.
qwen2vl_finetune.py: Fine-tunes a Qwen2-VL model on a given dataset.
Usage:
torchrun --nproc_per_node=2 scripts/vlm/qwen2vl_finetune.py \
    --devices=2 --tp=2 --data_type=mock \
    --mbs=1 --gbs=4
qwen2vl_generate.py: Performs inference (generation) using a fine-tuned or pre-converted Qwen2-VL NeMo 2.0 model.
Usage:
python scripts/vlm/qwen2vl_generate.py --load_from_hf
NeMo 2.0 Fine-Tuning Recipes#
We provide pre-defined recipes for fine-tuning Qwen2-VL using NeMo 2.0 and NeMo-Run.
These recipes configure a run.Partial for one of the nemo.collections.llm API functions introduced in NeMo 2.0. The recipes are hosted in the qwen2vl_2b and qwen2vl_7b files and use a mock dataset for training.
Note
The recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
By default, the non-instruct version of the model is loaded. To load a different model, set finetune.resume.restore_config.path=nemo://<hf_model_id> or finetune.resume.restore_config.path=<local_model_path>.
We provide an example below on how to invoke the default recipe and override the data argument:
from nemo.collections import vlm

finetune = vlm.qwen2vl_2b.finetune_recipe(
    name="qwen2vl_2b_finetune",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='lora',  # 'lora', 'none'
)
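As a minimal sketch, the recipe created above can be pointed at the instruct checkpoint imported earlier; a local NeMo 2.0 checkpoint path works the same way (the attribute layout of resume.restore_config follows the recipe defaults and may differ between NeMo releases):

# Load the instruct checkpoint instead of the default non-instruct model.
# A local NeMo 2.0 checkpoint path can be used here instead of the nemo:// URI.
finetune.resume.restore_config.path = "nemo://Qwen/Qwen2-VL-2B-Instruct"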
By default, the fine-tuning recipe applies LoRA to all linear layers in the language model, including cross-attention layers, while keeping the vision model unfrozen.
To configure which layers LoRA is applied to, set finetune.peft.target_modules. For example, to apply LoRA only to the self-attention qkv projection layers, set finetune.peft.target_modules=["*.language_model.*.linear_qkv"].
To freeze the vision model, set finetune.peft.freeze_vision_model=True.
To fine-tune the entire model without LoRA, set peft_scheme='none' in the recipe argument.
A combined example is shown in the sketch below.
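This sketch applies the overrides to the recipe object created above; the attribute names come straight from the options listed, though exact recipe defaults may differ between NeMo releases:

# Apply LoRA only to the self-attention qkv projection layers of the language model
finetune.peft.target_modules = ["*.language_model.*.linear_qkv"]

# Keep the vision model frozen during LoRA fine-tuning
finetune.peft.freeze_vision_model = True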
Note
The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which simply runs the fine-tuning locally in a separate process. You can use it as follows:
import nemo_run as run
run.run(finetune, executor=run.LocalExecutor())
Additionally, you can also run it directly in the same Python process as follows:
run.run(finetune, direct=True)
Bring Your Own Data#
Replace the MockDataModule in the default recipes with your custom dataset.
from nemo.collections import vlm
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
from transformers import Qwen2VLImageProcessor

# Define the fine-tuning recipe
finetune = vlm.qwen2vl_2b.finetune_recipe(
    name="qwen2vl_2b_finetune",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme='none',  # 'lora', 'none'
)

# The following is an example of a custom dataset configuration.
data_config = vlm.Qwen2VLDataConfig(
    image_folder="/path/to/images",
    conv_template="qwen2vl",  # Customize based on your dataset needs
)

# Data module setup
tokenizer = AutoTokenizer('Qwen/Qwen2-VL-2B')
image_processor = Qwen2VLImageProcessor()

custom_data = vlm.Qwen2VLPreloadedDataModule(
    paths="/path/to/dataset.json",  # Path to your dataset
    data_config=data_config,
    seq_length=4096,
    global_batch_size=16,  # Global batch size
    micro_batch_size=1,  # Micro batch size
    tokenizer=tokenizer,  # Define your tokenizer if needed
    image_processor=image_processor,  # Add an image processor if required
    num_workers=8,  # Number of workers for data loading
)

# Assign custom data to the fine-tuning recipe
finetune.data = custom_data
Use the Energon Dataloader#
The Energon Dataloader can be used with Qwen2-VL to handle multimodal datasets for training. This section explains how to set up and customize the dataloader, highlighting key components such as the task_encoder and multimodal_sample_config. For details on preparing data to use with the data module, refer to the Data Preparation section below.
Data Preparation#
Create a Webdataset#
python nemo/collections/vlm/qwen2vl/data/convert_to_qwen2vl_wds.py \
--dataset-root=<dataset root dir> \
--json=<conversation json file> \
--max-samples-per-tar=1000
Prepare the Dataset#
Run 'energon prepare <webdataset dir>' to prepare the dataset, and select 'Crude sample' as the sample type.
Example Code#
Below is an example of how to use the Energon Dataloader with Qwen2-VL for training:
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer
from nemo.collections.multimodal.data.energon import EnergonMultiModalDataModule
from transformers import Qwen2VLImageProcessor
# Qwen2VLTaskEncoder is provided by NeMo's Qwen2-VL data module; import it from
# the location used in your NeMo version.

# Load the tokenizer and image processor for the pre-trained model
tokenizer = AutoTokenizer("Qwen/Qwen2-VL-2B-Instruct")
image_processor = Qwen2VLImageProcessor()

# Define the dataset path
dataset_path = "<path_to_dataset>"

# Sequence length and batch sizes (example values)
max_sequence_length = 4096
mbs = 1   # micro batch size
gbs = 16  # global batch size

# Initialize the data module
data = EnergonMultiModalDataModule(
    path=dataset_path,
    tokenizer=tokenizer,
    image_processor=image_processor,
    seq_length=max_sequence_length,
    micro_batch_size=mbs,
    global_batch_size=gbs,
    num_workers=1,
    task_encoder=Qwen2VLTaskEncoder(
        tokenizer=tokenizer,
        image_processor=image_processor,
        max_padding_length=int(max_sequence_length * 0.9),
    ),
)
# Note: `EnergonMultiModalDataModule` defaults to `MultiModalTaskEncoder` if no custom task encoder is provided.
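As with the preloaded data module above, the resulting data module can be assigned to the fine-tuning recipe and launched locally. The following sketch reuses the finetune recipe and the NeMo-Run local executor shown earlier:

import nemo_run as run

# Swap the mock data module for the Energon-backed one and launch locally
finetune.data = data
run.run(finetune, executor=run.LocalExecutor())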
A comprehensive list of recipes that we currently support or plan to support soon is provided below for reference:
Recipe | Status
---|---
Qwen2-VL 2B LoRA | Yes
Qwen2-VL 2B Full fine-tuning | Yes
Qwen2-VL 7B LoRA | Yes
Qwen2-VL 7B Full fine-tuning | Yes