Qwen3-VL#

Qwen3-VL is the latest generation of vision-language models from Alibaba Cloud, supporting multimodal understanding across text, images, and videos. Qwen3-VL includes both dense models and Mixture-of-Experts (MoE) variants for improved efficiency.

NeMo Megatron Bridge supports finetuning Qwen3-VL models (8B dense and 30B MoE variants).

Tip

We use the following environment variables throughout this page

HF_MODEL_PATH=Qwen/Qwen3-VL-8B-Instruct (or Qwen/Qwen3-VL-30B-A3B-Instruct for MoE)
MEGATRON_MODEL_PATH=/models/Qwen3-VL-8B-Instruct (feel free to set your own path) Unless explicitly stated, any megatron model path in the commands below should NOT contain the iteration number iter_xxxxxx. For more details on checkpointing, please see here

Examples#

For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the Qwen3-VL Examples.

Hugging Face Model Cards#

Qwen3-VL-8B: https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
Qwen3-VL-30B-A3B (MoE): https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct
Qwen3-VL-235B-A22B (MoE): https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct