Qwen3-VL#
Qwen3-VL is the latest generation of vision-language models from Alibaba Cloud, supporting multimodal understanding across text, images, and videos. Qwen3-VL includes both dense models and Mixture-of-Experts (MoE) variants for improved efficiency.
NeMo Megatron Bridge supports finetuning Qwen3-VL models (8B dense and 30B MoE variants).
Tip
We use the following environment variables throughout this page:
HF_MODEL_PATH=Qwen/Qwen3-VL-8B-Instruct (or Qwen/Qwen3-VL-30B-A3B-Instruct for MoE)
MEGATRON_MODEL_PATH=/models/Qwen3-VL-8B-Instruct (feel free to set your own path)
Unless explicitly stated, any Megatron model path in the commands below should NOT contain the iteration number iter_xxxxxx. For more details on checkpointing, see the checkpointing documentation.
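As a minimal sketch, the two variables can be exported in a shell session and the "no iteration number" rule checked before running any command. The function is_checkpoint_root is a hypothetical helper written for this illustration, not part of Megatron Bridge, and it assumes iteration directories are named with an iter_ prefix as in iter_xxxxxx.

```shell
# Values taken from the tip; point MEGATRON_MODEL_PATH at your own location if needed.
export HF_MODEL_PATH=Qwen/Qwen3-VL-8B-Instruct   # or Qwen/Qwen3-VL-30B-A3B-Instruct for MoE
export MEGATRON_MODEL_PATH=/models/Qwen3-VL-8B-Instruct

# Hypothetical helper (illustration only): succeeds when the last path
# component is NOT an iteration directory such as iter_0001000.
is_checkpoint_root() {
  case "$(basename "$1")" in
    iter_*) return 1 ;;
    *) return 0 ;;
  esac
}

if is_checkpoint_root "$MEGATRON_MODEL_PATH"; then
  echo "OK: $MEGATRON_MODEL_PATH"
else
  echo "Drop the iter_xxxxxx suffix from: $MEGATRON_MODEL_PATH" >&2
fi
```

The check only inspects the final path component, so a trailing iteration directory is caught while the rest of the path is left alone.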
Examples#
For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the Qwen3-VL Examples.
Hugging Face Model Cards#
Qwen3-VL-8B: https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
Qwen3-VL-30B-A3B (MoE): https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct
Qwen3-VL-235B-A22B (MoE): https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct