Qwen2.5-Omni
Qwen2.5-Omni is Alibaba Cloud’s omnimodal model supporting text, image, audio, and video inputs in a single unified architecture with a dense language backbone. NeMo AutoModel onboards the Thinker stack for audio understanding tasks such as automatic speech recognition (ASR).
Available Models
- Qwen2.5-Omni-3B: 3B dense backbone
- Qwen2.5-Omni-7B: 7B dense backbone
Architecture
The registry wires the Qwen2.5-Omni Thinker backbone under the following architecture keys:
- Qwen2_5OmniForConditionalGeneration
- Qwen2_5OmniModel
- Qwen2_5OmniThinkerForConditionalGeneration
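As a rough illustration of how these keys can be used, the sketch below checks whether a checkpoint's Hugging Face-style config declares one of the architecture keys listed above. It assumes the config carries an `architectures` list, as Hugging Face `config.json` files typically do. The helper `matching_architectures` and the example config are hypothetical and are not part of the NeMo AutoModel API.

```python
# Hypothetical helper for illustration only; not part of NeMo AutoModel.
# The keys are copied from the architecture list in this section.
SUPPORTED_ARCHITECTURES = frozenset(
    {
        "Qwen2_5OmniForConditionalGeneration",
        "Qwen2_5OmniModel",
        "Qwen2_5OmniThinkerForConditionalGeneration",
    }
)


def matching_architectures(config: dict) -> list[str]:
    """Return the declared architectures that match a listed key.

    Assumes `config` follows the Hugging Face convention of an
    "architectures" list; a missing or empty field yields an empty result.
    """
    declared = config.get("architectures") or []
    return [name for name in declared if name in SUPPORTED_ARCHITECTURES]


# Illustrative config fragment (an assumption, not copied from a real checkpoint).
example_config = {"architectures": ["Qwen2_5OmniForConditionalGeneration"]}
print(matching_architectures(example_config))
```

An empty result would suggest the checkpoint declares an architecture outside the keys listed here, so it is worth confirming against the registry before launching a recipe.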
Example HF Models
Example Recipes
Try with NeMo AutoModel
1. Install (full instructions):
2. Clone the repo to get the example recipes:
3. Run the recipe from inside the repo:
Run with Docker
1. Pull the container and mount a checkpoint directory:
2. Navigate to the AutoModel directory (where the recipes are):
3. Run the recipe:
See the Installation Guide and Omni Fine-Tuning Guide.
Fine-Tuning
See the VLM / Omni Fine-Tuning Guide.