Omni Models

Omni models go beyond image-text understanding to support additional modalities such as audio and video. Some handle text, image, audio, and video together in a single unified model.

Run Omni Models with NeMo AutoModel

To run omni models with NeMo AutoModel, use NeMo container version 25.11.00 or later. If the model you want to fine-tune requires a newer version of Transformers than the container provides, you may need to upgrade NeMo AutoModel from source:

pip3 install --upgrade git+git@github.com:NVIDIA-NeMo/AutoModel.git

For other installation options, see our Installation Guide.
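Before upgrading, it can help to check which Transformers version is already installed. The following is a minimal sketch using only the Python standard library. The helper names (`parse_version`, `meets_minimum`, `installed_version`) and the required version `4.57.0` are hypothetical placeholders, not values from NeMo AutoModel; substitute the minimum version stated on the model card of the model you plan to fine-tune.

```python
from __future__ import annotations

from importlib import metadata


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '25.11.00' into a tuple of ints.

    Only the leading digits of each component are used, and parsing stops
    at the first component with no leading digits (for example 'dev0').
    Pre-release suffixes are therefore ignored, which is a simplification.
    """
    parts = []
    for component in version.split("."):
        digits = ""
        for ch in component:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """Return True if `installed` is at least `required`."""
    a, b = parse_version(installed), parse_version(required)
    # Pad with zeros so that '4.57' and '4.57.0' compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package: str) -> str | None:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    REQUIRED_TRANSFORMERS = "4.57.0"  # hypothetical; check the model card
    found = installed_version("transformers")
    if found is None:
        print("transformers is not installed")
    elif meets_minimum(found, REQUIRED_TRANSFORMERS):
        print(f"transformers {found} is new enough")
    else:
        print(f"transformers {found} is older than {REQUIRED_TRANSFORMERS}; consider upgrading")
```

The comparison is deliberately simple. If the `packaging` library is available in your environment, `packaging.version.Version` handles pre-release and development versions more rigorously.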

Supported Models

Owner	Model	Modalities	Architecture
Qwen / Alibaba Cloud	Qwen3-Omni	Text · Image · Audio · Video	`Qwen3OmniForConditionalGeneration`
Microsoft	Phi-4-multimodal	Text · Image · Audio	`Phi4MultimodalForCausalLM`

Fine-Tuning

All supported omni models can be fine-tuned with either full supervised fine-tuning (SFT) or parameter-efficient fine-tuning (PEFT) using LoRA. See the VLM Fine-Tuning Guide for general setup instructions.
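To give a rough sense of the trade-off between the two approaches, the sketch below compares the number of trainable parameters for a single linear layer under full fine-tuning and under LoRA. It illustrates the general LoRA idea (a frozen weight matrix plus two trainable low-rank factors) and does not use the NeMo AutoModel API. The layer size and rank are illustrative assumptions, not values taken from any supported model.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights when only the LoRA factors are updated.

    LoRA keeps the original matrix frozen and learns two small factors:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096  # illustrative layer size (assumption)
    rank = 16            # illustrative LoRA rank (assumption)

    full = full_finetune_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full fine-tuning: {full:,} trainable weights")
    print(f"LoRA (rank {rank}): {lora:,} trainable weights")
    print(f"LoRA trains {100 * lora / full:.2f}% of the layer's weights")
```

Under these assumed sizes, LoRA trains well under 1% of the layer's weights, which is why it typically needs less memory than full SFT. Which approach gives better results depends on the model and dataset.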