Omni Models

Omni models go beyond image-text understanding to support additional modalities such as audio and video. Some handle text, image, audio, and video together in a single unified model.

Run Omni Models with NeMo AutoModel

To run omni models with NeMo AutoModel, use NeMo container version 25.11.00 or later. If the model you want to fine-tune requires a newer version of Transformers than the container provides, you may need to upgrade AutoModel from source:

pip3 install --upgrade git+https://github.com/NVIDIA-NeMo/AutoModel.git

For other installation options, see our Installation Guide.
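
Before upgrading, it can help to check which Transformers version is already installed. The following is a minimal sketch using only the Python standard library. The minimum version shown is a hypothetical placeholder, not a requirement stated by NeMo AutoModel; substitute the version named in your model's documentation. The comparison looks only at the leading numeric components, so pre-release suffixes such as `.dev0` or `rc1` are ignored.

```python
import re
from importlib import metadata

# Hypothetical placeholder: replace with the version your model actually needs.
HYPOTHETICAL_MINIMUM = "4.57.0"


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def needs_upgrade(installed, required):
    """True if the installed version is numerically lower than the required one."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so that "4.57" and "4.57.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_version("transformers")
    if current is None:
        print("transformers is not installed")
    elif needs_upgrade(current, HYPOTHETICAL_MINIMUM):
        print(f"transformers {current} is older than {HYPOTHETICAL_MINIMUM}")
    else:
        print(f"transformers {current} satisfies {HYPOTHETICAL_MINIMUM}")
```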

Supported Models

| Owner | Model | Modalities | Architecture |
| --- | --- | --- | --- |
| Qwen / Alibaba Cloud | Qwen3-Omni | Text · Image · Audio · Video | Qwen3OmniForConditionalGeneration |
| Microsoft | Phi-4-multimodal | Text · Image · Audio | Phi4MultimodalForCausalLM |
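
When choosing a model for a dataset, the table above can be treated as a small lookup keyed by model name. The sketch below only restates the table's contents as a Python dictionary. The names `SUPPORTED_OMNI_MODELS` and `models_supporting` are illustrative and are not part of NeMo AutoModel.

```python
# Illustrative restatement of the supported-models table; not a NeMo AutoModel API.
SUPPORTED_OMNI_MODELS = {
    "Qwen3-Omni": {
        "owner": "Qwen / Alibaba Cloud",
        "modalities": {"text", "image", "audio", "video"},
        "architecture": "Qwen3OmniForConditionalGeneration",
    },
    "Phi-4-multimodal": {
        "owner": "Microsoft",
        "modalities": {"text", "image", "audio"},
        "architecture": "Phi4MultimodalForCausalLM",
    },
}


def models_supporting(required_modalities):
    """Return the names of models whose modalities cover every requested one."""
    required = {m.lower() for m in required_modalities}
    return sorted(
        name
        for name, info in SUPPORTED_OMNI_MODELS.items()
        if required <= info["modalities"]  # subset test
    )


if __name__ == "__main__":
    print(models_supporting(["audio", "image"]))
    print(models_supporting(["video"]))
```

According to the table, only Qwen3-Omni lists video, so a dataset containing video clips narrows the choice to that model.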

Fine-Tuning

All supported omni models can be fine-tuned using full SFT or PEFT (LoRA) approaches. See the VLM Fine-Tuning Guide for general setup instructions.
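
To give a rough sense of the difference between the two approaches, the sketch below counts trainable parameters for a single linear layer under full SFT and under LoRA. It uses the standard LoRA formulation, in which a frozen weight of shape `d_out × d_in` gets a trainable low-rank update built from two matrices of shapes `d_out × r` and `r × d_in`. The layer size and rank are hypothetical values chosen for illustration, not settings taken from NeMo AutoModel or from either supported model.

```python
def full_sft_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights in the two LoRA factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    d_in, d_out, rank = 4096, 4096, 16

    full = full_sft_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)

    print(f"full SFT: {full:,} trainable weights")   # 16,777,216
    print(f"LoRA r={rank}: {lora:,} trainable weights")  # 131,072
    print(f"fraction trained with LoRA: {lora / full:.4%}")
```

For this hypothetical layer, LoRA trains 1/128 of the weights that full SFT would update, which is why PEFT is usually the lower-memory option.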