Omni Models

Omni models go beyond image-text understanding to support additional modalities such as audio and video. Some handle text, image, audio, and video together in a single unified model.

Run Omni Models with NeMo AutoModel

To run omni models with NeMo AutoModel, use NeMo container version 26.06.00 or later. If the model you want to fine-tune requires a newer version of Transformers than the container provides, you may need to upgrade NeMo AutoModel to the latest version from source:

$ pip3 install --upgrade git+https://github.com/NVIDIA-NeMo/AutoModel.git

For other installation options, see our NeMo AutoModel Installation Guide.
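
Before deciding whether an upgrade is needed, it can help to see which versions are already installed in the container. The following is a minimal sketch using only the Python standard library. The helper names `installed_version` and `report_versions` are hypothetical, and the distribution name `nemo-automodel` is an assumption; adjust it if your environment registers the package under a different name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report_versions(packages: Sequence[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # "nemo-automodel" is an assumed distribution name, used for illustration.
    for name, found in report_versions(["transformers", "torch", "nemo-automodel"]).items():
        print(f"{name}: {found or 'not installed'}")
```

Compare the reported Transformers version against the minimum version listed on the model card of the checkpoint you plan to fine-tune.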

Supported Models

| Owner | Model | Modalities | Architecture |
| --- | --- | --- | --- |
| Qwen / Alibaba Cloud | Qwen3-Omni | Text · Image · Audio · Video | Qwen3OmniForConditionalGeneration |
| Qwen / Alibaba Cloud | Qwen2.5-Omni | Text · Image · Audio · Video | Qwen2_5OmniForConditionalGeneration |
| Microsoft | Phi-4-multimodal | Text · Image · Audio | Phi4MultimodalForCausalLM |
| NVIDIA | Nemotron-3-Nano-Omni | Text · Image · Audio | NemotronH_Nano_Omni_Reasoning_V3 |
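
When choosing a model for a dataset, the practical question is usually which entries cover the modalities you need. The sketch below encodes the table above as a plain dictionary and filters it by modality. The dictionary and the `models_supporting` helper are illustrative only and are not part of NeMo AutoModel.

```python
from typing import Dict, FrozenSet, List

# Modalities per model, transcribed from the supported-models table.
SUPPORTED_OMNI_MODELS: Dict[str, FrozenSet[str]] = {
    "Qwen3-Omni": frozenset({"text", "image", "audio", "video"}),
    "Qwen2.5-Omni": frozenset({"text", "image", "audio", "video"}),
    "Phi-4-multimodal": frozenset({"text", "image", "audio"}),
    "Nemotron-3-Nano-Omni": frozenset({"text", "image", "audio"}),
}


def models_supporting(*modalities: str) -> List[str]:
    """Return the models (sorted by name) that cover every requested modality."""
    wanted = {m.lower() for m in modalities}
    return sorted(
        name
        for name, supported in SUPPORTED_OMNI_MODELS.items()
        if wanted <= supported
    )


if __name__ == "__main__":
    print(models_supporting("audio", "video"))
```

For example, requesting `"video"` returns only the two Qwen models, since the table lists the other two entries without video support.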

Fine-Tune Omni Models

All supported omni models can be fine-tuned with either full supervised fine-tuning (SFT) or parameter-efficient fine-tuning (PEFT) using LoRA. See the VLM Fine-Tuning Guide for general setup instructions.
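
To give a feel for the trade-off between the two approaches, the sketch below compares trainable parameter counts for a single linear layer. Full SFT updates the whole weight matrix of shape (d_out, d_in), while LoRA freezes it and trains two low-rank factors of shapes (d_out, r) and (r, d_in). This is a back-of-the-envelope illustration, not NeMo AutoModel code, and the layer size and rank are assumed values that do not come from any model listed above.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Trainable parameters when the full weight matrix is updated (full SFT)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x r) and A (r x d_in)."""
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of the layer's parameters that LoRA trains, relative to full SFT."""
    return lora_param_count(d_out, d_in, rank) / full_param_count(d_out, d_in)


if __name__ == "__main__":
    # Illustrative sizes only: a 4096 x 4096 projection with LoRA rank 16.
    d_out, d_in, rank = 4096, 4096, 16
    print(f"full SFT: {full_param_count(d_out, d_in):,} trainable parameters")
    print(f"LoRA:     {lora_param_count(d_out, d_in, rank):,} trainable parameters")
    print(f"fraction: {trainable_fraction(d_out, d_in, rank):.4%}")
```

With these assumed sizes, LoRA trains 131,072 of the layer's 16,777,216 parameters, which is under 1%. That is why PEFT typically fits in far less GPU memory than full SFT.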