Omni Models
Omni models go beyond image-text understanding to support additional modalities such as audio and video. Some combine all four (text, image, audio, and video) in a single unified model.
Run Omni Models with NeMo AutoModel
To run omni models with NeMo AutoModel, use NeMo container version 25.11.00 or later. If the model you want to fine-tune requires a newer version of Transformers than the container provides, you may need to upgrade AutoModel from source:
pip3 install --upgrade git+https://github.com/NVIDIA-NeMo/AutoModel.git
For other installation options, see our Installation Guide.
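Before upgrading, it can help to confirm whether the installed Transformers release is actually older than what the model needs. The following is a minimal sketch using only the Python standard library. The helper names (`parse_release`, `needs_upgrade`, `check_transformers`) and the `REQUIRED_MIN` value are hypothetical and not part of NeMo AutoModel; take the real minimum version from the model card of the model you plan to fine-tune.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(ver: str) -> tuple:
    """Keep the leading integer of each dot-separated piece: '4.57.1' -> (4, 57, 1).

    Parsing stops at the first piece with no leading digits, so
    '4.57.0.dev0' becomes (4, 57, 0). Pre-release tags are ignored.
    """
    parts = []
    for piece in ver.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_upgrade(installed: str, required: str) -> bool:
    """Return True when the installed release is older than the required one."""
    a, b = parse_release(installed), parse_release(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '4.57' and '4.57.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def check_transformers(required: str) -> str:
    """Describe whether the installed transformers package meets the minimum."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if needs_upgrade(installed, required):
        return f"transformers {installed} is older than {required}; consider upgrading"
    return f"transformers {installed} satisfies >= {required}"


if __name__ == "__main__":
    # Hypothetical placeholder: replace with the version your model requires.
    REQUIRED_MIN = "4.0.0"
    print(check_transformers(REQUIRED_MIN))
```

The comparison is numeric per component, so `4.9.0` is correctly treated as older than `4.10.0`, which a plain string comparison would get wrong.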
Supported Models
| Owner | Model | Modalities | Architecture |
|---|---|---|---|
| Qwen / Alibaba Cloud | | Text · Image · Audio · Video | |
| Microsoft | | Text · Image · Audio | |
Fine-Tuning
All supported omni models can be fine-tuned using full SFT or PEFT (LoRA) approaches. See the VLM Fine-Tuning Guide for general setup instructions.
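To give a rough sense of why one might choose LoRA over full SFT, the sketch below compares trainable parameter counts for a single linear layer. It relies only on the standard LoRA formulation, in which the frozen weight of shape `d_out x d_in` is augmented by two low-rank factors of shapes `d_out x r` and `r x d_in`. The layer size and rank are illustrative assumptions, not values taken from any supported model or from a NeMo AutoModel recipe, and biases are ignored.

```python
def full_sft_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Illustrative numbers only: a square 4096 x 4096 projection with rank 8.
    d_in, d_out, rank = 4096, 4096, 8
    full = full_sft_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full SFT: {full:,} trainable weights")
    print(f"LoRA r={rank}: {lora:,} trainable weights")
    print(f"ratio: {lora / full:.4%}")
```

For these assumed dimensions, LoRA trains 65,536 weights against 16,777,216 for full SFT, which is 1/256 of the layer. Actual savings depend on which layers are adapted and on the rank chosen.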