# Nemotron-3 Nano Omni
NVIDIA Nemotron-3 Nano Omni is a 30B-A3B MoE multimodal reasoning model that jointly processes image, video, audio, and text inputs. It pairs a MoE Mamba/attention hybrid language backbone with a RADIO vision tower (static-resolution image path or dynamic-resolution temporal video embedder) and a Parakeet sound encoder.
NeMo Megatron Bridge supports HF↔Megatron checkpoint conversion, full supervised fine-tuning (SFT), and LoRA parameter-efficient fine-tuning (PEFT) on image-text and audio-video-text datasets. The finetuned model can be re-exported to 🤗 Hugging Face format for downstream evaluation or deployment.
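Once a finetuned model has been re-exported to Hugging Face format, it can typically be loaded with the standard `transformers` auto classes. The sketch below is illustrative only: the model id `nvidia/Nemotron-3-Nano-Omni`, the processor inputs, and the generation settings are assumptions, not a confirmed API surface for this checkpoint — consult the exported model card for the exact usage.

```python
# Hypothetical sketch: loading a re-exported Nemotron-3 Nano Omni checkpoint.
# The model id and input layout are assumptions; adjust to your actual export.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "nvidia/Nemotron-3-Nano-Omni"  # assumed id; replace with your export path

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # custom multimodal code ships with the checkpoint
)

# Text-only prompt for brevity; image/video/audio tensors would be passed
# through the processor alongside the text.
inputs = processor(text="Describe the attached clip.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

The exact keyword arguments for image, video, and audio inputs depend on the processor class bundled with the exported checkpoint.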
> **Important**
> This is a Day-0 release. The model is supported on dedicated public branches:
| Repo | Branch |
|---|---|
| Megatron-Bridge | |
| Megatron-LM (submodule at ) | |
For the full setup, conversion, inference, training, evaluation, and LoRA merge / adapter export workflows, see `examples/models/vlm/nemotron_3_omni/README.md`.