Nemotron 3 Omni Training Recipe#

Multimodal post-training pipeline for Nemotron 3 Nano Omni, NVIDIA’s 30B-A3B hybrid mixture-of-experts model. Unlike Nano3 and Super3, Omni starts from a GA checkpoint and owns its stage-local container builds.

Read the blog?#

Quick navigation for developers landing here from the release blog:

You want…	Go to
Architecture deep-dive (Mamba+transformer, EVS, NemoClaw)	`architecture.md`
Inference engines, quantization, cloud platforms, providers	`inference.md`
Reproduce training	§Quick Start below
SFT details	`sft.md`
RL details	`rl.md` · `rl/data-prep.md`

Model Overview#

Nemotron 3 Nano Omni hybrid MoE architecture: Parakeet audio encoder → audio adaptor, C-RADIOv4-H vision encoder + 3D convolution + Efficient Video Sampling → vision adaptor, text tokenizer, all feeding the unified 30B-A3B LLM decoder

Property	Value
Architecture	Hybrid MoE — Mamba layers (sequence/memory efficiency) + transformer layers (reasoning), with a unified text decoder (details)
Total / active parameters	30B / 3B (A3B MoE)
Native modalities	Text, image, video, audio
Max context length	262K tokens
Training context schedule	16K → 49K → 262K (progressive scaling, see §Progressive context scaling)
Vision encoder	C-RADIOv4-H
Audio encoder	NVIDIA Parakeet (extended via Granary, Music Flamingo)
Video pipeline	3D convolutions + Efficient Video Sampling (EVS)
GA checkpoint	`nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16`
License	NVIDIA Nemotron Open Model License (enterprise-friendly, on-prem and any deployment)

For the full architectural deep-dive — including why EVS drives the throughput numbers, the perception-sub-agent framing, and the released training-data scale (127B / 124M / 20×25 / 11.4M) — see architecture.md.

Capabilities (released benchmarks)#

Best-in-class on MMlongbench-Doc and OCRBenchV2 (document intelligence)
Leading on WorldSense, DailyOmni, VoiceBench (video / audio understanding)
~9.2× greater effective system capacity on video reasoning, ~7.4× on multi-document workloads vs. comparable open omni models
“Highest throughput across every task” in MediaPerf; “lowest inference cost for video-level tagging”
See inference.md for engines, quantization paths, hardware support, and deployment

Use cases#

Designed as a multimodal perception sub-agent for agentic AI — finance, healthcare, scientific discovery, media/entertainment, ad-tech. Especially strong on long-horizon reasoning over complex documents and large video batches. The NemoClaw sandbox provides privacy-first video processing for compliance-bounded workloads.

Upstream training recipes#

This recipe folder is the cookbook view; upstream sources are:

Stage	Upstream guide	Branch root
SFT (Megatron-Bridge)	`nemotron_3_omni` README	`Megatron-Bridge` `nemotron_3_omni`
RL (NeMo-RL)	`nano-v3-omni` Nemotron 3 Nano Omni guide	`NeMo-RL` `nano-v3-omni`
Evaluation	Same Megatron-Bridge path	(above)
Image training data	—	`huggingface.co/datasets/nvidia/Nemotron-Image-Training-v3`
Long-document SDG	Long-document SDG guide	`src/nemotron/recipes/data/sdg/long-document/` (structure released; bodies port at upstream release)

Current Limitations#

The recipe structure, CLI, Dockerfiles, and configs are all in place, but some pieces still depend on upstream work or internal-only data:

Evaluation stage not yet included — coming soon. The omni3 CLI doesn’t have an eval subcommand today. Multimodal sanity checks against a trained checkpoint run via nemotron omni3 model eval; see SFT guide §Model lifecycle. A dedicated eval stage that compiles a benchmark task list and submits through nemo-evaluator-launcher will land in a follow-up release.
SFT default uses CORD-v2 (open); Valor32k is an opt-in. default.yaml pulls CORD-v2 from HuggingFace via the vlm-hf loader — no local shard building required. -c valor32k switches to the full audio-visual-language flow, but requires internal access to a prepared Valor32k-AVQA Energon dataset (the raw-shard builder is internal-only at release). data_prep.py validates the Energon path or emits a manifest for HF — it does not assemble shards.
Long-document SDG pipeline is a scaffold. src/nemotron/recipes/data/sdg/long-document/ has 9 numbered scripts with their argparse surfaces and a thorough README, but the script bodies raise NotImplementedError — port bodies from upstream at release time. See designs/long-document-sdg-pipeline.md.
Pinned to release branches. The SFT Dockerfile pulls github.com/NVIDIA-NeMo/Megatron-Bridge @ nemotron_3_omni (and github.com/NVIDIA/Megatron-LM @ nemotron_3_omni as a recursive submodule fetch); the RL flow uses github.com/NVIDIA/NeMo-RL @ nano-v3-omni. These are the active release branches for Nemotron 3 Omni — bump to a versioned tag (or main) once these changes merge upstream.

The linked stage guides (SFT / RL / Evaluate) call out each stage’s specific limitations.

Quick Start#

Prerequisites#

Slurm cluster with GPU nodes for SFT, RL, and evaluation — see Execution through NeMo-Run
Weights & Biases for experiment tracking and artifact lineage
Build-capable execution profile for nemotron omni3 build <stage> jobs
GA checkpoint: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 (BF16 weights). FP8 and NVFP4 quantization paths are supported by the inference engines listed in inference.md.

Installation#

git clone https://github.com/NVIDIA/nemotron
cd nemotron
uv sync

Configuration#

Create an env.toml profile with both training and build settings:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
run_partition = "interactive"
# Build-context overrides — used by `omni3 build sft|rl` only. The build
# is CPU-only and writes its OCI archive to `build_cache_dir`, which
# must be a cluster-visible (e.g. Lustre) path because the dispatcher
# mounts it into the build container.
build_partition = "cpu"
build_time = "02:00:00"
build_cache_dir = "/lustre/.../users/YOUR-USER/.cache/nemotron"
nodes = 1
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]

See How container builds authenticate for how the dispatcher reuses your existing ~/.config/enroot/.credentials to pull nvcr.io images inside the build container — no separate podman login required.

Run the Pipeline#

// Stage 0: SFT
$ uv run nemotron omni3 build sft --run YOUR-CLUSTER
$ uv run nemotron omni3 data prep sft --run YOUR-CLUSTER
$ uv run nemotron omni3 model import pretrain --run YOUR-CLUSTER \
    --hf-model nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
    --megatron-path /checkpoints/nemotron_omni
$ uv run nemotron omni3 sft --run YOUR-CLUSTER

// Stage 1.1: RL MPO
$ uv run nemotron omni3 build rl --run YOUR-CLUSTER
$ uv run nemotron omni3 data prep rl -c mpo --run YOUR-CLUSTER
$ uv run nemotron omni3 rl mpo --run YOUR-CLUSTER

// Stage 1.2: RL text
$ uv run nemotron omni3 data prep rl -c text --run YOUR-CLUSTER
$ uv run nemotron omni3 rl text --run YOUR-CLUSTER

// Stage 1.3: RL vision
$ uv run nemotron omni3 data prep rl -c vision --run YOUR-CLUSTER
$ uv run nemotron omni3 rl vision --run YOUR-CLUSTER

Note: omni3 build submits a short CPU-only container build job through NeMo-Run. The build produces ${build_cache_dir}/containers/omni3-{sft,rl}.sqsh (squashfs) which pyxis mounts directly at training time — no per-job enroot import.

Resources#

Model checkpoint: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Release blog: NVIDIA Nemotron 3 Nano Omni — multimodal agent reasoning
SFT recipe: Stage 0: SFT
RL recipe: Stage 1: RL
Image training data: nvidia/Nemotron-Image-Training-v3

Training Pipeline#

Stage	Name	Purpose	Guide
0	SFT	Fine-tune the GA checkpoint on Valor32k and related multimodal variants	sft.md
1	RL	Multi-stage Omni RL: MPO → text RL → vision RL	rl.md

An evaluation stage (nemotron omni3 eval) is on the roadmap; until it lands, run benchmarks via nemotron omni3 model eval (see SFT guide §Model lifecycle).

Pipeline Overview#

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart LR
    sdg["data/sdg/long-document<br/>synthetic document data"] --> prep["omni3 data prep sft"]
    ga["GA HF checkpoint"] --> import["omni3 model import pretrain"]
    prep --> sft["omni3 sft"]
    import --> sft
    sft --> mpo["omni3 rl mpo"]
    mpo --> text["omni3 rl text"]
    text --> vision["omni3 rl vision"]

    style sdg fill:#e3f2fd,stroke:#2196f3
    style sft fill:#f3e5f5,stroke:#9c27b0
    style mpo fill:#e8f5e9,stroke:#4caf50
    style text fill:#e8f5e9,stroke:#4caf50
    style vision fill:#e8f5e9,stroke:#4caf50

The upstream synthetic-data pipeline lives outside the family tree under src/nemotron/recipes/data/sdg/long-document/. That keeps the recipe split explicit:

src/nemotron/recipes/data/curation/ — filter, dedup, and curate existing corpora (for example Nemotron-CC)
src/nemotron/recipes/data/sdg/ — generate new datasets, including the long-document SDG pipeline consumed by Omni SFT
src/nemotron/recipes/omni3/ — family-specific training, RL, and evaluation stages

Stage Summaries#

Stage 0: SFT#

Omni SFT owns its own Dockerfile, data_prep.py, and train.py (built via the nemotron omni3 build sft dispatcher). The stage ports the Valor32k flow plus LoRA/PEFT variants for image-text and audio-text tuning. The released open-data configs target shorter sequence lengths; the upstream 16K → 49K → 262K progressive schedule is documented in architecture.md.

→ SFT Guide

Stage 1: RL#

The RL stack uses one shared NeMo-RL container and three sub-stages, mirroring the upstream nano-v3-omni flow:

MPO — multimodal preference optimization on the public MMPR dataset (~83K-row preference triples per subset)
Text RL — GRPO continuation of alignment on nvidia/Nemotron-3-Nano-RL-Training-Blend
Vision RL — GRPO on MMPR-Tiny (data prep ready; training launcher pending upstream)

The 20 RL datasets / 25 environments / ~2.3M rollouts referenced in the release blog compose the full upstream alignment corpus; this recipe surfaces the public/open-source subset.

→ RL Guide · RL Data Prep

Execution Options#

All Omni commands support NeMo-Run execution modes:

Option	Behavior	Use Case
`--run <profile>`	Attached—submits job and streams logs	Interactive development
`--batch <profile>`	Detached—submits and exits immediately	Long-running jobs
`--dry-run`	Preview resolved config or build plan	Validation

CLI Reference#

$ uv run nemotron omni3 --help
Usage: nemotron omni3 [OPTIONS] COMMAND [ARGS]...

 Omni3 training recipe

╭─ Infrastructure ─────────────────────────────────────────────────────────╮
│ build      Build an Omni3 stage container via podman on the cluster.     │
╰───────────────────────────────────────────────────────────────────────────╯
╭─ Commands ────────────────────────────────────────────────────────────────╮
│ data       Data preparation commands                                      │
│ model      Model import/export/eval commands                              │
│ rl         Reinforcement learning sub-stages                              │
╰───────────────────────────────────────────────────────────────────────────╯
╭─ Training Stages ─────────────────────────────────────────────────────────╮
│ sft        Run omni3 supervised fine-tuning.                              │
╰───────────────────────────────────────────────────────────────────────────╯