Stage 1: Reinforcement Learning (RL)#

Omni RL continues the multimodal post-training pipeline with NeMo-RL using one shared container and three explicit sub-stages. RL aligns the model’s perception-sub-agent surface — preference quality on visual reasoning, factual grounding for downstream agent calls, and ASR fidelity. The full upstream alignment corpus runs ~2.3M rollouts across 20 RL datasets / 25 environments covering visual grounding, charts, vision-critical STEM, video understanding, and ASR (per the release blog). This recipe folder surfaces the 3 of 25 environments that have public data: MMPR (preference), the Nano RL training blend (text), and MMPR-Tiny (vision); the remaining 22 environments use internal or third-party data and aren’t included.

Shared container: All RL sub-stages use src/nemotron/recipes/omni3/stage1_rl/Dockerfile. The nemotron omni3 build rl dispatcher turns it into omni3-rl.sqsh under your build_cache_dir.

Current notes (also summarized in the family README):

All three sub-stages have working launchers that mirror the upstream NeMo-RL nano-v3-omni flow (scripts/nanov3_mpo.sh, scripts/nanov3_text_rl.sh, scripts/nanov3_vision_rl.sh).

stage1_rl/Dockerfile mirrors NeMo-RL’s release Dockerfile — clones NVIDIA/NeMo-RL @ nano-v3-omni recursively (carrying the omni vllm fork as a 3rdparty/vllm submodule) and runs the same BUILD_CUSTOM_VLLM=1 + uv sync flow as the upstream docker/Dockerfile. omni3 build rl produces omni3-rl.sqsh.

nano-v3-omni is the active release branch for Nemotron 3 Omni; bump to a versioned tag (or main) once these changes merge upstream.

RL Pipeline Overview#

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
    sft["SFT checkpoint"] --> mpo["MPO"]
    mpo --> text["Text RL"]
    text --> vision["Vision RL"]
    vision --> eval["Evaluation"]

    style sft fill:#f3e5f5,stroke:#9c27b0
    style mpo fill:#e1f5fe,stroke:#2196f3
    style text fill:#e8f5e9,stroke:#4caf50
    style vision fill:#fff3e0,stroke:#ff9800

The shared RL tree contains:

Path	Purpose
`stage1_rl/Dockerfile`	Shared NeMo-RL Omni image
`stage1_rl/data_prep.py`	Dispatcher for `-c mpo` / `-c text` / `-c vision`
`stage1_rl/stage1_mpo/`	MPO launcher, config, and data prep
`stage1_rl/stage2_text_rl/`	Text RL launcher, config, and data prep
`stage1_rl/stage3_vision_rl/`	Vision RL launcher stub, config, and data prep

Sub-Stages#

Sub-stage	Command	Default input model	Data prep config	Notes
MPO	`nemotron omni3 rl mpo`	`omni3-sft-model:latest`	`-c mpo`	Public MMPR preference optimization
Text RL	`nemotron omni3 rl text`	`omni3-rl-mpo-model:latest`	`-c text`	Continues alignment on text-only RL data
Vision RL	`nemotron omni3 rl vision`	`omni3-rl-text-model:latest`	`-c vision`	Data prep is wired; training launcher is still upstream-stubbed

Build the Shared RL Container#

uv run nemotron omni3 build rl --run YOUR-CLUSTER

Canonical archive path:

/home/${oc.env:USER}/.cache/nemotron/containers/omni3-rl.sqsh

For local iteration:

cd src/nemotron/recipes/omni3/stage1_rl
docker build -t nemotron/omni3-rl:latest -f Dockerfile .
# or
podman build -t nemotron/omni3-rl:latest -f Dockerfile .

Quick Start#

// 1. Build the shared RL container
$ uv run nemotron omni3 build rl --run YOUR-CLUSTER

// 2. MPO
$ uv run nemotron omni3 data prep rl -c mpo --run YOUR-CLUSTER
$ uv run nemotron omni3 rl mpo --run YOUR-CLUSTER

// 3. Text RL
$ uv run nemotron omni3 data prep rl -c text --run YOUR-CLUSTER
$ uv run nemotron omni3 rl text --run YOUR-CLUSTER

// 4. Vision RL
$ uv run nemotron omni3 data prep rl -c vision --run YOUR-CLUSTER
$ uv run nemotron omni3 rl vision --run YOUR-CLUSTER

Data Preparation#

Use one CLI command with config variants:

uv run nemotron omni3 data prep rl -c mpo --run YOUR-CLUSTER
uv run nemotron omni3 data prep rl -c text --run YOUR-CLUSTER
uv run nemotron omni3 data prep rl -c vision --run YOUR-CLUSTER

The configs under stage1_rl/config/data_prep/ map to:

Config	Source	Output
`mpo.yaml`	`hf://OpenGVLab/MMPR` (auto-downloads via `source_uri`)	`MMPR-v1.2/` cache + rewritten `meta_public.json`
`text.yaml`	`hf://nvidia/Nemotron-3-Nano-RL-Training-Blend` (resolved through Nano3’s HF placeholder pipeline)	per-blend train/val JSONL with `responses_create_params` schema
`vision.yaml`	`hf://OpenGVLab/MMPR-Tiny` (auto-downloads via `source_uri`)	`MMPR-Tiny/` cache + parquet + preview

When input_dir is empty/incomplete and source_uri is set, the dispatcher snapshot-downloads the HF repo before the prep stage runs. Pre-stage data manually (or set OMNI3_MMPR_PUBLIC_RAW / OMNI3_MMPR_TINY_RAW) to skip the download.

See data-prep.md for the full data-prep guide: auto-download semantics, helper scripts under scripts/, output layouts, artifact registration, and parallel-submission notes.

Stage-Specific Notes#

MPO#

MPO uses the SFT checkpoint as input and launches bash scripts/omni/step_1_nanov3_mpo.sh inside /opt/nemo-rl-omni.

Text RL#

Text RL consumes the MPO checkpoint and exposes the key launcher overrides described in the design doc:

CONTEXT_PARALLEL_SIZE
TRAIN_GLOBAL_BATCH_SIZE
NUM_PROMPTS_PER_STEP
NUM_GENERATIONS_PER_PROMPT
WANDB_PROJECT

Vision RL#

The vision stage is intentionally honest about its current status:

data prep is implemented
the command exists and resolves the expected config
train.py raises NotImplementedError until the upstream launcher lands

That keeps the pipeline wiring visible without pretending the missing upstream piece is available yet.

Infrastructure#

This stage uses:

Component	Role	Documentation
NeMo-RL	RL trainers and launch scripts	Docs
Ray	Distributed execution for RL and data prep	Docs
Megatron-Core	Distributed training backend	GitHub

Next Steps#

After RL completes, run benchmarks via nemotron omni3 model eval (a dedicated nemotron omni3 eval stage is on the roadmap).

Upstream#

This stage is the cookbook view of the upstream NeMo-RL omni RL flow. For the canonical end-to-end walkthrough (build, data prep, MPO/text/vision launchers, .env setup), see the NeMo-RL nano-v3-omni Nemotron 3 Nano Omni guide. The Dockerfile in this stage pins NVIDIA/NeMo-RL @ nano-v3-omni, which carries the omni vllm fork at aroshanghias-nvd/vllm nano-v3-vl as a 3rdparty/vllm submodule. Bump the NeMo-RL branch when it merges to a versioned tag.

Reference#

Recipe source: src/nemotron/recipes/omni3/stage1_rl/ (README)
Upstream: NeMo-RL nano-v3-omni guide
RL data prep deep-dive
Architecture deep-dive
Inference & deployment
Back to Overview
Execution through NeMo-Run