Stage 1: Reinforcement Learning (RL)
Omni RL continues the multimodal post-training pipeline with NeMo-RL, using one shared container and three explicit sub-stages. RL aligns the model’s perception-sub-agent surface: preference quality on visual reasoning, factual grounding for downstream agent calls, and ASR fidelity. Per the release blog, the full upstream alignment corpus runs ~2.3M rollouts across 20 RL datasets and 25 environments covering visual grounding, charts, vision-critical STEM, video understanding, and ASR. This recipe folder surfaces the 3 of those 25 environments that have public data: MMPR (preference), the Nano RL training blend (text), and MMPR-Tiny (vision). The remaining 22 environments use internal or third-party data and are not included.
Shared container: All RL sub-stages use
src/nemotron/recipes/omni3/stage1_rl/Dockerfile. The nemotron omni3 build rl dispatcher turns it into omni3-rl.sqsh under your build_cache_dir.
Current notes (also summarized in the family README):
All three sub-stages have working launchers that mirror the upstream NeMo-RL
nano-v3-omni flow (scripts/nanov3_mpo.sh, scripts/nanov3_text_rl.sh, scripts/nanov3_vision_rl.sh).
stage1_rl/Dockerfile mirrors NeMo-RL’s release Dockerfile: it clones NVIDIA/NeMo-RL @ nano-v3-omni recursively (carrying the omni vllm fork as a 3rdparty/vllm submodule) and runs the same BUILD_CUSTOM_VLLM=1 + uv sync flow as the upstream docker/Dockerfile. omni3 build rl produces omni3-rl.sqsh.
nano-v3-omni is the active release branch for Nemotron 3 Omni; bump to a versioned tag (or main) once these changes merge upstream.
RL Pipeline Overview
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
sft["SFT checkpoint"] --> mpo["MPO"]
mpo --> text["Text RL"]
text --> vision["Vision RL"]
vision --> eval["Evaluation"]
style sft fill:#f3e5f5,stroke:#9c27b0
style mpo fill:#e1f5fe,stroke:#2196f3
style text fill:#e8f5e9,stroke:#4caf50
style vision fill:#fff3e0,stroke:#ff9800
The shared RL tree contains:
| Path | Purpose |
|---|---|
| | Shared NeMo-RL Omni image |
| | Dispatcher for |
| | MPO launcher, config, and data prep |
| | Text RL launcher, config, and data prep |
| | Vision RL launcher stub, config, and data prep |
Sub-Stages
| Sub-stage | Command | Default input model | Data prep config | Notes |
|---|---|---|---|---|
| MPO | | | | Public MMPR preference optimization |
| Text RL | | | | Continues alignment on text-only RL data |
| Vision RL | | | | Data prep is wired; training launcher is still upstream-stubbed |
Quick Start
// 1. Build the shared RL container
$ uv run nemotron omni3 build rl --run YOUR-CLUSTER
// 2. MPO
$ uv run nemotron omni3 data prep rl -c mpo --run YOUR-CLUSTER
$ uv run nemotron omni3 rl mpo --run YOUR-CLUSTER
// 3. Text RL
$ uv run nemotron omni3 data prep rl -c text --run YOUR-CLUSTER
$ uv run nemotron omni3 rl text --run YOUR-CLUSTER
// 4. Vision RL
$ uv run nemotron omni3 data prep rl -c vision --run YOUR-CLUSTER
$ uv run nemotron omni3 rl vision --run YOUR-CLUSTER
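The sequence above can also be written down as data, which is convenient for a dry run or a small wrapper script. The following is a minimal Python sketch and not part of the recipe: the helper name plan_rl_commands is hypothetical, and it only reassembles the command lines already shown above. Per the Vision RL notes in this document, the final training command is expected to fail until the upstream launcher lands.

```python
from typing import List

# Sub-stage order as listed in the Quick Start above.
SUB_STAGES = ("mpo", "text", "vision")


def plan_rl_commands(cluster: str) -> List[List[str]]:
    """Return the Quick Start commands, in order, as argv lists.

    Hypothetical helper for illustration; it does not execute anything.
    """
    base = ["uv", "run", "nemotron", "omni3"]
    run = ["--run", cluster]
    commands = [base + ["build", "rl"] + run]  # 1. shared RL container
    for stage in SUB_STAGES:
        # Data prep first, then the training launcher for the same sub-stage.
        commands.append(base + ["data", "prep", "rl", "-c", stage] + run)
        commands.append(base + ["rl", stage] + run)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in plan_rl_commands("YOUR-CLUSTER"):
        print(" ".join(argv))
```

Keeping the commands as argv lists rather than one shell string makes it straightforward to pass each entry to subprocess.run later without quoting issues.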
Data Preparation
Use one CLI command with config variants:
uv run nemotron omni3 data prep rl -c mpo --run YOUR-CLUSTER
uv run nemotron omni3 data prep rl -c text --run YOUR-CLUSTER
uv run nemotron omni3 data prep rl -c vision --run YOUR-CLUSTER
The configs under stage1_rl/config/data_prep/ map to:
| Config | Source | Output |
|---|---|---|
| | | |
| | | per-blend train/val JSONL with |
| | | |
When input_dir is empty/incomplete and source_uri is set, the
dispatcher snapshot-downloads the HF repo before the prep stage runs.
Pre-stage data manually (or set OMNI3_MMPR_PUBLIC_RAW /
OMNI3_MMPR_TINY_RAW) to skip the download.
See data-prep.md for the full data-prep guide: auto-download semantics, helper scripts under
scripts/, output layouts, artifact registration, and parallel-submission notes.
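The auto-download rule described above (download only when input_dir is empty or incomplete and source_uri is set) can be sketched as a small predicate. This is an illustrative approximation, not the dispatcher's code: the function name needs_download is hypothetical, and "incomplete" is simplified here to "missing or containing no entries", whereas the real check may be stricter.

```python
from pathlib import Path
from typing import Optional


def needs_download(input_dir: Path, source_uri: Optional[str]) -> bool:
    """Decide whether raw data should be downloaded before data prep.

    Assumption: an input directory counts as "empty/incomplete" when it
    does not exist or has no entries. The actual dispatcher may also
    validate specific files.
    """
    if not source_uri:
        return False  # no source configured, so there is nothing to fetch
    if not input_dir.is_dir():
        return True  # directory is missing entirely
    return not any(input_dir.iterdir())  # True only when the directory is empty
```

Under this reading, pre-staging data manually makes the predicate return False, which matches the documented way to skip the download.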
Stage-Specific Notes
MPO
MPO uses the SFT checkpoint as input and launches bash scripts/omni/step_1_nanov3_mpo.sh inside /opt/nemo-rl-omni.
Text RL
Text RL consumes the MPO checkpoint and exposes the key launcher overrides described in the design doc:
CONTEXT_PARALLEL_SIZE
TRAIN_GLOBAL_BATCH_SIZE
NUM_PROMPTS_PER_STEP
NUM_GENERATIONS_PER_PROMPT
WANDB_PROJECT
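How these overrides reach the launcher is not spelled out here. The sketch below assumes they are read as environment variables by the launcher script; that is an assumption, and every value shown is a placeholder rather than a recipe default. It also illustrates the arithmetic implied by two of the names: the number of rollouts generated per step is prompts per step multiplied by generations per prompt.

```python
import os
from typing import Dict, Mapping, Optional

# Placeholder values for illustration only; these are NOT the recipe's defaults.
EXAMPLE_OVERRIDES = {
    "CONTEXT_PARALLEL_SIZE": 1,
    "TRAIN_GLOBAL_BATCH_SIZE": 64,
    "NUM_PROMPTS_PER_STEP": 8,
    "NUM_GENERATIONS_PER_PROMPT": 8,
    "WANDB_PROJECT": "my-project",
}


def launcher_env(
    overrides: Mapping[str, object],
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Merge overrides into an environment mapping (hypothetical helper).

    Assumes the launcher reads its overrides from environment variables.
    """
    env = dict(os.environ if base is None else base)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def rollouts_per_step(overrides: Mapping[str, object]) -> int:
    """Rollouts generated per step: prompts per step times generations per prompt."""
    return int(overrides["NUM_PROMPTS_PER_STEP"]) * int(
        overrides["NUM_GENERATIONS_PER_PROMPT"]
    )
```

With the placeholder values above, each step would generate 8 × 8 = 64 rollouts. The resulting mapping could be passed as the env argument of subprocess.run when invoking a launcher.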
Vision RL
The vision stage is only partially implemented. Its current status:
data prep is implemented
the command exists and resolves the expected config
train.py raises NotImplementedError until the upstream launcher lands
This keeps the pipeline wiring visible without implying that the missing upstream piece is already available.
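The stub pattern described above can be sketched as follows. This is an illustration of the general technique, not the contents of the recipe's train.py: the function names and the error message are hypothetical.

```python
def main() -> None:
    """Placeholder entry point for a launcher that is not available yet."""
    # Failing loudly keeps the command discoverable while making the gap explicit.
    raise NotImplementedError(
        "Vision RL training is not available yet: the upstream launcher has not landed."
    )


def try_launch() -> str:
    """Call the stub and report the outcome instead of crashing the caller."""
    try:
        main()
    except NotImplementedError as exc:
        return f"skipped: {exc}"
    return "launched"
```

A wrapper like try_launch lets an end-to-end script record the vision stage as skipped and continue, rather than aborting on the stub.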
Infrastructure
This stage uses:
Next Steps
After RL completes, run benchmarks via nemotron omni3 model eval (a dedicated nemotron omni3 eval stage is on the roadmap).
Upstream
This stage is the cookbook view of the upstream NeMo-RL omni RL flow.
For the canonical end-to-end walkthrough (build, data prep,
MPO/text/vision launchers, .env setup), see the NeMo-RL
nano-v3-omni Nemotron 3 Nano Omni guide.
The Dockerfile in this stage pins NVIDIA/NeMo-RL @ nano-v3-omni,
which carries the omni vllm fork at
aroshanghias-nvd/vllm nano-v3-vl
as a 3rdparty/vllm submodule. Once the NeMo-RL branch merges, bump the pin
to a versioned tag.