Multi-Environment RLVR (Stages 1.1–1.3)#

Multi-environment Reinforcement Learning from Verifiable Rewards (RLVR) is the primary RL stage for Nemotron 3 Super. It trains on 21 environments and 37 datasets simultaneously, covering math, code, STEM, safety, chat, instruction following, long context, puzzles, and agentic tasks.

Training on all environments simultaneously yields stable gains, whereas single-environment training leads to severe regressions on other benchmarks.

The RLVR stage consists of three sub-stages with different data blends. All three use the same config (stage1_rlvr.yaml); only the data paths differ.

Data#

Domain	Description
Math	Competitive math problems, trained with and without a Python execution tool. Includes a formal proof verification environment.
Code	Competition code data.
STEM	Scientific problems including newly curated difficult problems.
Instruction Following	Standard IF data plus a multi-challenge style dataset where the agent follows complex instructions scored against a rubric.
Safety	Two environments: (1) mitigating overrefusals on safety-related prompts, (2) robustness against jailbreak attacks using PAIR-generated adversarial prompts.
Long Context	Long context reasoning environment.
Agentic Tool Use	Conversational tool use and terminal use environments.
Reasoning Gym	Diverse reasoning tasks from Reasoning Gym.

Data Curriculum: Prompts where the SFT model consistently provides correct answers are filtered out. Remaining samples are sorted via a difficulty-based curriculum.
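The filtering and ordering described above can be sketched as follows. This is a minimal illustration, not the recipe's implementation: the `Prompt` class, the `sft_pass_rate` field, the 0.9 cutoff, and the easy-to-hard ordering are all assumptions made for the example, since the source only states that consistently solved prompts are dropped and the rest are sorted by difficulty.

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    prompt_id: str
    # Hypothetical field: fraction of SFT-model samples judged correct.
    sft_pass_rate: float


def build_curriculum(prompts, max_pass_rate=0.9):
    """Drop prompts the SFT model already solves, then order the rest.

    Both the cutoff and the sort direction (easier prompts first) are
    assumptions made for this sketch.
    """
    kept = [p for p in prompts if p.sft_pass_rate < max_pass_rate]
    # A higher pass rate means an easier prompt, so sort in descending order.
    return sorted(kept, key=lambda p: p.sft_pass_rate, reverse=True)


pool = [Prompt("a", 1.0), Prompt("b", 0.25), Prompt("c", 0.75)]
curriculum = build_curriculum(pool)
print([p.prompt_id for p in curriculum])
```

Using the SFT pass rate as the difficulty signal keeps the filter and the sort on one measurement, so no separate difficulty annotation is needed.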

Low-Effort Reasoning#

A subset of prompts is converted to low-effort mode. For each low-effort prompt, the reward accounts for both correctness and token count, encouraging efficient reasoning.

Phase	Scope	Proportion
Early	Math, STEM QA, Competitive Coding	2% of all RL prompts
Late	Math, STEM QA only	1% of RL prompts
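One way a reward could combine correctness with token count is shown below. The source does not give the actual formula, so the function name, the `token_budget` and `length_weight` parameters, and the linear penalty are hypothetical choices for illustration only.

```python
def low_effort_reward(is_correct, num_tokens, token_budget=4096, length_weight=0.5):
    """Hypothetical low-effort reward: correctness minus a length penalty.

    Incorrect answers get 0. Correct answers start at 1 and lose up to
    `length_weight` as the response approaches `token_budget` tokens.
    """
    if not is_correct:
        return 0.0
    # Fraction of the budget used, capped at 1 so the penalty is bounded.
    used = min(num_tokens / token_budget, 1.0)
    return 1.0 - length_weight * used


print(low_effort_reward(True, 2048))   # 0.75
print(low_effort_reward(True, 4096))   # 0.5
print(low_effort_reward(False, 100))   # 0.0
```

Keeping the penalty bounded below the correctness reward means a long correct answer still scores higher than any incorrect one, so the length term cannot teach the policy to trade accuracy for brevity.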

Algorithm#

The stage uses asynchronous GRPO, in which training and inference are decoupled across separate GPU devices:

Inference workers continuously generate trajectories stored in a rollout buffer
Batches are sent to the training engine once enough trajectories are collected
Updated weights are pushed to inference workers as soon as a new model version is available
In-flight weight updates: weights can be pushed mid-rollout without waiting for ongoing rollouts to finish
KV cache is NOT recomputed after weight updates
Policy lag is limited to at most one step behind the latest model version
Importance sampling ratio masking stabilizes training under the training-inference mismatch
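The last point can be illustrated with a small sketch. Because rollouts may be produced by slightly stale weights and by a different inference engine, the probability the trainer assigns to a sampled token can differ from the probability under which it was sampled. One common form of ratio masking drops tokens whose importance ratio falls outside a trust interval. The bounds (0.5 and 2.0), the function names, and the plain-Python loss below are assumptions for illustration, not the NeMo RL implementation.

```python
import math


def masked_token_weights(train_logprobs, rollout_logprobs, low=0.5, high=2.0):
    """Return per-token importance ratios and a 0/1 mask.

    ratio = pi_train(token) / pi_rollout(token). Tokens whose ratio lies
    outside [low, high] are masked out (hypothetical bounds).
    """
    ratios, mask = [], []
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(lp_train - lp_rollout)
        ratios.append(ratio)
        mask.append(1.0 if low <= ratio <= high else 0.0)
    return ratios, mask


def masked_policy_loss(train_logprobs, rollout_logprobs, advantages, low=0.5, high=2.0):
    """Importance-weighted policy-gradient loss averaged over unmasked tokens."""
    ratios, mask = masked_token_weights(train_logprobs, rollout_logprobs, low, high)
    kept = sum(mask)
    if kept == 0:
        return 0.0  # every token was masked; contribute nothing to the update
    total = sum(-r * a * m for r, a, m in zip(ratios, advantages, mask))
    return total / kept


train = [math.log(0.5), math.log(0.5), math.log(0.9)]
rollout = [math.log(0.5), math.log(0.1), math.log(0.5)]
_, mask = masked_token_weights(train, rollout)
print(mask)  # [1.0, 0.0, 1.0]
```

In the usage lines, the second token has a ratio of about 5, so it is excluded from the loss, while the other two tokens contribute normally.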

Configuration#

Parameter	Value
Nodes	109 (872 GPUs)
Generation nodes	72 (colocated=false)
NeMo Gym GPU nodes	5
Prompts/step	256
Generations/prompt	16
Batch size	4,096
Max sequence length	49,152 → 65,536 (increased during training)
TP / CP	4 / 8
Learning rate	3e-6
KL penalty	0
Overlong filtering	false

Parallelism#

Parallelism	Value	Config Key
Tensor (TP)	4	`policy.megatron_cfg.tensor_model_parallel_size`
Pipeline (PP)	1	`policy.megatron_cfg.pipeline_model_parallel_size`
Context (CP)	8	`policy.megatron_cfg.context_parallel_size`
Expert (EP)	8	`policy.megatron_cfg.expert_model_parallel_size`
Sequence (SP)	Yes	`policy.megatron_cfg.sequence_parallel`
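The dotted config keys in the table map onto a nested config tree. The sketch below builds that tree from the keys and values listed above and computes how many GPUs one model replica spans. The `set_dotted` helper is hypothetical and written for this example; how the real launcher applies overrides is not shown in the source.

```python
def set_dotted(cfg, dotted_key, value):
    """Set cfg['a']['b']['c'] = value given the key 'a.b.c' (hypothetical helper)."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Keys and values as listed in the parallelism table.
overrides = {
    "policy.megatron_cfg.tensor_model_parallel_size": 4,
    "policy.megatron_cfg.pipeline_model_parallel_size": 1,
    "policy.megatron_cfg.context_parallel_size": 8,
    "policy.megatron_cfg.expert_model_parallel_size": 8,
    "policy.megatron_cfg.sequence_parallel": True,
}

cfg = {}
for key, value in overrides.items():
    set_dotted(cfg, key, value)

mc = cfg["policy"]["megatron_cfg"]
# In Megatron-Core, the world size factors as TP x PP x CP x DP, so one
# replica of the dense layers spans TP * PP * CP GPUs.
gpus_per_replica = (
    mc["tensor_model_parallel_size"]
    * mc["pipeline_model_parallel_size"]
    * mc["context_parallel_size"]
)
print(gpus_per_replica)  # 32
```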

Config Files#

stage1_rlvr/config/default.yaml — Full-scale 109-node config
stage1_rlvr/config/small.yaml — Reduced 21-node variant for testing

Infrastructure#

All RLVR experiments use an integrated NeMo RL + NeMo Gym infrastructure:

NeMo RL acts as the RL training loop controller using Megatron-Core for model training
NeMo Gym handles rollout environments using three server types: agents, models (vLLM), and resources (verifiers)
Ray orchestrates resource management on SLURM: training workers, vLLM generation workers, Gym environments, and judge models all run on a single Ray cluster

Judge Models#

Model	Purpose
Qwen3-235B-A22B	Equivalence/instruction-following judging
Nemotron-Content-Safety-Reasoning-4B	Safety evaluation
Qwen3-Nemotron-235B-A22B-GenRM	GenRM pairwise comparison

Resiliency at 1K GPU Scale#

At 1K GPU scale, intermittent failures from hardware and software issues required the following mitigations:

Parallelized initialization with prefetching of virtual environments and binaries
Careful port management to avoid TOCTOU race conditions between Ray control plane, vLLM workers, TCP rendezvous, and NeMo Gym servers
Caching in upstream repos (vLLM, flashinfer) to reduce startup time
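The port-management item refers to a time-of-check-to-time-of-use (TOCTOU) race: a launcher asks the OS for a free port, closes the probe socket, and passes the port number to a server that binds later. Another process can claim the port in between. A generic way to avoid this is to keep the socket bound from the moment the port is chosen, as sketched below. The `reserve_port` helper is an illustration of the general technique, not the mechanism used in NeMo RL or NeMo Gym.

```python
import socket


def reserve_port(host="127.0.0.1"):
    """Bind to an OS-assigned port and keep the socket open.

    Returning the live socket, rather than only the port number, means no
    other process can take the port before the server starts using it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))        # port 0 asks the OS to pick a free port
    sock.listen()               # the socket is now ready to be handed to a server
    port = sock.getsockname()[1]
    return sock, port


sock, port = reserve_port()
try:
    # A second bind to the same port fails while the reservation is held.
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind(("127.0.0.1", port))
        print("port was free (unexpected)")
    except OSError:
        print("port is held by the reservation")
    finally:
        probe.close()
finally:
    sock.close()
```

When many services start concurrently on one node, each holding its own reserved socket, two components cannot be assigned the same port.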

Usage#

Using nemotron CLI#

# Stage 1.1–1.3: RLVR (uses base container)
uv run nemotron super3 rl rlvr -c rlvr1 --run YOUR-CLUSTER
uv run nemotron super3 rl rlvr -c rlvr2 --run YOUR-CLUSTER
uv run nemotron super3 rl rlvr -c rlvr3 --run YOUR-CLUSTER

The YOUR-CLUSTER value passed to --run names a profile defined in your env.toml file. See the env.toml setup guide for details.

Using super_launch.sh#

Stage 1.1 — RLVR 1 (109 nodes)#

EXP_NAME=stage1.1-rlvr1 \
CONFIG_PATH=examples/configs/super/stage1_rlvr.yaml \
MODEL_PATH=/path/to/sft_checkpoint \
TRAIN_PATH=$DATA_DIR/rlvr1/train-split.jsonl \
VAL_PATH=$DATA_DIR/rlvr1/val-split.jsonl \
CONTAINER=nvcr.io/nvidia/nemo-rl:v0.5.0.nemotron_3_super \
SANDBOX_CONTAINER=$SANDBOX_CONTAINER \
PERSISTENT_CACHE=$PERSISTENT_CACHE \
EXTRA_MOUNTS=$EXTRA_MOUNTS \
SLURM_PARTITION=$SLURM_PARTITION \
SLURM_ACCOUNT=$SLURM_ACCOUNT \
bash super_launch.sh

Stage 1.2 — RLVR 2 (109 nodes)#

EXP_NAME=stage1.2-rlvr2 \
CONFIG_PATH=examples/configs/super/stage1_rlvr.yaml \
MODEL_PATH=/path/to/rlvr1_checkpoint \
TRAIN_PATH=$DATA_DIR/rlvr2/train-split.jsonl \
VAL_PATH=$DATA_DIR/rlvr2/val-split.jsonl \
CONTAINER=nvcr.io/nvidia/nemo-rl:v0.5.0.nemotron_3_super \
SANDBOX_CONTAINER=$SANDBOX_CONTAINER \
PERSISTENT_CACHE=$PERSISTENT_CACHE \
EXTRA_MOUNTS=$EXTRA_MOUNTS \
SLURM_PARTITION=$SLURM_PARTITION \
SLURM_ACCOUNT=$SLURM_ACCOUNT \
bash super_launch.sh

Stage 1.3 — RLVR 3 (109 nodes)#

EXP_NAME=stage1.3-rlvr3 \
CONFIG_PATH=examples/configs/super/stage1_rlvr.yaml \
MODEL_PATH=/path/to/rlvr2_checkpoint \
TRAIN_PATH=$DATA_DIR/rlvr3/train-split.jsonl \
VAL_PATH=$DATA_DIR/rlvr3/val-split.jsonl \
CONTAINER=nvcr.io/nvidia/nemo-rl:v0.5.0.nemotron_3_super \
SANDBOX_CONTAINER=$SANDBOX_CONTAINER \
PERSISTENT_CACHE=$PERSISTENT_CACHE \
EXTRA_MOUNTS=$EXTRA_MOUNTS \
SLURM_PARTITION=$SLURM_PARTITION \
SLURM_ACCOUNT=$SLURM_ACCOUNT \
bash super_launch.sh
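The three launches above differ only in EXP_NAME, MODEL_PATH, and the data split directory, and each stage starts from the previous stage's checkpoint. The sketch below generates the stage-specific variables to make that chaining explicit. The `stage_env` helper is hypothetical, and the checkpoint paths are the same placeholders used in the commands above, not real locations.

```python
STAGES = ["rlvr1", "rlvr2", "rlvr3"]


def stage_env(index, data_dir, sft_checkpoint="/path/to/sft_checkpoint"):
    """Return the stage-specific variables for sub-stage `index` (0-based).

    Stage 1.1 starts from the SFT checkpoint; later stages start from the
    checkpoint of the stage before them (placeholder paths).
    """
    name = STAGES[index]
    if index == 0:
        model_path = sft_checkpoint
    else:
        model_path = f"/path/to/{STAGES[index - 1]}_checkpoint"
    return {
        "EXP_NAME": f"stage1.{index + 1}-{name}",
        "CONFIG_PATH": "examples/configs/super/stage1_rlvr.yaml",
        "MODEL_PATH": model_path,
        "TRAIN_PATH": f"{data_dir}/{name}/train-split.jsonl",
        "VAL_PATH": f"{data_dir}/{name}/val-split.jsonl",
    }


for i in range(len(STAGES)):
    env = stage_env(i, "$DATA_DIR")
    print(env["EXP_NAME"], env["MODEL_PATH"], env["TRAIN_PATH"])
```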

See the upstream training guide for full details on environment variables.

Recipe Source: src/nemotron/recipes/super3/stage2_rl/stage1_rlvr/