Stage 4: Evaluation#
Evaluate trained Nemotron 3 Super models against standard benchmarks using NeMo Evaluator.
The evaluation recipe here covers a subset of the benchmarks used in the full tech report — enough to validate training quality during development. For the complete benchmark suite and reproduction instructions, see the Nemotron 3 Super reproducibility doc in the NeMo Evaluator repo.
Different execution pattern: Unlike training stages that submit Python scripts via NeMo-Run, evaluation compiles the YAML config and passes it directly to nemo-evaluator-launcher. There is no recipe script—the CLI handles config compilation and artifact resolution, then delegates to the launcher.
How Evaluation Works#
The eval command resolves model artifacts from W&B lineage and uses NeMo Framework’s Ray-based in-framework deployment. It defaults to evaluating the latest RL stage output.
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
subgraph cli["Nemotron CLI"]
direction LR
yaml["YAML Config"] --> compile["Config Compilation"]
compile --> save["Save job.yaml + eval.yaml"]
end
subgraph launcher["nemo-evaluator-launcher"]
direction LR
deploy["Deploy Model<br/>(NeMo Ray)"] --> run["Run Benchmarks"] --> export["Export Results<br/>(W&B)"]
end
save --> launcher
style cli fill:#e3f2fd,stroke:#2196f3
style launcher fill:#f3e5f5,stroke:#9c27b0
Deployment#
The model checkpoint (HuggingFace format) is converted to Megatron-Bridge format and deployed with Ray as an OpenAI API-compatible endpoint. The evaluator launcher handles deployment, benchmark execution, and result export in a single command.
Setting |
Value |
Notes |
|---|---|---|
Deployment backend |
MBridge + Ray |
In-framework NeMo deployment |
Minimum GPUs |
8 (1 node) |
Expert parallelism requires 8 GPUs |
Tensor parallelism (TP) |
1 |
Single-GPU tensor parallelism |
Expert parallelism (EP) |
8 |
One expert shard per GPU |
Benchmark Suite#
The evaluation covers six categories of benchmarks, matching the tech report evaluation protocol:
General Knowledge#
Benchmark |
Description |
|---|---|
MMLU-Pro |
Massive Multitask Language Understanding (Professional) |
Reasoning#
Benchmark |
Description |
|---|---|
AIME25 |
American Invitational Mathematics Examination 2025 (no tools / with tools) |
HMMT |
Harvard-MIT Mathematics Tournament (no tools) |
GPQA |
Graduate-Level Google-Proof QA (no tools / with tools) |
LiveCodeBench v5 |
Live competitive coding (2024-08 to 2025-05) |
SciCode |
Scientific coding (subtask) |
HLE |
Humanity’s Last Exam (no tools / with tools) |
Agentic#
Benchmark |
Description |
|---|---|
TerminalBench |
Terminal use (hard subset + v2.0) |
SWE-Bench |
Software engineering (OpenHands, OpenCode, Codex harnesses + Multilingual) |
TauBench V2 |
Conversational tool use (Airline, Retail, Telecom) |
BrowseComp |
Web browsing comprehension |
BIRD Bench |
Text-to-SQL (dev set, SQLite, execution accuracy) |
Chat & Instruction Following#
Benchmark |
Description |
|---|---|
IFBench |
Instruction following (prompt-level) |
Multi-Challenge |
Complex multi-constraint instructions |
Arena-Hard-V2 |
Hard prompt evaluation |
Long Context#
Benchmark |
Description |
|---|---|
AA-LCR |
Long-context reasoning |
RULER-100 |
Retrieval tasks at 256K, 512K, and 1M context |
Multilingual#
Benchmark |
Description |
|---|---|
MMLU-ProX |
Multilingual MMLU-Pro (averaged over languages) |
WMT24++ |
Machine translation (en→xx) |
Post-Trained Model Results#
Comparison against Qwen3.5-122B-A10B and GPT-OSS-120B (officially reported numbers are used when available; otherwise scores are computed using official evaluation settings):
Benchmark |
N-3-Super |
Qwen3.5-122B-A10B |
GPT-OSS-120B |
|---|---|---|---|
General Knowledge |
|||
MMLU-Pro |
83.73 |
86.70 |
81.00 |
Reasoning |
|||
AIME25 (no tools) |
90.21 |
90.36 |
92.50 |
HMMT (no tools) |
93.67 |
91.67 |
92.33 |
GPQA (no tools) |
79.23 |
86.60 |
80.10 |
GPQA (with tools) |
82.70 |
— |
80.09 |
LiveCodeBench v5 |
78.73 |
78.93 |
88.00 |
SciCode (subtask) |
42.05 |
42.00 |
39.00 |
HLE (no tools) |
18.26 |
25.30 |
14.90 |
HLE (with tools) |
22.82 |
— |
19.00 |
Agentic |
|||
TerminalBench (hard) |
22.30 |
26.80 |
24.00 |
TerminalBench 2.0 |
31.00 |
37.50 |
18.70 |
SWE-Bench (OpenHands) |
60.47 |
66.40 |
41.90 |
SWE-Bench (OpenCode) |
59.20 |
67.40 |
— |
SWE-Bench (Codex) |
53.73 |
61.20 |
— |
SWE-Bench Multilingual |
45.78 |
— |
30.80 |
TauBench V2 Airline |
66.20 |
66.00 |
49.20 |
TauBench V2 Retail |
62.80 |
62.60 |
67.80 |
TauBench V2 Telecom |
64.91 |
95.00 |
66.00 |
TauBench V2 Average |
64.64 |
74.53 |
61.00 |
BrowseComp |
31.28 |
TBD |
33.89 |
BIRD Bench |
41.80 |
— |
38.25 |
Chat & IF |
|||
IFBench (prompt) |
75.03 |
76.10 |
65.00 |
Multi-Challenge |
55.23 |
61.50 |
58.29 |
Arena-Hard-V2 |
73.88 |
75.15 |
90.26 |
Long Context |
|||
AA-LCR |
59.67 |
66.90 |
51.00 |
RULER-100 @ 256k |
96.30 |
TBD |
52.30 |
RULER-100 @ 512k |
95.67 |
TBD |
46.70 |
RULER-100 @ 1M |
91.75 |
TBD |
22.30 |
Multilingual |
|||
MMLU-ProX (avg) |
80.00 |
82.20 |
75.90 |
WMT24++ (en→xx) |
87.30 |
78.30 |
87.80 |
Base Model Validation Results#
The following table validates the MBridge deployment by comparing accuracy against research team numbers on the base (pretrained) model:
Benchmark |
MBridge Deployment |
Research Team |
Delta |
|---|---|---|---|
MMLU (5-shot) |
85.86 |
85.89 |
-0.03 |
ARC-Challenge (25-shot) |
95.82 |
95.65 |
+0.17 |
Winogrande (5-shot) |
78.37 |
78.69 |
-0.32 |
HellaSwag (10-shot) |
88.96 |
88.99 |
-0.03 |
OpenBookQA (0-shot) |
48.80 |
50.20 |
-1.40 |
Recipe Execution#
Quick Start#
// Evaluate the latest RL model from the pipeline
$ uv run nemotron super3 eval --run YOUR-CLUSTER
// Evaluate a specific model artifact
$ uv run nemotron super3 eval --run YOUR-CLUSTER run.model=sft:v2
// Filter to specific benchmarks
$ uv run nemotron super3 eval --run YOUR-CLUSTER -t adlr_mmlu -t hellaswag
// Dry run: preview the resolved config without executing
$ uv run nemotron super3 eval --dry-run
Note: The
--run YOUR-CLUSTERflag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.
Prerequisites#
NeMo Evaluator: Install with
pip install "nemotron[evaluator]"or ensurenemo-evaluator-launcheris availableHF_TOKEN: Required for gated models and some benchmark datasetsWeights & Biases: For result export (optional but recommended)
Slurm cluster: For remote execution
Configuration#
File |
Purpose |
|---|---|
|
Evaluation config with deployment and benchmark tasks |
Default Evaluation Tasks#
The recipe config ships with the following default benchmarks:
Task |
Benchmark |
Shots |
|---|---|---|
|
MMLU |
5-shot |
|
ARC-Challenge |
25-shot |
|
HellaSwag |
10-shot |
|
OpenBookQA |
0-shot |
|
Winogrande |
5-shot |
Artifact Resolution#
The default config uses ${art:model,path} for the model checkpoint:
run:
model: rl:latest # Resolve latest RL artifact
deployment:
checkpoint_path: ${art:model,path} # Resolved at runtime
Override the model artifact on the command line:
# Evaluate the SFT model instead of RL
uv run nemotron super3 eval --run YOUR-CLUSTER run.model=sft:latest
# Evaluate a specific version
uv run nemotron super3 eval --run YOUR-CLUSTER run.model=sft:v2
# Use an explicit path (bypasses artifact resolution)
uv run nemotron super3 eval --run YOUR-CLUSTER deployment.checkpoint_path=/path/to/checkpoint
Task Filtering#
Use -t/--task flags to run a subset of benchmarks:
# Single task
uv run nemotron super3 eval --run YOUR-CLUSTER -t adlr_mmlu
# Multiple tasks
uv run nemotron super3 eval --run YOUR-CLUSTER -t adlr_mmlu -t hellaswag -t arc_challenge
Direct Evaluation with nemo-evaluator-launcher#
You can run evaluation standalone without the nemotron CLI by using nemo-evaluator-launcher directly. This is useful for custom setups or when integrating into existing pipelines.
Upstream reproducibility guide: For full reproduction instructions (including config files and expected scores), see the Nemotron 3 Super reproducibility doc in the NeMo Evaluator repo.
1. Create a virtual environment and install:
python -m venv eval-venv
source eval-venv/bin/activate
pip install "nemo-evaluator-launcher[all]"
2. Set your HuggingFace token (required for gated models and some benchmarks):
export HF_TOKEN=<your-hf-token>
3. Run evaluation:
nemo-evaluator-launcher run --config /path/to/config.yaml
The config file follows the same schema as config/default.yaml. The launcher handles model deployment (MBridge + Ray), benchmark execution, and result export.
Running with NeMo-Run#
Configure execution profiles in env.toml:
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 1
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]
See Execution through NeMo-Run for complete configuration options.
W&B Integration#
Results are automatically exported to W&B when configured:
Auto-detection: The CLI detects your local
wandb loginand propagatesWANDB_API_KEYto evaluation containersenv.toml config:
WANDB_PROJECTandWANDB_ENTITYare loaded fromenv.tomlAuto-export: Results are exported after evaluation completes
See W&B Integration for setup.
Artifact Lineage#
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart TB
subgraph pipeline["Training Pipeline"]
pretrain["ModelArtifact-pretrain"] --> sft["ModelArtifact-sft"]
sft --> rl["ModelArtifact-rl"]
end
rl --> eval["nemotron super3 eval"]
sft -.-> eval
eval --> results["Evaluation Results<br/>(W&B)"]
style pipeline fill:#e1f5fe,stroke:#2196f3
style eval fill:#f3e5f5,stroke:#9c27b0
style results fill:#e8f5e9,stroke:#4caf50
Infrastructure#
This stage uses the following components:
Component |
Role |
Documentation |
|---|---|---|
Benchmark evaluation framework and launcher |
||
Ray-based in-framework model deployment |
Parallelism Configuration#
Setting |
Value |
Purpose |
|---|---|---|
|
1 |
Tensor parallelism per GPU |
|
8 |
Expert parallelism for MoE layers |
|
8 |
Total GPUs per node |
Troubleshooting#
Problem |
Solution |
|---|---|
|
Install with |
W&B authentication fails |
Run |
Model deployment fails |
Check parallelism settings match GPU config (TP=1, EP=8 for Super3) |
Artifact resolution fails |
Verify artifact exists in W&B. Use |
Task not found |
List available tasks with |
Previous Stages#
Stage 0: Pretraining — Pretrain the base model
Stage 1: SFT — Instruction tuning
Stage 2: RL — Reinforcement learning alignment
Stage 3: Quantization — Post-training quantization
Reference#
NeMo Evaluator — Upstream evaluation framework
Nemotron 3 Super Reproducibility Guide — Full reproduction instructions with configs and expected scores
Artifact Lineage — W&B artifact system
Execution through NeMo-Run — Cluster configuration
W&B Integration — Credentials and export setup
Recipe Source:
src/nemotron/recipes/super3/stage3_eval/