Stage 3: Evaluation#
Evaluate trained Nemotron Nano 3 models against standard benchmarks using NeMo Evaluator.
Different execution pattern: Unlike training stages that submit Python scripts via NeMo-Run, evaluation compiles the YAML config and passes it directly to nemo-evaluator-launcher. There is no recipe script: the CLI handles config compilation and artifact resolution, then delegates to the launcher.
How Evaluation Works#
The eval command resolves model artifacts from W&B lineage and uses NeMo Framework's Ray-based in-framework deployment. It defaults to evaluating the latest RL stage output.
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
subgraph cli["Nemotron CLI"]
direction LR
yaml["YAML Config"] --> compile["Config Compilation"]
compile --> save["Save job.yaml + eval.yaml"]
end
subgraph launcher["nemo-evaluator-launcher"]
direction LR
deploy["Deploy Model<br/>(NeMo Ray)"] --> run["Run Benchmarks"] --> export["Export Results<br/>(W&B)"]
end
save --> launcher
style cli fill:#e3f2fd,stroke:#2196f3
style launcher fill:#f3e5f5,stroke:#9c27b0
Config Compilation Pipeline#
The CLI performs several transformations on the YAML config before passing it to the launcher:
1. Load the YAML config via OmegaConf (with Hydra defaults resolution)
2. Merge `env.toml` profile values and CLI dotlist overrides
3. Auto-inject W&B credential mappings if W&B export is configured
4. Auto-squash container images for Slurm (converts Docker images to `.sqsh` files)
5. Strip the `run` section and resolve all `${run.*}` interpolations
6. Resolve artifact references (`${art:model,path}`) via W&B Artifacts
7. Pass the cleaned config to nemo-evaluator-launcher's `run_eval()`
Two YAML files are saved for provenance:
- `job.yaml`: full config including the `run` section (for reproducibility)
- `eval.yaml`: compiled config as seen by the launcher
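The strip-and-resolve steps above can be modeled in a few lines of plain Python. This is a toy sketch of the behavior, not the CLI's actual implementation (the real pipeline uses OmegaConf, and the sample `cfg` values here are illustrative):

```python
import re

def compile_config(cfg: dict) -> dict:
    """Toy model of the compile step: resolve ${run.*} interpolations,
    then strip the Nemotron-only `run` section before the config is
    handed to the launcher."""
    def lookup(path: str):
        node = cfg
        for key in path.split("."):
            node = node[key]
        return node

    def resolve(node):
        if isinstance(node, dict):
            return {k: resolve(v) for k, v in node.items()}
        if isinstance(node, str):
            match = re.fullmatch(r"\$\{(run\.[\w.]+)\}", node)
            return lookup(match.group(1)) if match else node
        return node

    compiled = resolve(cfg)
    compiled.pop("run", None)  # launcher never sees the `run` section
    return compiled

cfg = {
    "run": {"model": "rl:latest", "wandb": {"entity": "my-team", "project": "nemotron"}},
    "export": {"wandb": {"entity": "${run.wandb.entity}", "project": "${run.wandb.project}"}},
}
print(compile_config(cfg))
# {'export': {'wandb': {'entity': 'my-team', 'project': 'nemotron'}}}
```

Note that the `${run.*}` values are baked into the output before the `run` section is dropped, which is why `eval.yaml` contains concrete values where `job.yaml` still shows interpolations.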
Deployment#
The default config uses NeMo Framework's Ray-based in-framework deployment (type: generic) with a custom command for serving:
deployment:
type: generic
multiple_instances: true
image: nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
checkpoint_path: ${art:model,path}
port: 1235
command: >-
bash -c 'python deploy_ray_inframework.py
--megatron_checkpoint /checkpoint/
--num_gpus 8
--tensor_model_parallel_size 2
--expert_model_parallel_size 8
--port 1235'
Parallelism settings are tuned for the Nano3 30B MoE model:
| Setting | Value | Purpose |
|---|---|---|
| `tensor_model_parallel_size` | 2 | Tensor parallelism across GPUs |
| `expert_model_parallel_size` | 8 | Expert parallelism for MoE layers |
| `num_gpus` | 8 | Total GPUs per node |
| `port` | 1235 | Ray serving port |
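As a rough sanity check on these flags (illustrative only; Megatron's full set of parallelism constraints is more involved than simple divisibility):

```python
def check_parallelism(num_gpus: int, tp: int, ep: int) -> None:
    """Rough sanity check for the serving command's parallelism flags.

    Not Megatron's full rule set -- just the obvious divisibility
    constraints that catch most misconfigurations.
    """
    if num_gpus % tp != 0:
        raise ValueError(f"tensor_model_parallel_size={tp} must divide num_gpus={num_gpus}")
    if num_gpus % ep != 0:
        raise ValueError(f"expert_model_parallel_size={ep} must divide num_gpus={num_gpus}")

# The Nano3 defaults from the deployment command above:
check_parallelism(num_gpus=8, tp=2, ep=8)  # passes silently
```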
The model is deployed using the same NeMo Megatron container as training (nvcr.io/nvidia/nemo:25.11.nemotron_3_nano); nemo-evaluator-launcher pulls its own containers for evaluation tasks.
Evaluation Tasks#
Tasks are defined in the evaluation.tasks list. Each task maps to a benchmark supported by NeMo Evaluator:
evaluation:
tasks:
- name: adlr_mmlu
nemo_evaluator_config: # Optional per-task overrides
config:
params:
top_p: 0.0
- name: adlr_arc_challenge_llama_25_shot
- name: adlr_winogrande_5_shot
- name: hellaswag
- name: openbookqa
The default config includes five standard benchmarks:
| Task | Type | Description |
|---|---|---|
| `adlr_mmlu` | Text Generation | Massive Multitask Language Understanding |
| `adlr_arc_challenge_llama_25_shot` | Log Probability | ARC Challenge with 25-shot prompting |
| `adlr_winogrande_5_shot` | Log Probability | Winogrande commonsense reasoning |
| `hellaswag` | Log Probability | Commonsense sentence completion |
| `openbookqa` | Log Probability | Open-domain science questions |
To discover additional tasks: `nemo-evaluator-launcher ls tasks`
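Conceptually, each task's `nemo_evaluator_config` is layered over the global evaluation defaults. A plain-Python sketch of that layering (illustrative; the launcher's exact merge semantics may differ):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Global defaults from evaluation.nemo_evaluator_config ...
global_cfg = {"config": {"params": {"parallelism": 4, "request_timeout": 6000}}}
# ... overlaid with the per-task override shown for adlr_mmlu.
task_cfg = {"config": {"params": {"top_p": 0.0}}}

print(deep_merge(global_cfg, task_cfg))
# {'config': {'params': {'parallelism': 4, 'request_timeout': 6000, 'top_p': 0.0}}}
```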
Recipe Execution#
Quick Start#
# Evaluate the latest RL model from the pipeline
uv run nemotron nano3 eval --run YOUR-CLUSTER
# Evaluate a specific model artifact
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=sft:v2
# Filter to specific benchmarks
uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu -t hellaswag
# Dry run: preview the resolved config without executing
uv run nemotron nano3 eval --dry-run
Note: The `--run YOUR-CLUSTER` flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.
Prerequisites#
- NeMo Evaluator: Install with `pip install "nemotron[evaluator]"` or ensure `nemo-evaluator-launcher` is available
- Container image: `nvcr.io/nvidia/nemo:25.11.nemotron_3_nano` (NeMo Megatron container for model serving)
- Weights & Biases: For result export (optional but recommended)
- Slurm cluster: For remote execution
Configuration#
A single config file (under `src/nemotron/recipes/nano3/stage3_eval/`) provides the evaluation config with NeMo Ray deployment and benchmark tasks.
The config has five sections:
# Nemotron extension (stripped before passing to launcher)
run:
model: rl:latest # W&B artifact reference
env: # Populated from env.toml profile
container: nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
executor: slurm
host: ${oc.env:HOSTNAME,localhost}
...
wandb:
entity: null
project: null
# Passed directly to nemo-evaluator-launcher
execution:
type: slurm
num_nodes: 1
gres: gpu:8
auto_export:
enabled: true
destinations: [wandb]
deployment:
type: generic # NeMo Framework Ray
checkpoint_path: ${art:model,path} # Resolved from W&B artifact
command: >-
bash -c 'python deploy_ray_inframework.py
--megatron_checkpoint /checkpoint/
--num_gpus 8
--tensor_model_parallel_size 2
--expert_model_parallel_size 8
--port 1235'
evaluation:
nemo_evaluator_config:
config:
params:
parallelism: 4
request_timeout: 6000
tasks:
- name: adlr_mmlu
- name: adlr_arc_challenge_llama_25_shot
- name: adlr_winogrande_5_shot
- name: hellaswag
- name: openbookqa
export:
wandb:
entity: ${run.wandb.entity}
project: ${run.wandb.project}
| Section | Purpose | Passed to Launcher? |
|---|---|---|
| `run` | env.toml injection, artifact references | No (stripped) |
| `execution` | Where to run, auto-export, mounts | Yes |
| `deployment` | How to serve the model | Yes |
| `evaluation` | Tasks and evaluation parameters | Yes |
| `export` | Result destinations (W&B) | Yes |
Artifact Resolution#
The default config uses ${art:model,path} for the model checkpoint:
run:
model: rl:latest # Resolve latest RL artifact
deployment:
checkpoint_path: ${art:model,path} # Resolved at runtime
Override the model artifact on the command line:
# Evaluate the SFT model instead of RL
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=sft:latest
# Evaluate a specific version
uv run nemotron nano3 eval --run YOUR-CLUSTER run.model=sft:v2
# Use an explicit path (bypasses artifact resolution)
uv run nemotron nano3 eval --run YOUR-CLUSTER deployment.checkpoint_path=/path/to/checkpoint
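What `${art:model,path}` resolution amounts to can be sketched as follows. `resolve_artifact_path` is a hypothetical helper for illustration; the real CLI resolves and downloads the checkpoint through the W&B Artifacts API:

```python
def resolve_artifact_path(ref: str, entity: str, project: str) -> str:
    """Hypothetical helper: expand a short ref like 'rl:latest' into a
    fully qualified W&B artifact name."""
    name, _, alias = ref.partition(":")
    alias = alias or "latest"   # bare names default to the latest version
    # The real pipeline would then fetch the checkpoint, e.g.:
    #   wandb.Api().artifact(f"{entity}/{project}/{name}:{alias}").download()
    return f"{entity}/{project}/{name}:{alias}"

print(resolve_artifact_path("sft:v2", "my-team", "nemotron"))
# my-team/nemotron/sft:v2
```

Setting `deployment.checkpoint_path` to a filesystem path skips this resolution entirely.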
Task Filtering#
Use -t/--task flags to run a subset of benchmarks:
# Single task
uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu
# Multiple tasks
uv run nemotron nano3 eval --run YOUR-CLUSTER -t adlr_mmlu -t hellaswag -t openbookqa
Override Examples#
# Increase evaluation parallelism
uv run nemotron nano3 eval evaluation.nemo_evaluator_config.config.params.parallelism=16
# Change walltime
uv run nemotron nano3 eval --run YOUR-CLUSTER run.env.time=08:00:00
Running with NeMo-Run#
Configure execution profiles in env.toml:
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 1
ntasks_per_node = 1
gpus_per_node = 8
mounts = ["/lustre:/lustre"]
See Execution through NeMo-Run for complete configuration options.
W&B Integration#
Results are automatically exported to W&B when configured:
- Auto-detection: The CLI detects your local `wandb login` and propagates `WANDB_API_KEY` to evaluation containers
- env.toml config: `WANDB_PROJECT` and `WANDB_ENTITY` are loaded from `env.toml`
- Auto-export: Results are exported after evaluation completes when `execution.auto_export.destinations` includes `wandb`
See W&B Integration for setup.
Artifact Lineage#
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart TB
subgraph pipeline["Training Pipeline"]
pretrain["ModelArtifact-pretrain"] --> sft["ModelArtifact-sft"]
sft --> rl["ModelArtifact-rl"]
end
rl --> eval["nemotron nano3 eval"]
sft -.-> eval
eval --> results["Evaluation Results<br/>(W&B)"]
style pipeline fill:#e1f5fe,stroke:#2196f3
style eval fill:#f3e5f5,stroke:#9c27b0
style results fill:#e8f5e9,stroke:#4caf50
Infrastructure#
This stage uses the following components:
| Component | Role |
|---|---|
| NeMo Evaluator | Benchmark evaluation framework and launcher |
| NeMo Framework | Ray-based in-framework model deployment |
Container#
nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
The NeMo Megatron container is used for model serving. The nemo-evaluator-launcher pulls its own containers for running evaluation tasks.
CLI Reference#
$ uv run nemotron nano3 eval --help
Usage: nemotron nano3 eval [OPTIONS]
Run evaluation with NeMo-Evaluator (stage3).
╭─ Options ─────────────────────────────────────────────────────────────────╮
│ -c, --config NAME    Config name or path                                  │
│ -r, --run PROFILE    Submit to cluster (attached)                         │
│ -b, --batch PROFILE  Submit to cluster (detached)                         │
│ -d, --dry-run        Preview config without execution                     │
│ -t, --task NAME      Filter to specific task (repeatable)                 │
╰───────────────────────────────────────────────────────────────────────────╯
╭─ Artifact Overrides ──────────────────────────────────────────────────────╮
│ run.model            Model checkpoint artifact (default: rl:latest)       │
╰───────────────────────────────────────────────────────────────────────────╯
Troubleshooting#
| Problem | Solution |
|---|---|
| `nemo-evaluator-launcher` not found | Install with `pip install "nemotron[evaluator]"` |
| W&B authentication fails | Run `wandb login` |
| Model deployment fails | Check parallelism settings match GPU config (TP=2, EP=8 for Nano3) |
| Artifact resolution fails | Verify the artifact exists in W&B; an explicit `deployment.checkpoint_path` bypasses artifact resolution |
| Task not found | List available tasks with `nemo-evaluator-launcher ls tasks` |
Previous Stage#
After RL completes in Stage 2: RL, evaluation is the final step in the pipeline.
Reference#
NeMo Evaluator โ Upstream evaluation framework
Artifact Lineage โ W&B artifact system
Execution through NeMo-Run โ Cluster configuration
W&B Integration โ Credentials and export setup
Recipe Source:
src/nemotron/recipes/nano3/stage3_eval/