About Model Evaluation#

The eval/model_eval Nemotron step is a wrapper around NeMo Evaluator Launcher. It runs launcher tasks against either an existing OpenAI-compatible endpoint or a launcher-managed Megatron Bridge checkpoint deployment, then writes an eval_results artifact to disk.

Tip

New to model evaluation or the Nemotron CLI? Read Use The Model Evaluation Skill With Confidence for a short guide to productive agent sessions, then start the Getting Started With Model Evaluation tutorial to run one benchmark on one sample against a hosted endpoint.

When To Use#

Use eval/model_eval when the work matches one of the following.

Score a trained checkpoint with NeMo Evaluator Launcher tasks.
Compare a new training run against a baseline by running the same task set against both, with generation parameters and endpoint type held constant.
Run a one-sample smoke test against a hosted endpoint to confirm the URL, credential, and model ID before scaling up.
Capture before-and-after measurements around a training change by running a baseline evaluation before training, as described in Comparing Runs.

Pipeline At A Glance#

%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart LR
    ckpt["Hugging Face or<br/>Megatron Bridge checkpoint"] --> deploy["OpenAI-compatible<br/>endpoint"]
    hosted["Hosted endpoint"] --> deploy
    deploy --> step["eval/model_eval<br/>(NeMo Evaluator)"]
    step --> results["eval_results<br/>per-benchmark subdirs"]

NeMo Evaluator Launcher owns task execution and result files under output_dir. For the contract and the on-disk layout, refer to Output Artifacts.

How It Works#

The runner reads a single YAML document, applies command-line overrides, removes Nemotron-only keys, saves the resolved launcher config, and calls nemo_evaluator_launcher.api.functional.run_eval.
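The sequence above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the runner's actual code: the config is an in-memory dict rather than a parsed YAML file, the Nemotron-only key name `nemotron` and the file name `resolved_config.json` are hypothetical, the resolved config is saved as JSON to stay within the standard library, and the final call to `run_eval` is left as a comment.

```python
import copy
import json
import tempfile
from pathlib import Path

# Hypothetical key name: the source does not list which keys are Nemotron-only.
NEMOTRON_ONLY_KEYS = ("nemotron",)


def apply_overrides(config, overrides):
    """Return a copy of config with dotted overrides such as 'a.b=1' applied."""
    resolved = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = resolved
        for part in parents:
            node = node.setdefault(part, {})
        try:
            node[leaf] = json.loads(raw_value)  # numbers, booleans, null
        except json.JSONDecodeError:
            node[leaf] = raw_value  # anything else stays a plain string
    return resolved


def strip_nemotron_keys(config):
    """Drop top-level keys that only the wrapper understands."""
    return {k: v for k, v in config.items() if k not in NEMOTRON_ONLY_KEYS}


def resolve(config, overrides, out_dir):
    """Apply overrides, strip wrapper-only keys, and save the result."""
    launcher_config = strip_nemotron_keys(apply_overrides(config, overrides))
    path = Path(out_dir) / "resolved_config.json"  # hypothetical file name
    path.write_text(json.dumps(launcher_config, indent=2))
    # The real runner would now pass the config to
    # nemo_evaluator_launcher.api.functional.run_eval; omitted in this sketch.
    return launcher_config, path


example = {
    "nemotron": {"step": "eval/model_eval"},
    "evaluation": {
        "nemo_evaluator_config": {"config": {"params": {"parallelism": 1}}}
    },
}
with tempfile.TemporaryDirectory() as tmp:
    resolved, saved_path = resolve(
        example,
        ["evaluation.nemo_evaluator_config.config.params.parallelism=4"],
        tmp,
    )
    saved = json.loads(saved_path.read_text())
```

Working on a deep copy keeps the loaded document unchanged, so the saved file is the only record of what was actually resolved.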

The endpoint type must match the benchmark family. Chat and instruction benchmarks need a chat endpoint. Log-probability tasks, such as HellaSwag, need a completions endpoint with logprobs support and a tokenizer that matches the served model.
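The matching rule can be expressed as a small pre-run check. The sketch below is illustrative only: the family labels (`chat`, `instruction`, `logprob`), the endpoint labels, and the function `check_endpoint` are assumptions made for this example and are not launcher identifiers.

```python
# Illustrative mapping only: family and endpoint labels are assumptions.
REQUIRED_ENDPOINT = {
    "chat": "chat",
    "instruction": "chat",
    "logprob": "completions",
}


def check_endpoint(task_family, endpoint_type, has_logprobs=False, tokenizer=None):
    """Return a list of problems; an empty list means the pairing looks valid."""
    required = REQUIRED_ENDPOINT.get(task_family)
    if required is None:
        return [f"unknown task family: {task_family!r}"]

    problems = []
    if endpoint_type != required:
        problems.append(
            f"{task_family} tasks need a {required} endpoint, got {endpoint_type}"
        )
    if task_family == "logprob":
        # Log-probability tasks carry two extra requirements.
        if not has_logprobs:
            problems.append("endpoint must support logprobs")
        if not tokenizer:
            problems.append("a tokenizer matching the served model is required")
    return problems
```

Returning every problem at once, rather than raising on the first, lets a single pass report a wrong endpoint type and a missing tokenizer together.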

The hosted smoke-test config is tiny_chat.yaml. The checkpoint-evaluation config is default.yaml. Generation settings live under evaluation.nemo_evaluator_config.config.params.
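To make the nesting concrete, the following sketch walks the dotted path through a config that has already been loaded into a dict. The helper `generation_params` and the `temperature` value are hypothetical; only the path and the `parallelism` parameter come from this page.

```python
PARAMS_PATH = "evaluation.nemo_evaluator_config.config.params"


def generation_params(config):
    """Return the generation settings, or raise KeyError naming the missing section."""
    node = config
    walked = []
    for part in PARAMS_PATH.split("."):
        walked.append(part)
        if not isinstance(node, dict) or part not in node:
            raise KeyError("missing config section: " + ".".join(walked))
        node = node[part]
    return node


config = {
    "evaluation": {
        "nemo_evaluator_config": {
            "config": {
                # 'temperature' is a hypothetical parameter for illustration.
                "params": {"temperature": 0.0, "parallelism": 2}
            }
        }
    }
}
params = generation_params(config)
```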

For the full concept set behind these design rules, refer to Concepts.

Documentation#

Getting Started With Model Evaluation (tutorial, 15-30 min)
Run one benchmark on one sample against a hosted endpoint, end to end.

Use The Model Evaluation Skill With Confidence (newcomer, 10 min read)
Run a productive agent session: opening brief, four required inputs, and how SKILL.md keeps the session focused.

Model Evaluation How-To Guides (task-focused, 3 guides)
Discover the step, run a hosted evaluation, and evaluate a deployed checkpoint.

Model Evaluation Reference (lookup, 5 references)
YAML schema, command-line flags, output artifact layout, benchmark catalog, and troubleshooting.

Concepts (explanation, 3 pages)
Architecture, endpoint and benchmark families, and tokenizer alignment.

All Documentation#

Getting Started

Guide	What You Will Do	Time
Getting Started With Model Evaluation	Run a one-sample evaluation against a hosted endpoint	15-30 min
Use The Model Evaluation Skill With Confidence	Drive `eval/model_eval` from a coding agent	10 min read

How-To Guides

Guide	What You Will Do
Discover The Model Evaluation Step	List the step, read its contract, and decide whether it applies
Run A Hosted Evaluation	Run benchmarks against an already-running endpoint
Evaluate A Deployed Checkpoint	Pick a deployment path, then point the step at the endpoint

Reference

Reference	What You Will Find
Configuration Reference	YAML field reference for `default.yaml` and `tiny_chat.yaml`
CLI Reference	Flags and Hydra overrides for `nemotron steps run eval/model_eval`
Output Artifacts	`eval_results` contract and on-disk layout
Tasks Catalog	NeMo Evaluator Launcher task identifiers grouped by family
Troubleshooting	Named error modes from `step.toml`, with cause and recovery

Concepts

Concept	What You Will Learn
Concepts	Map of the concept pages and how they relate
Pipeline Overview	Artifact flow from checkpoint through `eval/model_eval` into `eval_results`
Endpoint Types And Task Families	Chat versus completions endpoints, and which benchmark families match each one
Tokenizer Alignment	Why log-probability benchmarks need a tokenizer that matches the served model

Before You Start#

The Nemotron repository is synced and uv sync has completed.
A bearer token is exported in the environment variable named by target.api_endpoint.api_key_name. Hosted smoke tests usually use NVIDIA_API_KEY.
The evaluation endpoint URL is reachable, and you have a model identifier that the endpoint advertises.
For log-probability tasks, a tokenizer that matches the served model is available. The hosted chat smoke test does not require a tokenizer override.
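The credential prerequisite lends itself to a quick preflight check before any request is sent. This is a minimal sketch that assumes the config is already loaded into a dict; the helper `missing_credential` is hypothetical and not part of the Nemotron CLI.

```python
import os


def missing_credential(config, environ=None):
    """Return the name of the unset token variable, or None when it is set."""
    environ = os.environ if environ is None else environ
    key_name = config["target"]["api_endpoint"]["api_key_name"]
    # An empty string is treated the same as an unset variable.
    return None if environ.get(key_name) else key_name


config = {"target": {"api_endpoint": {"api_key_name": "NVIDIA_API_KEY"}}}
unset_name = missing_credential(config, environ={})
```

Passing `environ` explicitly keeps the check testable without touching the real process environment; called without it, the function reads `os.environ`.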

Limitations And Considerations#

Cost: every benchmark sample issues at least one request to the endpoint, and hosted endpoints incur per-token cost.
Rate limits: hosted endpoints throttle concurrent requests, so set evaluation.nemo_evaluator_config.config.params.parallelism to a value the endpoint can serve.
Deployment: tiny_chat.yaml targets an already-deployed endpoint; default.yaml uses launcher-managed deployment for a Megatron Bridge checkpoint.
Comparability: scores are comparable only when the endpoint type, task version, tokenizer, and generation parameters are held constant across runs. Refer to Comparing Runs for how to set up a comparison.
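The comparability condition can be checked mechanically before two scores are put side by side. The sketch below assumes each run is summarized as a flat dict; the field names and the helper `comparability_diff` are hypothetical and do not reflect the layout of the eval_results artifact.

```python
# Hypothetical field names for a per-run summary; not the eval_results schema.
COMPARABILITY_FIELDS = (
    "endpoint_type",
    "task_version",
    "tokenizer",
    "generation_params",
)


def comparability_diff(baseline, candidate):
    """Map each mismatched field to its (baseline, candidate) pair of values."""
    return {
        field: (baseline.get(field), candidate.get(field))
        for field in COMPARABILITY_FIELDS
        if baseline.get(field) != candidate.get(field)
    }


baseline_run = {
    "endpoint_type": "chat",
    "task_version": "1",
    "tokenizer": None,
    "generation_params": {"parallelism": 2},
}
candidate_run = dict(baseline_run, generation_params={"parallelism": 8})
diff = comparability_diff(baseline_run, candidate_run)
```

An empty result means the four fields match; any entry names a setting to align before comparing scores.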

Related Documentation#