About Model Evaluation#

The eval/model_eval Nemotron step is a wrapper around NeMo Evaluator Launcher. It runs launcher tasks against either an existing OpenAI-compatible endpoint or a launcher-managed Megatron Bridge checkpoint deployment, then writes an eval_results artifact to disk.

Tip

New to model evaluation or the Nemotron CLI? Read Use The Model Evaluation Skill With Confidence for a short guide to productive agent sessions, then start the Getting Started With Model Evaluation tutorial to run one benchmark on one sample against a hosted endpoint.

When To Use#

Use eval/model_eval to do any of the following.

  • Score a trained checkpoint with NeMo Evaluator Launcher tasks.

  • Compare a new training run against a baseline by running the same task set against both, with generation parameters and endpoint type held constant.

  • Perform a sample run against a hosted endpoint to confirm the URL, credential, and model identifier before scaling up.

  • Capture before-and-after measurements around a training change by pairing this step with a baseline evaluation run before training. Refer to Comparing Runs.

Pipeline At A Glance#

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart LR
    ckpt["Hugging Face or<br/>Megatron Bridge checkpoint"] --> deploy["OpenAI-compatible<br/>endpoint"]
    hosted["Hosted endpoint"] --> deploy
    deploy --> step["eval/model_eval<br/>(NeMo Evaluator)"]
    step --> results["eval_results<br/>per-benchmark subdirs"]
    

NeMo Evaluator Launcher owns task execution and result files under output_dir. For the contract and the on-disk layout, refer to Output Artifacts.

How It Works#

The runner reads a single YAML document, applies command-line overrides, removes Nemotron-only keys, saves the resolved launcher config, and calls nemo_evaluator_launcher.api.functional.run_eval.
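The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the step's actual implementation: the helper names `apply_overrides` and `strip_keys`, the `key=value` override syntax, and the wrapper-only key `nemotron` are assumptions made for the example. The resolved config is written as JSON only to stay within the standard library, and the final call to run_eval is left as a comment because its signature is not described on this page.

```python
import copy
import json
import tempfile
from pathlib import Path


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    resolved = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = resolved
        for part in parents:
            node = node.setdefault(part, {})
        # Parse numbers and booleans as JSON; fall back to a plain string.
        try:
            node[leaf] = json.loads(raw_value)
        except json.JSONDecodeError:
            node[leaf] = raw_value
    return resolved


def strip_keys(config: dict, wrapper_only: set[str]) -> dict:
    """Drop top-level keys that only the wrapper understands."""
    return {key: value for key, value in config.items() if key not in wrapper_only}


# Stand-in for the parsed YAML document (hypothetical values).
loaded = {
    "nemotron": {"step": "eval/model_eval"},  # hypothetical wrapper-only key
    "evaluation": {
        "nemo_evaluator_config": {"config": {"params": {"parallelism": 4}}}
    },
}
overrides = ["evaluation.nemo_evaluator_config.config.params.parallelism=1"]

resolved = strip_keys(apply_overrides(loaded, overrides), {"nemotron"})

# Save the resolved config so the run can be inspected or reproduced later.
out_dir = Path(tempfile.mkdtemp())
(out_dir / "resolved_config.json").write_text(json.dumps(resolved, indent=2))

# The real step would now pass the resolved config to
# nemo_evaluator_launcher.api.functional.run_eval.
```

Copying the loaded document before applying overrides keeps the original YAML content untouched, which makes it easy to compare what was written in the file with what was actually sent to the launcher.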

The endpoint type must match the benchmark family. Chat and instruction benchmarks need a chat endpoint. Log-probability tasks, such as HellaSwag, need a completions endpoint with logprobs support and a tokenizer that matches the served model.
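A pre-flight check along the following lines can catch a mismatch before any request is sent. It is a sketch under assumptions: the family labels, the `REQUIRED_ENDPOINT` table, and the function `check_endpoint` are hypothetical and are not part of NeMo Evaluator Launcher.

```python
# Hypothetical mapping from benchmark family to the endpoint type it needs.
REQUIRED_ENDPOINT = {
    "chat": "chat",
    "instruction": "chat",
    "logprob": "completions",
}


def check_endpoint(task_family, endpoint_type, has_logprobs=False, tokenizer=None):
    """Return a list of problems; an empty list means the pairing looks valid."""
    required = REQUIRED_ENDPOINT.get(task_family)
    if required is None:
        return [f"unknown task family: {task_family!r}"]

    problems = []
    if endpoint_type != required:
        problems.append(
            f"{task_family} tasks need a {required} endpoint, got {endpoint_type}"
        )
    if task_family == "logprob":
        # Log-probability tasks have two extra requirements.
        if not has_logprobs:
            problems.append("endpoint must support logprobs")
        if not tokenizer:
            problems.append("a tokenizer matching the served model is required")
    return problems


# A log-probability task such as HellaSwag pointed at a chat endpoint.
for problem in check_endpoint("logprob", "chat"):
    print(problem)
```

Returning every problem at once, rather than stopping at the first, lets a single dry run surface both the endpoint mismatch and the missing tokenizer.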

The hosted smoke-test config is tiny_chat.yaml. The checkpoint-evaluation config is default.yaml. Generation settings live under evaluation.nemo_evaluator_config.config.params.

For the full concept set behind these design rules, refer to Concepts.

Documentation#

  • Getting Started: Run one benchmark on one sample against a hosted endpoint, end to end. Refer to Getting Started With Model Evaluation.

  • Use The Model Evaluation Skill With Confidence: Run a productive agent session, covering the opening brief, the four required inputs, and how SKILL.md keeps the session focused.

  • How-To Guides: Discover the step, run a hosted evaluation, and evaluate a deployed checkpoint. Refer to Model Evaluation How-To Guides.

  • Reference: YAML schema, command-line flags, output artifact layout, benchmark catalog, and troubleshooting. Refer to Model Evaluation Reference.

  • Concepts: Architecture, endpoint and benchmark families, and tokenizer alignment.

All Documentation#

| Guide | What You Will Do | Time |
|---|---|---|
| Getting Started With Model Evaluation | Run a one-sample evaluation against a hosted endpoint | 15-30 min |
| Use The Model Evaluation Skill With Confidence | Drive eval/model_eval from a coding agent | 10 min read |

| Guide | What You Will Do |
|---|---|
| Discover The Model Evaluation Step | List the step, read its contract, and decide whether it applies |
| Run A Hosted Evaluation | Run benchmarks against an already-running endpoint |
| Evaluate A Deployed Checkpoint | Pick a deployment path, then point the step at the endpoint |

| Reference | What You Will Find |
|---|---|
| Configuration Reference | YAML field reference for default.yaml and tiny_chat.yaml |
| CLI Reference | Flags and Hydra overrides for nemotron steps run eval/model_eval |
| Output Artifacts | eval_results contract and on-disk layout |
| Tasks Catalog | NeMo Evaluator Launcher task identifiers grouped by family |
| Troubleshooting | Named error modes from step.toml, with cause and recovery |

| Concept | What You Will Learn |
|---|---|
| Concepts | Map of the concept pages and how they relate |
| Pipeline Overview | Artifact flow from checkpoint through eval/model_eval into eval_results |
| Endpoint Types And Task Families | Chat versus completions endpoints, and which benchmark families match each one |
| Tokenizer Alignment | Why log-probability benchmarks need a tokenizer that matches the served model |

Before You Start#

  • The Nemotron repository is synced and uv sync is complete.

  • A bearer token is exported as the environment variable named in target.api_endpoint.api_key_name. Hosted smoke tests usually use NVIDIA_API_KEY.

  • An evaluation endpoint URL is reachable, and you know a model identifier that the endpoint advertises.

  • A tokenizer that matches the served model when running log-probability tasks. The hosted chat smoke test does not require a tokenizer override.
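Some of the prerequisites above can be checked in code before launching a run. The sketch below is illustrative only: the function `missing_prerequisites` is hypothetical, and the field names `url` and `model_id` under target.api_endpoint are assumed for the example, so confirm them against the Configuration Reference. Only api_key_name is taken from this page.

```python
import os


def missing_prerequisites(config: dict, environ=os.environ) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    endpoint = config.get("target", {}).get("api_endpoint", {})
    problems = []

    # The config names the environment variable that must hold the bearer token.
    key_name = endpoint.get("api_key_name")
    if not key_name:
        problems.append("target.api_endpoint.api_key_name is not set")
    elif not environ.get(key_name):
        problems.append(f"environment variable {key_name} is empty or unset")

    # Assumed field names for the endpoint URL and the model identifier.
    for field in ("url", "model_id"):
        if not endpoint.get(field):
            problems.append(f"target.api_endpoint.{field} is not set")
    return problems


# Hypothetical hosted smoke-test settings.
example_config = {
    "target": {
        "api_endpoint": {
            "api_key_name": "NVIDIA_API_KEY",
            "url": "https://example.invalid/v1/chat/completions",
            "model_id": "example/model",
        }
    }
}

for problem in missing_prerequisites(example_config):
    print(problem)
```

The check reads only the name of the variable from the config and looks up its value in the environment, so the token itself never needs to appear in a YAML file.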

Limitations And Considerations#

  • Cost: every benchmark sample issues at least one request to the endpoint, and hosted endpoints typically charge per token, so cost grows with the number of samples.

  • Rate limits: hosted endpoints throttle concurrent requests, so set evaluation.nemo_evaluator_config.config.params.parallelism to a value the endpoint can serve.

  • Deployment: tiny_chat.yaml targets an already-deployed endpoint; default.yaml uses launcher-managed deployment for a Megatron Bridge checkpoint.

  • Comparability: scores are comparable when the endpoint type, task version, tokenizer, and generation parameters are held constant across runs. The Comparing Runs section explains the framing.
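The comparability rule in the last item can be expressed as a small check that compares the settings of two runs. This is a sketch, assuming each run is summarized as a plain dictionary. The field names, the example values, and the function `comparability_diff` are hypothetical and do not reflect the eval_results format.

```python
# Settings that must match for two scores to be comparable (hypothetical names).
COMPARABILITY_FIELDS = (
    "endpoint_type",
    "task_version",
    "tokenizer",
    "generation_params",
)


def comparability_diff(run_a: dict, run_b: dict) -> dict:
    """Map each mismatched field to its (run_a, run_b) values."""
    return {
        field: (run_a.get(field), run_b.get(field))
        for field in COMPARABILITY_FIELDS
        if run_a.get(field) != run_b.get(field)
    }


baseline = {
    "endpoint_type": "chat",
    "task_version": "1.0",
    "tokenizer": "example/tokenizer",
    "generation_params": {"temperature": 0.0, "max_new_tokens": 512},
}
# Same settings as the baseline except for the sampling temperature.
candidate = {
    **baseline,
    "generation_params": {"temperature": 0.7, "max_new_tokens": 512},
}

mismatches = comparability_diff(baseline, candidate)
if mismatches:
    print("Scores are not directly comparable; differing fields:", sorted(mismatches))
```

An empty result means the four settings match, so a score difference between the two runs can be attributed to the model change.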