About Model Evaluation#
The eval/model_eval Nemotron step is a wrapper around NeMo Evaluator Launcher.
It runs launcher tasks against either an existing OpenAI-compatible endpoint or a launcher-managed Megatron Bridge checkpoint deployment, then writes an eval_results artifact to disk.
Tip
New to model evaluation or the Nemotron CLI? Read Use The Model Evaluation Skill With Confidence for a short guide to productive agent sessions, then start the Getting Started With Model Evaluation tutorial to run one benchmark on one sample against a hosted endpoint.
When To Use#
Use eval/model_eval when the work matches one of the following.
Score a trained checkpoint with NeMo Evaluator Launcher tasks.
Compare a new training run against a baseline by running the same task set against both, with generation parameters and endpoint type held constant.
Run a small sample against a hosted endpoint to confirm the URL, credential, and model id before scaling up.
To capture before-and-after measurements around a training change, run a baseline evaluation before training and repeat this step afterward, following Comparing Runs.
Pipeline At A Glance#
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333', 'clusterBkg': '#ffffff', 'clusterBorder': '#333333'}}}%%
flowchart LR
ckpt["Hugging Face or<br/>Megatron Bridge checkpoint"] --> deploy["OpenAI-compatible<br/>endpoint"]
hosted["Hosted endpoint"] --> deploy
deploy --> step["eval/model_eval<br/>(NeMo Evaluator)"]
step --> results["eval_results<br/>per-benchmark subdirs"]
NeMo Evaluator Launcher owns task execution and result files under output_dir.
For the contract and the on-disk layout, refer to Output Artifacts.
How It Works#
The runner reads a single YAML document, applies command-line overrides, removes Nemotron-only keys, saves the resolved launcher config, and calls nemo_evaluator_launcher.api.functional.run_eval.
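The sequence above can be sketched in plain Python. This is an illustration under stated assumptions, not the runner's actual code: the `NEMOTRON_ONLY_KEYS` set, the helper names, and the dotted `key=value` override syntax are hypothetical, and the final call to nemo_evaluator_launcher.api.functional.run_eval is left as a comment because its signature is not described on this page.

```python
import copy
import json

# Hypothetical: the real set of Nemotron-only keys is not listed on this page.
NEMOTRON_ONLY_KEYS = {"run", "artifacts"}


def apply_overrides(config, overrides):
    """Return a copy of config with dotted key=value overrides applied."""
    resolved = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = resolved
        for part in parents:
            node = node.setdefault(part, {})
        # Parse numbers and booleans as JSON; fall back to the raw string.
        try:
            node[leaf] = json.loads(raw_value)
        except json.JSONDecodeError:
            node[leaf] = raw_value
    return resolved


def strip_nemotron_keys(config):
    """Drop top-level keys that only the wrapper understands."""
    return {k: v for k, v in config.items() if k not in NEMOTRON_ONLY_KEYS}


def resolve_config(config, overrides):
    """Overrides first, then key removal, mirroring the order described above."""
    return strip_nemotron_keys(apply_overrides(config, overrides))


example = {
    "run": {"name": "demo"},
    "evaluation": {
        "nemo_evaluator_config": {"config": {"params": {"parallelism": 4}}}
    },
}
launcher_config = resolve_config(
    example, ["evaluation.nemo_evaluator_config.config.params.parallelism=1"]
)
print(json.dumps(launcher_config, indent=2))
# The real runner would next save this resolved config and pass it to
# nemo_evaluator_launcher.api.functional.run_eval.
```

Working on a deep copy keeps the document read from disk unchanged, so the saved resolved config can be compared with the original input.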
The endpoint type must match the benchmark family.
Chat and instruction benchmarks need a chat endpoint.
Log-probability tasks, such as HellaSwag, need a completions endpoint with logprobs support and a tokenizer that matches the served model.
The hosted smoke-test config is tiny_chat.yaml.
The checkpoint-evaluation config is default.yaml.
Generation settings live under evaluation.nemo_evaluator_config.config.params.
For the full concept set behind these design rules, refer to Concepts.
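The endpoint and benchmark pairing rule can be written as a small pre-run check. The sketch below is a hedged illustration: the family labels (`chat`, `instruction`, `logprob`) and the `check_endpoint` helper are hypothetical names chosen for this example, and the launcher's own task catalog remains the authority on which family a task belongs to. The check confirms only that a tokenizer was supplied, not that it matches the served model.

```python
# Hypothetical family labels mapped to the endpoint type each one needs.
REQUIRED_ENDPOINT = {
    "chat": "chat",
    "instruction": "chat",
    "logprob": "completions",
}


def check_endpoint(family, endpoint_type, has_logprobs=False, tokenizer=None):
    """Return a list of problems; an empty list means the pairing looks valid."""
    required = REQUIRED_ENDPOINT.get(family)
    if required is None:
        return [f"unknown benchmark family: {family!r}"]

    problems = []
    if endpoint_type != required:
        problems.append(
            f"{family} benchmarks need a {required} endpoint, got {endpoint_type}"
        )
    if family == "logprob":
        if not has_logprobs:
            problems.append("log-probability tasks need logprobs support")
        if not tokenizer:
            problems.append(
                "log-probability tasks need a tokenizer that matches the served model"
            )
    return problems


# A log-probability task pointed at a chat endpoint fails all three checks.
for problem in check_endpoint("logprob", "chat"):
    print(problem)
```

Returning every problem at once, instead of raising on the first, lets a single dry run surface all the mismatches before any request is sent.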
Documentation#
Run one benchmark on one sample against a hosted endpoint, end to end.
Run a productive agent session: opening brief, four required inputs, and how SKILL.md keeps the session focused.
Discover the step, run a hosted evaluation, and evaluate a deployed checkpoint.
YAML schema, command-line flags, output artifact layout, benchmark catalog, and troubleshooting.
Architecture, endpoint and benchmark families, and tokenizer alignment.
All Documentation#
| Guide | What You Will Do | Time |
|---|---|---|
| | Run a one-sample evaluation against a hosted endpoint | 15-30 min |
| | Drive | 10 min read |

| Guide | What You Will Do |
|---|---|
| | List the step, read its contract, and decide whether it applies |
| | Run benchmarks against an already-running endpoint |
| | Pick a deployment path, then point the step at the endpoint |

| Reference | What You Will Find |
|---|---|
| | YAML field reference for |
| | Flags and Hydra overrides for |
| | |
| | NeMo Evaluator Launcher task identifiers grouped by family |
| | Named error modes from |

| Concept | What You Will Learn |
|---|---|
| | Map of the concept pages and how they relate |
| | Artifact flow from checkpoint through |
| | Chat versus completions endpoints, and which benchmark families match each one |
| | Why log-probability benchmarks need a tokenizer that matches the served model |
Before You Start#
The Nemotron repository is synced and uv sync is complete.
A bearer token is exported as the environment variable named in target.api_endpoint.api_key_name. Hosted smoke tests usually use NVIDIA_API_KEY.
A reachable evaluation endpoint URL and a model identifier the endpoint advertises.
A tokenizer that matches the served model when running log-probability tasks. The hosted chat smoke test does not require a tokenizer override.
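The credential and endpoint prerequisites above can be checked before launching a run. The following is a minimal sketch, assuming the config has already been loaded into a dictionary. The `missing_prerequisites` helper, the `url` and `model_id` field names under target.api_endpoint, and the example URL and model name are assumptions made for illustration. Only api_key_name is taken from this page.

```python
import os


def missing_prerequisites(config, environ=os.environ):
    """Return human-readable problems found in the target endpoint settings."""
    endpoint = config.get("target", {}).get("api_endpoint", {})
    problems = []

    key_name = endpoint.get("api_key_name")
    if not key_name:
        problems.append("target.api_endpoint.api_key_name is not set")
    elif not environ.get(key_name):
        problems.append(f"environment variable {key_name} is not exported")

    # 'url' and 'model_id' are assumed field names for this sketch.
    for field in ("url", "model_id"):
        if not endpoint.get(field):
            problems.append(f"target.api_endpoint.{field} is not set")
    return problems


config = {
    "target": {
        "api_endpoint": {
            "api_key_name": "NVIDIA_API_KEY",
            "url": "https://example.invalid/v1/chat/completions",  # placeholder
            "model_id": "example/model",  # placeholder
        }
    }
}
for problem in missing_prerequisites(config):
    print(problem)
```

The check reads only the variable name from the config and looks the value up in the environment, so the token itself never has to appear in a YAML file.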
Limitations And Considerations#
Cost: every benchmark sample issues at least one request to the endpoint, and hosted endpoints incur per-token cost.
Rate limits: hosted endpoints throttle concurrent requests, so set evaluation.nemo_evaluator_config.config.params.parallelism to a value the endpoint can serve.
Deployment: tiny_chat.yaml targets an already-deployed endpoint; default.yaml uses launcher-managed deployment for a Megatron Bridge checkpoint.
Comparability: scores are comparable when the endpoint type, task version, tokenizer, and generation parameters are held constant across runs. The Comparing Runs section explains the framing.
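The comparability condition can be expressed as a field-by-field comparison of two run summaries. This is a sketch under assumptions: the flat summary dictionaries, their key names, and the `comparability_diffs` helper are hypothetical and do not reflect the eval_results layout. The values shown are placeholders.

```python
# Hypothetical summary keys for the four settings that must be held constant.
COMPARED_FIELDS = ("endpoint_type", "task_version", "tokenizer", "generation_params")


def comparability_diffs(run_a, run_b):
    """Map each differing field to its (run_a, run_b) values; empty means comparable."""
    return {
        field: (run_a.get(field), run_b.get(field))
        for field in COMPARED_FIELDS
        if run_a.get(field) != run_b.get(field)
    }


baseline = {
    "endpoint_type": "chat",
    "task_version": "v1",
    "tokenizer": "example/tokenizer",
    "generation_params": {"temperature": 0.0, "max_new_tokens": 256},
}
candidate = dict(baseline, generation_params={"temperature": 0.7, "max_new_tokens": 256})

diffs = comparability_diffs(baseline, candidate)
if diffs:
    print("Scores are not directly comparable; differing fields:", sorted(diffs))
```

An empty result means the four settings match. It says nothing about other sources of variation, such as sampling randomness.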