Getting Started With Model Evaluation
What You’ll Build: one NeMo Evaluator Launcher result directory for a single hosted chat smoke-test task, written by the eval/model_eval step.
In this tutorial, you will:
Discover the eval/model_eval step from the local catalog.
Inspect the hosted-endpoint sample config, tiny_chat.yaml.
Run a one-sample hosted chat evaluation.
List the result files on disk.
This tutorial requires between 15 and 30 minutes to complete, depending on endpoint latency.
Run a one-sample hosted chat evaluation with eval/model_eval and tiny_chat.yaml, then show me the launcher config and result files.
Prerequisites
Run all commands from the repository root.
Install the evaluator extra:
$ uv sync --extra evaluator
A reachable OpenAI-compatible chat-completions endpoint.
A model identifier advertised by that endpoint.
A bearer token exported as the environment variable referenced by target.api_endpoint.api_key_name.
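The last three prerequisites can be checked before any evaluation is launched. The following is a minimal sketch, assuming the variable names used in the sample configuration and in the Procedure (NEMO_EVALUATOR_MODEL_URL, NEMO_EVALUATOR_MODEL_ID, and a token variable whose name defaults to NVIDIA_API_KEY). The function name check_eval_env is hypothetical and is not part of the Nemotron CLI. It only checks that the values are non-empty; it does not contact the endpoint.

```shell
# Hypothetical preflight helper; it is not part of the Nemotron CLI.
# Prints each required variable that is unset or empty to stderr and
# returns non-zero if any are missing.
check_eval_env() {
  _missing=0
  # The name of the token variable is itself configurable.
  _key_name="${NEMO_EVALUATOR_API_KEY_NAME:-NVIDIA_API_KEY}"
  for _name in NEMO_EVALUATOR_MODEL_URL NEMO_EVALUATOR_MODEL_ID "$_key_name"; do
    case "$_name" in
      [!A-Za-z_]*|*[!A-Za-z0-9_]*)
        # Refuse names that are not plain identifiers before using eval.
        echo "invalid variable name: $_name" >&2
        _missing=1
        continue
        ;;
    esac
    # Indirect lookup that also works in plain POSIX sh.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "missing: $_name" >&2
      _missing=1
    fi
  done
  return "$_missing"
}
```

Run check_eval_env after exporting the values. A non-zero exit status means at least one prerequisite is still missing.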
About The Sample Configuration
The hosted chat sample file is at src/nemotron/steps/eval/model_eval/config/tiny_chat.yaml.
It sets deployment.type: none, points NeMo Evaluator Launcher at target.api_endpoint, and runs the chat-compatible mmlu_instruct task with limit_samples: 1.
# Tiny hosted chat endpoint smoke-test config.
#
# Export endpoint settings before running:
#   export NEMO_EVALUATOR_MODEL_ID=<exact model id>
#   export NEMO_EVALUATOR_MODEL_URL=<OpenAI-compatible chat completions endpoint URL>
#   export NEMO_EVALUATOR_API_KEY_NAME=NVIDIA_API_KEY
#   export NEMO_EVALUATOR_ENDPOINT_TYPE=chat
dry_run: false
output_dir: ./results-tiny-chat
task_filters: null
execution:
  type: local
  mode: sequential
  output_dir: ${output_dir}
deployment:
  type: none
target:
  api_endpoint:
    model_id: ${oc.env:NEMO_EVALUATOR_MODEL_ID,''}
    url: ${oc.env:NEMO_EVALUATOR_MODEL_URL,''}
    api_key_name: ${oc.env:NEMO_EVALUATOR_API_KEY_NAME,NVIDIA_API_KEY}
    type: ${oc.env:NEMO_EVALUATOR_ENDPOINT_TYPE,chat}
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0
        top_p: 1.0
        max_new_tokens: 1024
        max_retries: 5
        parallelism: 1
        request_timeout: 3600
        limit_samples: 1
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 5
          use_request_logging: true
          max_logged_requests: 5
  tasks:
    - name: mmlu_instruct
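The four target.api_endpoint fields use the ${oc.env:NAME,default} form, which reads as OmegaConf's environment resolver: the environment value is used when the variable is set, and the text after the comma is used otherwise. As a rough shell analogue, the sketch below previews what those fields would resolve to in the current shell. It assumes that reading of the resolver and is not how the launcher itself resolves the file. The function name preview_endpoint is hypothetical.

```shell
# Sketch: preview the endpoint fields the sample config would resolve to.
# ${VAR-default} (no colon) falls back only when VAR is unset, which mirrors
# an environment lookup with a default: a variable set to the empty string
# stays empty instead of picking up the default.
preview_endpoint() {
  printf '%s\n' "model_id: ${NEMO_EVALUATOR_MODEL_ID-}"
  printf '%s\n' "url: ${NEMO_EVALUATOR_MODEL_URL-}"
  printf '%s\n' "api_key_name: ${NEMO_EVALUATOR_API_KEY_NAME-NVIDIA_API_KEY}"
  printf '%s\n' "type: ${NEMO_EVALUATOR_ENDPOINT_TYPE-chat}"
}
```

An empty model_id or url in the preview means the corresponding variable was never exported, so the run would need the value supplied as a command-line override instead.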
Procedure
Clone the repository, if you haven’t already:
$ git clone https://github.com/NVIDIA-NeMo/Nemotron && cd Nemotron
Synchronize dependencies:
$ uv sync --extra evaluator
Export the endpoint values.
EVAL_ROOT is a directory you choose; it is the parent of the per-run output_dir.
$ export NVIDIA_API_KEY="<your-api-key>"
$ export NEMO_EVALUATOR_MODEL_URL="<full-chat-completions-url>"
$ export NEMO_EVALUATOR_MODEL_ID="<model-identifier-from-the-endpoint>"
$ export NEMO_EVALUATOR_ENDPOINT_TYPE=chat
$ export EVAL_ROOT="$(pwd)/output/eval-getting-started"
Confirm that the local catalog exposes eval/model_eval.
$ uv run --no-sync nemotron steps show eval/model_eval
Run the hosted chat smoke test.
$ uv run --no-sync nemotron steps run eval/model_eval \
    -c tiny_chat \
    output_dir="$EVAL_ROOT/results-tiny-chat" \
    target.api_endpoint.url="$NEMO_EVALUATOR_MODEL_URL" \
    target.api_endpoint.model_id="$NEMO_EVALUATOR_MODEL_ID" \
    target.api_endpoint.api_key_name=NVIDIA_API_KEY \
    target.api_endpoint.type=chat \
    evaluation.nemo_evaluator_config.config.params.limit_samples=1
The step writes the launcher config path to stdout. If NeMo Evaluator Launcher returns an invocation id, the step also prints status_command and logs_command values that you can run to inspect the job. Treat those commands as part of the run: wait until the launcher reports a terminal status before expecting final metric artifacts.
To inspect the merged Nemotron job config without invoking the launcher, add --dry-run. To pass NeMo Evaluator Launcher’s own dry-run flag, use the config override dry_run=true.
List the files written under the output directory after the launcher job reaches a terminal status.
$ find "$EVAL_ROOT/results-tiny-chat" -maxdepth 5 -type f | sort
The exact file names are owned by NeMo Evaluator Launcher and can vary by task version.
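Because the file names are not fixed, a per-extension count can be a quicker sanity check than reading the full listing. The following is a minimal sketch that reuses the same find depth limit as the command above. The function name summarize_results is hypothetical, and the sketch makes no assumption about which files the launcher writes.

```shell
# Hypothetical helper: count the files under a result directory by extension.
# $1 is the result directory, for example "$EVAL_ROOT/results-tiny-chat".
summarize_results() {
  find "$1" -maxdepth 5 -type f \
    | awk -F/ '{
        # $NF is the file name; take the text after its last dot.
        n = split($NF, parts, ".")
        ext = (n > 1) ? parts[n] : "(none)"
        count[ext]++
      }
      END { for (e in count) printf "%s %d\n", e, count[e] }' \
    | LC_ALL=C sort
}
```

Calling summarize_results "$EVAL_ROOT/results-tiny-chat" prints one line per extension with its file count. An empty result suggests the job has not reached a terminal status yet or wrote its output elsewhere.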
Next Steps
Run the standard checkpoint-evaluation config: Evaluate A Deployed Checkpoint.
Look up the full YAML schema: Configuration Reference.
Drive the step from a coding agent: Use The Model Evaluation Skill With Confidence.
Run hosted evaluations with custom task settings: Run A Hosted Evaluation.