Getting Started With Model Evaluation#

What You’ll Build: one NeMo Evaluator Launcher result directory for a single hosted chat smoke-test task, written by the eval/model_eval step.

In this tutorial, you will:

Discover the eval/model_eval step from the local catalog.
Inspect the hosted-endpoint sample config, tiny_chat.yaml.
Run a one-sample hosted chat evaluation.
List the result files on disk.

This tutorial requires between 15 and 30 minutes to complete, depending on endpoint latency.

Sample Prompt

Run a one-sample hosted chat evaluation with eval/model_eval and tiny_chat.yaml, then show me the launcher config and result files.

Prerequisites#

Run all commands from the repository root.
Install the evaluator extra:
```
$ uv sync --extra evaluator
```
A reachable OpenAI-compatible chat-completions endpoint.
A model identifier advertised by that endpoint.
A bearer token exported as the environment variable referenced by target.api_endpoint.api_key_name.

About The Sample Configuration#

The hosted chat sample file is at src/nemotron/steps/eval/model_eval/config/tiny_chat.yaml. It sets deployment.type: none, points NeMo Evaluator Launcher at target.api_endpoint, and runs the chat-compatible mmlu_instruct task with limit_samples: 1.

# Tiny hosted chat endpoint smoke-test config.
#
# Export endpoint settings before running:
#   export NEMO_EVALUATOR_MODEL_ID=<exact model id>
#   export NEMO_EVALUATOR_MODEL_URL=<OpenAI-compatible chat completions endpoint URL>
#   export NEMO_EVALUATOR_API_KEY_NAME=NVIDIA_API_KEY
#   export NEMO_EVALUATOR_ENDPOINT_TYPE=chat

dry_run: false
output_dir: ./results-tiny-chat
task_filters: null

execution:
  type: local
  mode: sequential
  output_dir: ${output_dir}

deployment:
  type: none

target:
  api_endpoint:
    model_id: ${oc.env:NEMO_EVALUATOR_MODEL_ID,''}
    url: ${oc.env:NEMO_EVALUATOR_MODEL_URL,''}
    api_key_name: ${oc.env:NEMO_EVALUATOR_API_KEY_NAME,NVIDIA_API_KEY}
    type: ${oc.env:NEMO_EVALUATOR_ENDPOINT_TYPE,chat}

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0
        top_p: 1.0
        max_new_tokens: 1024
        max_retries: 5
        parallelism: 1
        request_timeout: 3600
        limit_samples: 1
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 5
          use_request_logging: true
          max_logged_requests: 5
  tasks:
    - name: mmlu_instruct

Procedure#

Clone the repository, if you haven’t already:

$ git clone https://github.com/NVIDIA-NeMo/Nemotron && cd Nemotron

Synchronize dependencies:
```
$ uv sync --extra evaluator
```

Export the endpoint values. EVAL_ROOT is a directory you choose; it is the parent of the per-run output_dir.

$ export NVIDIA_API_KEY="<your-api-key>"
$ export NEMO_EVALUATOR_MODEL_URL="<full-chat-completions-url>"
$ export NEMO_EVALUATOR_MODEL_ID="<model-identifier-from-the-endpoint>"
$ export NEMO_EVALUATOR_ENDPOINT_TYPE=chat
$ export EVAL_ROOT="$(pwd)/output/eval-getting-started"

Confirm that the local catalog exposes eval/model_eval.

$ uv run --no-sync nemotron steps show eval/model_eval

Run the hosted chat smoke test.
```
$ uv run --no-sync nemotron steps run eval/model_eval \
    -c tiny_chat \
    output_dir="$EVAL_ROOT/results-tiny-chat" \
    target.api_endpoint.url="$NEMO_EVALUATOR_MODEL_URL" \
    target.api_endpoint.model_id="$NEMO_EVALUATOR_MODEL_ID" \
    target.api_endpoint.api_key_name=NVIDIA_API_KEY \
    target.api_endpoint.type=chat \
    evaluation.nemo_evaluator_config.config.params.limit_samples=1
```
The step writes the launcher config path to stdout. If NeMo Evaluator Launcher returns an invocation id, the step also prints status_command and logs_command values that you can run to inspect the job. Treat those commands as part of the run: wait until the launcher reports a terminal status before expecting final metric artifacts.

To inspect the merged Nemotron job config without invoking the launcher, add --dry-run. To pass NeMo Evaluator Launcher’s own dry-run flag, use the config override dry_run=true.
List the files written under the output directory after the launcher job reaches a terminal status.
```
$ find "$EVAL_ROOT/results-tiny-chat" -maxdepth 5 -type f | sort
```
The exact file names are owned by NeMo Evaluator Launcher and can vary by task version.

Next Steps#

Run the standard checkpoint-evaluation config: Evaluate A Deployed Checkpoint.
Look up the full YAML schema: Configuration Reference.
Drive the step from a coding agent: Use The Model Evaluation Skill With Confidence.
Run hosted evaluations with custom task settings: Run A Hosted Evaluation.