Getting Started With Model Evaluation#

What You’ll Build: one NeMo Evaluator Launcher result directory for a single hosted chat smoke-test task, written by the eval/model_eval step.

In this tutorial, you will:

  1. Discover the eval/model_eval step from the local catalog.

  2. Inspect the hosted-endpoint sample config, tiny_chat.yaml.

  3. Run a one-sample hosted chat evaluation.

  4. List the result files on disk.

This tutorial requires between 15 and 30 minutes to complete, depending on endpoint latency.

Sample Prompt

Run a one-sample hosted chat evaluation with eval/model_eval and tiny_chat.yaml, then show me the launcher config and result files.

Prerequisites#

  • Run all commands from the repository root.

  • Install the evaluator extra:

    $ uv sync --extra evaluator
    
  • A reachable OpenAI-compatible chat-completions endpoint.

  • A model identifier advertised by that endpoint.

  • A bearer token exported as the environment variable referenced by target.api_endpoint.api_key_name.

About The Sample Configuration#

The hosted chat sample file is at src/nemotron/steps/eval/model_eval/config/tiny_chat.yaml. It sets deployment.type: none, points NeMo Evaluator Launcher at target.api_endpoint, and runs the chat-compatible mmlu_instruct task with limit_samples: 1.

# Tiny hosted chat endpoint smoke-test config.
#
# Export endpoint settings before running:
#   export NEMO_EVALUATOR_MODEL_ID=<exact model id>
#   export NEMO_EVALUATOR_MODEL_URL=<OpenAI-compatible chat completions endpoint URL>
#   export NEMO_EVALUATOR_API_KEY_NAME=NVIDIA_API_KEY
#   export NEMO_EVALUATOR_ENDPOINT_TYPE=chat

dry_run: false
output_dir: ./results-tiny-chat
task_filters: null

execution:
  type: local
  mode: sequential
  output_dir: ${output_dir}

deployment:
  type: none

target:
  api_endpoint:
    model_id: ${oc.env:NEMO_EVALUATOR_MODEL_ID,''}
    url: ${oc.env:NEMO_EVALUATOR_MODEL_URL,''}
    api_key_name: ${oc.env:NEMO_EVALUATOR_API_KEY_NAME,NVIDIA_API_KEY}
    type: ${oc.env:NEMO_EVALUATOR_ENDPOINT_TYPE,chat}

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0
        top_p: 1.0
        max_new_tokens: 1024
        max_retries: 5
        parallelism: 1
        request_timeout: 3600
        limit_samples: 1
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 5
          use_request_logging: true
          max_logged_requests: 5
  tasks:
    - name: mmlu_instruct

Procedure#

  1. Clone the repository, if you haven’t already:

    $ git clone https://github.com/NVIDIA-NeMo/Nemotron && cd Nemotron
    
  2. Synchronize dependencies:

    $ uv sync --extra evaluator
    
  3. Export the endpoint values. EVAL_ROOT is a directory you choose; it is the parent of the per-run output_dir.

    $ export NVIDIA_API_KEY="<your-api-key>"
    $ export NEMO_EVALUATOR_MODEL_URL="<full-chat-completions-url>"
    $ export NEMO_EVALUATOR_MODEL_ID="<model-identifier-from-the-endpoint>"
    $ export NEMO_EVALUATOR_ENDPOINT_TYPE=chat
    $ export EVAL_ROOT="$(pwd)/output/eval-getting-started"
    
  4. Confirm that the local catalog exposes eval/model_eval.

    $ uv run --no-sync nemotron steps show eval/model_eval
    
  5. Run the hosted chat smoke test.

    $ uv run --no-sync nemotron steps run eval/model_eval \
        -c tiny_chat \
        output_dir="$EVAL_ROOT/results-tiny-chat" \
        target.api_endpoint.url="$NEMO_EVALUATOR_MODEL_URL" \
        target.api_endpoint.model_id="$NEMO_EVALUATOR_MODEL_ID" \
        target.api_endpoint.api_key_name=NVIDIA_API_KEY \
        target.api_endpoint.type=chat \
        evaluation.nemo_evaluator_config.config.params.limit_samples=1
    

    The step writes the launcher config path to stdout. If NeMo Evaluator Launcher returns an invocation id, the step also prints status_command and logs_command values that you can run to inspect the job. Treat those commands as part of the run: wait until the launcher reports a terminal status before expecting final metric artifacts.

    To inspect the merged Nemotron job config without invoking the launcher, add --dry-run. To pass NeMo Evaluator Launcher’s own dry-run flag, use the config override dry_run=true.

  6. List the files written under the output directory after the launcher job reaches a terminal status.

    $ find "$EVAL_ROOT/results-tiny-chat" -maxdepth 5 -type f | sort
    

    The exact file names are owned by NeMo Evaluator Launcher and can vary by task version.

Next Steps#