Discover The Model Evaluation Step

This guide shows how to find eval/model_eval in the step catalog, how to read its contract, and how to decide whether it applies.

Prerequisites

The Nemotron repository is synced.
A local checkout is sufficient; discovery reads local step.toml files only.

List Eval-Category Steps

uv run --no-sync nemotron steps list --category eval --json

The response includes eval/model_eval, the step that wraps NeMo Evaluator Launcher.
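
To check for the step from a script rather than by eye, the JSON output can be filtered in a few lines. This is a minimal sketch: the payload shape (a list of objects with an `id` key) and the second step name are assumptions for illustration, since the CLI's JSON schema is not shown in this guide. Adjust the key names to whatever the command actually prints.

```python
import json

# Hypothetical sample of what `nemotron steps list --category eval --json`
# might print. The "id" and "category" keys and the "eval/other_step" entry
# are assumptions; only "eval/model_eval" comes from this guide.
SAMPLE_OUTPUT = """
[
  {"id": "eval/model_eval", "category": "eval"},
  {"id": "eval/other_step", "category": "eval"}
]
"""


def has_step(raw_json: str, step_id: str) -> bool:
    """Return True if step_id appears among the listed steps."""
    steps = json.loads(raw_json)
    return any(step.get("id") == step_id for step in steps)


print(has_step(SAMPLE_OUTPUT, "eval/model_eval"))
```

In practice, the string passed to `has_step` would be the captured standard output of the list command rather than an inline sample.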

Inspect The Step Contract

uv run --no-sync nemotron steps show eval/model_eval --json

The response contains the fields declared in src/nemotron/steps/eval/model_eval/step.toml.

Field	What It Tells You
`consumes`	Optional input artifact type. This step accepts `checkpoint_megatron`.
`produces`	Output artifact type. This step produces `eval_results`.
`parameters`	Documented knobs such as `target.api_endpoint.*`, `deployment.checkpoint_path`, `task_filters`, and launcher params.
`strategies`	Rules for hosted smoke tests, checkpoint evaluation, endpoint/task pairing, and task-name selection.
`errors`	Named failure modes and recovery guidance.
`reference`	Upstream NeMo Evaluator Launcher references.
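
The `consumes` and `produces` fields are the quickest way to see where the step fits in a pipeline. The sketch below pulls the artifact type names out of a contract. The artifact names come from the table above, but the JSON layout (lists of objects with a `type` key, plus a `required` flag) is an assumption, not the documented schema.

```python
import json

# Hypothetical excerpt of `nemotron steps show eval/model_eval --json`.
# The artifact type names come from the table above; the surrounding
# layout is assumed for illustration.
SAMPLE_CONTRACT = """
{
  "consumes": [{"type": "checkpoint_megatron", "required": false}],
  "produces": [{"type": "eval_results"}]
}
"""


def artifact_types(contract: dict, field: str) -> list[str]:
    """Collect the artifact type names declared under `field`."""
    return [entry["type"] for entry in contract.get(field, [])]


contract = json.loads(SAMPLE_CONTRACT)
consumes = artifact_types(contract, "consumes")
produces = artifact_types(contract, "produces")
print("consumes:", consumes)
print("produces:", produces)
```

A missing field yields an empty list, which matches the idea that the input artifact is optional.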

Read The Sample Files

The step provides two config files under src/nemotron/steps/eval/model_eval/config/: tiny_chat.yaml and default.yaml. Each file is reproduced here, followed by a short description.

# Tiny hosted chat endpoint smoke-test config.
#
# Export endpoint settings before running:
#   export NEMO_EVALUATOR_MODEL_ID=<exact model id>
#   export NEMO_EVALUATOR_MODEL_URL=<OpenAI-compatible chat completions endpoint URL>
#   export NEMO_EVALUATOR_API_KEY_NAME=NVIDIA_API_KEY
#   export NEMO_EVALUATOR_ENDPOINT_TYPE=chat

dry_run: false
output_dir: ./results-tiny-chat
task_filters: null

execution:
  type: local
  mode: sequential
  output_dir: ${output_dir}

deployment:
  type: none

target:
  api_endpoint:
    model_id: ${oc.env:NEMO_EVALUATOR_MODEL_ID,''}
    url: ${oc.env:NEMO_EVALUATOR_MODEL_URL,''}
    api_key_name: ${oc.env:NEMO_EVALUATOR_API_KEY_NAME,NVIDIA_API_KEY}
    type: ${oc.env:NEMO_EVALUATOR_ENDPOINT_TYPE,chat}

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0
        top_p: 1.0
        max_new_tokens: 1024
        max_retries: 5
        parallelism: 1
        request_timeout: 3600
        limit_samples: 1
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 5
          use_request_logging: true
          max_logged_requests: 5
  tasks:
    - name: mmlu_instruct

tiny_chat.yaml is the hosted chat smoke-test config. It sets deployment.type: none, reads target.api_endpoint.* from environment variables, and runs mmlu_instruct with limit_samples: 1.
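
Because the config falls back to an empty string for the model id and URL, a forgotten export may only surface once the run starts. A small pre-flight check can catch that earlier. The sketch below mirrors the `${oc.env:...}` fallbacks from tiny_chat.yaml; treating an empty model id or URL as an error is an assumption of this sketch, not documented launcher behavior, and `resolve_endpoint` is a hypothetical helper.

```python
from typing import Mapping

# Variable names and fallbacks mirror the ${oc.env:...} expressions in
# tiny_chat.yaml.
DEFAULTS = {
    "NEMO_EVALUATOR_MODEL_ID": "",
    "NEMO_EVALUATOR_MODEL_URL": "",
    "NEMO_EVALUATOR_API_KEY_NAME": "NVIDIA_API_KEY",
    "NEMO_EVALUATOR_ENDPOINT_TYPE": "chat",
}
# Assumption: these two have no usable default, so an empty value is an error.
REQUIRED = ("NEMO_EVALUATOR_MODEL_ID", "NEMO_EVALUATOR_MODEL_URL")


def resolve_endpoint(env: Mapping[str, str]) -> dict:
    """Apply the config's fallbacks, then fail on empty required values."""
    resolved = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    missing = [name for name in REQUIRED if not resolved[name]]
    if missing:
        raise ValueError("unset endpoint variables: " + ", ".join(missing))
    return resolved


# Placeholder values stand in for a real endpoint.
example_env = {
    "NEMO_EVALUATOR_MODEL_ID": "<exact model id>",
    "NEMO_EVALUATOR_MODEL_URL": "<endpoint URL>",
}
print(resolve_endpoint(example_env))
```

Passing `os.environ` instead of `example_env` checks the current shell's exports.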

# Standard NeMo Evaluator Launcher config for Megatron checkpoint evaluation.
#
# This mirrors the Nano3/Super3 eval shape: the `run` section is used by
# Nemotron for env/profile/artifact interpolation, then removed before handing
# the config to NeMo Evaluator Launcher.

dry_run: false
output_dir: ./results

run:
  # Use a concrete Megatron Bridge iter_* checkpoint via
  # `deployment.checkpoint_path=...`, or keep this as a W&B artifact reference
  # consumed by `${art:model,path}`.
  model: model:latest
  env:
    executor: local
    container_image: nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
    host: ${oc.env:HOSTNAME,localhost}
    user: ${oc.env:USER,''}
    account: null
    partition: null
    remote_job_dir: ${oc.env:PWD}/.nemotron
    time: "04:00:00"
  wandb:
    entity: null
    project: null

execution:
  type: ${run.env.executor}
  hostname: ${run.env.host}
  username: ${run.env.user}
  account: ${run.env.account}
  partition: ${run.env.partition}
  output_dir: ${output_dir}
  walltime: ${run.env.time}
  num_nodes: ${oc.select:run.env.nodes,1}
  deployment:
    n_tasks: ${execution.num_nodes}
  auto_export:
    destinations:
      - wandb
  env_vars:
    deployment:
      HF_HOME: ${run.env.remote_job_dir}/hf
      HF_TOKEN: HF_TOKEN
      NIM_CACHE_PATH: ${run.env.remote_job_dir}/nim
      VLLM_CACHE_ROOT: ${run.env.remote_job_dir}/vllm
    evaluation:
      HF_HOME: ${run.env.remote_job_dir}/hf
      HF_TOKEN: HF_TOKEN
  mounts:
    deployment: {}
    evaluation: {}
    mount_home: false

deployment:
  type: generic
  image: ${run.env.container_image}
  checkpoint_path: ${art:model,path}
  multiple_instances: false
  port: 1235
  served_model_name: nemo-model
  health_check_path: /v1/health
  command: >-
    bash -c 'export TRITON_CACHE_DIR=/tmp/triton_cache;
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py
    --megatron_checkpoint /checkpoint/
    --num_gpus ${oc.select:run.env.gpus_per_node,1}
    --tensor_model_parallel_size 1
    --expert_model_parallel_size 1
    --port 1235
    --num_replicas 1'
  endpoints:
    chat: /v1/chat/completions/
    completions: /v1/completions/
    health: /v1/health

evaluation:
  nemo_evaluator_config:
    config:
      params:
        max_retries: 5
        parallelism: 4
        request_timeout: 6000
        limit_samples: null
        extra:
          tokenizer: ${deployment.checkpoint_path}/tokenizer
          tokenizer_backend: huggingface
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 10
          use_request_logging: true
          max_logged_requests: 10
  tasks:
    - name: adlr_mmlu
      nemo_evaluator_config:
        config:
          params:
            top_p: 0.0
    - name: hellaswag

export:
  wandb:
    entity: ${run.wandb.entity}
    project: ${run.wandb.project}

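default.yaml is the Megatron Bridge checkpoint evaluation config. It has NeMo Evaluator Launcher deploy the checkpoint (deployment.type: generic) and evaluates every entry listed under evaluation.tasks, which in this sample are adlr_mmlu and hellaswag.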
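
The header comment in default.yaml says the `run` section feeds interpolation and is removed before the config reaches NeMo Evaluator Launcher. The sketch below illustrates that two-phase idea with plain dictionaries. It is a simplified stand-in, not Nemotron's implementation and not OmegaConf: `resolve` and `prepare_for_launcher` are hypothetical helpers that handle only plain `${a.b.c}` references and leave resolver expressions such as `${oc.env:...}` or `${art:...}` untouched.

```python
import re

# Matches plain dotted references like ${run.env.executor}. Resolver
# expressions containing ":" (for example ${oc.env:USER,''}) do not match.
_REF = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def lookup(config: dict, dotted: str):
    """Walk a dotted path such as 'run.env.time' through nested dicts."""
    node = config
    for key in dotted.split("."):
        node = node[key]
    return node


def resolve(node, root: dict):
    """Recursively replace ${a.b.c} references with values from root."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(value, root) for value in node]
    if isinstance(node, str):
        whole = _REF.fullmatch(node)
        if whole:  # a lone reference keeps the referenced value's type
            return resolve(lookup(root, whole.group(1)), root)
        return _REF.sub(lambda m: str(resolve(lookup(root, m.group(1)), root)), node)
    return node


def prepare_for_launcher(config: dict) -> dict:
    """Resolve references first, then drop the Nemotron-only `run` section."""
    resolved = resolve(config, config)
    resolved.pop("run", None)
    return resolved


# A trimmed-down version of default.yaml, expressed as a dict.
config = {
    "output_dir": "./results",
    "run": {"env": {"executor": "local", "time": "04:00:00", "account": None}},
    "execution": {
        "type": "${run.env.executor}",
        "walltime": "${run.env.time}",
        "account": "${run.env.account}",
        "output_dir": "${output_dir}",
    },
}
print(prepare_for_launcher(config))
```

The order matters: references into `run` have to be resolved before the section is removed, otherwise values such as `execution.type` would point at nothing.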

Decide Whether It Applies

eval/model_eval applies when all of the following statements are true.

The model is already available as an OpenAI-compatible endpoint, or NeMo Evaluator Launcher can deploy the checkpoint from the selected config.
The tasks you need are implemented by the installed NeMo Evaluator Launcher stack.
The endpoint type matches the selected task family.

eval/model_eval is not the right step when the evaluation needs a custom scorer that NeMo Evaluator Launcher does not implement. Write a dedicated evaluation step in that case, modeled on the contract layout under src/nemotron/steps/.
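
The decision rules in this section can be written down as a small checklist. This is only a sketch of the reasoning: `EvalPlan` and `model_eval_applies` are hypothetical names, and each flag has to be answered by inspecting your own endpoint, config, and installed launcher stack.

```python
from dataclasses import dataclass


@dataclass
class EvalPlan:
    """Answers to the applicability questions (all fields are hypothetical)."""

    has_openai_endpoint: bool      # model already served behind an OpenAI-compatible API
    launcher_can_deploy: bool      # launcher can deploy the checkpoint from the config
    tasks_supported: bool          # needed tasks exist in the installed launcher stack
    endpoint_matches_tasks: bool   # endpoint type fits the selected task family
    needs_custom_scorer: bool = False  # scorer the launcher does not implement


def model_eval_applies(plan: EvalPlan) -> bool:
    """Return True when eval/model_eval fits the plan."""
    if plan.needs_custom_scorer:
        return False  # write a dedicated evaluation step instead
    model_reachable = plan.has_openai_endpoint or plan.launcher_can_deploy
    return model_reachable and plan.tasks_supported and plan.endpoint_matches_tasks


hosted_smoke_test = EvalPlan(
    has_openai_endpoint=True,
    launcher_can_deploy=False,
    tasks_supported=True,
    endpoint_matches_tasks=True,
)
print(model_eval_applies(hosted_smoke_test))
```

The first condition is an either/or: a hosted endpoint or a deployable checkpoint is enough, while the remaining conditions must all hold.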