Configuration Reference

This page documents the YAML schema consumed by nemotron steps run eval/model_eval. The step is a thin wrapper around NeMo Evaluator Launcher: it loads a YAML config, applies Hydra-style overrides, removes Nemotron-only keys, saves a launcher config, and calls nemo_evaluator_launcher.api.functional.run_eval.
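The flow described above can be sketched with plain dictionaries. This is a minimal illustration, not Nemotron's actual implementation. `apply_override`, `build_launcher_config`, and `NEMOTRON_ONLY_KEYS` are hypothetical names, and the key list is an assumption taken from the keys this page marks as "Nemotron runtime". Override values are kept as strings here, whereas Hydra parses them into typed values.

```python
from copy import deepcopy

# Assumption: the keys this page attributes to the Nemotron runtime.
NEMOTRON_ONLY_KEYS = ("dry_run", "output_dir", "task_filters", "run")


def apply_override(config: dict, override: str) -> None:
    """Apply one `dotted.key=value` override in place (the value stays a string)."""
    dotted_key, _, value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def build_launcher_config(config: dict, overrides: list[str]) -> tuple[dict, dict]:
    """Return (launcher_config, nemotron_settings) without mutating the input."""
    merged = deepcopy(config)
    for override in overrides:
        apply_override(merged, override)
    # Split off the Nemotron-only keys so the launcher never sees them.
    nemotron = {key: merged.pop(key) for key in NEMOTRON_ONLY_KEYS if key in merged}
    # output_dir is copied into execution.output_dir before dispatch.
    if "output_dir" in nemotron:
        merged.setdefault("execution", {})["output_dir"] = nemotron["output_dir"]
    return merged, nemotron


config = {
    "dry_run": False,
    "output_dir": "./results",
    "run": {"model": "model:latest"},
    "execution": {"type": "local"},
    "deployment": {"type": "generic"},
}
launcher_config, nemotron = build_launcher_config(
    config, ["deployment.checkpoint_path=/path/to/iter_0001000"]
)
```

In this sketch `launcher_config` keeps only `execution` and `deployment`, while `nemotron["dry_run"]` would be forwarded separately as the `dry_run` argument of `run_eval`.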

Sample Configs

| Config | Purpose |
| --- | --- |
| tiny_chat.yaml | Hosted chat endpoint smoke test. Uses deployment.type: none, target.api_endpoint.*, and one configured task, mmlu_instruct. |
| default.yaml | Megatron Bridge checkpoint evaluation through NeMo Evaluator Launcher. Uses launcher-managed execution, deployment, evaluation, and tasks sections. |

Top-Level Keys

# Tiny hosted chat endpoint smoke-test config.
#
# Export endpoint settings before running:
#   export NEMO_EVALUATOR_MODEL_ID=<exact model id>
#   export NEMO_EVALUATOR_MODEL_URL=<OpenAI-compatible chat completions endpoint URL>
#   export NEMO_EVALUATOR_API_KEY_NAME=NVIDIA_API_KEY
#   export NEMO_EVALUATOR_ENDPOINT_TYPE=chat

dry_run: false
output_dir: ./results-tiny-chat
task_filters: null

execution:
  type: local
  mode: sequential
  output_dir: ${output_dir}

deployment:
  type: none

target:
  api_endpoint:
    model_id: ${oc.env:NEMO_EVALUATOR_MODEL_ID,''}
    url: ${oc.env:NEMO_EVALUATOR_MODEL_URL,''}
    api_key_name: ${oc.env:NEMO_EVALUATOR_API_KEY_NAME,NVIDIA_API_KEY}
    type: ${oc.env:NEMO_EVALUATOR_ENDPOINT_TYPE,chat}

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0
        top_p: 1.0
        max_new_tokens: 1024
        max_retries: 5
        parallelism: 1
        request_timeout: 3600
        limit_samples: 1
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 5
          use_request_logging: true
          max_logged_requests: 5
  tasks:
    - name: mmlu_instruct

# Standard NeMo Evaluator Launcher config for Megatron checkpoint evaluation.
#
# This mirrors the Nano3/Super3 eval shape: the `run` section is used by
# Nemotron for env/profile/artifact interpolation, then removed before handing
# the config to NeMo Evaluator Launcher.

dry_run: false
output_dir: ./results

run:
  # Use a concrete Megatron Bridge iter_* checkpoint via
  # `deployment.checkpoint_path=...`, or keep this as a W&B artifact reference
  # consumed by `${art:model,path}`.
  model: model:latest
  env:
    executor: local
    container_image: nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
    host: ${oc.env:HOSTNAME,localhost}
    user: ${oc.env:USER,''}
    account: null
    partition: null
    remote_job_dir: ${oc.env:PWD}/.nemotron
    time: "04:00:00"
  wandb:
    entity: null
    project: null

execution:
  type: ${run.env.executor}
  hostname: ${run.env.host}
  username: ${run.env.user}
  account: ${run.env.account}
  partition: ${run.env.partition}
  output_dir: ${output_dir}
  walltime: ${run.env.time}
  num_nodes: ${oc.select:run.env.nodes,1}
  deployment:
    n_tasks: ${execution.num_nodes}
  auto_export:
    destinations:
      - wandb
  env_vars:
    deployment:
      HF_HOME: ${run.env.remote_job_dir}/hf
      HF_TOKEN: HF_TOKEN
      NIM_CACHE_PATH: ${run.env.remote_job_dir}/nim
      VLLM_CACHE_ROOT: ${run.env.remote_job_dir}/vllm
    evaluation:
      HF_HOME: ${run.env.remote_job_dir}/hf
      HF_TOKEN: HF_TOKEN
  mounts:
    deployment: {}
    evaluation: {}
    mount_home: false

deployment:
  type: generic
  image: ${run.env.container_image}
  checkpoint_path: ${art:model,path}
  multiple_instances: false
  port: 1235
  served_model_name: nemo-model
  health_check_path: /v1/health
  command: >-
    bash -c 'export TRITON_CACHE_DIR=/tmp/triton_cache;
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py
    --megatron_checkpoint /checkpoint/
    --num_gpus ${oc.select:run.env.gpus_per_node,1}
    --tensor_model_parallel_size 1
    --expert_model_parallel_size 1
    --port 1235
    --num_replicas 1'
  endpoints:
    chat: /v1/chat/completions/
    completions: /v1/completions/
    health: /v1/health

evaluation:
  nemo_evaluator_config:
    config:
      params:
        max_retries: 5
        parallelism: 4
        request_timeout: 6000
        limit_samples: null
        extra:
          tokenizer: ${deployment.checkpoint_path}/tokenizer
          tokenizer_backend: huggingface
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 10
          use_request_logging: true
          max_logged_requests: 10
  tasks:
    - name: adlr_mmlu
      nemo_evaluator_config:
        config:
          params:
            top_p: 0.0
    - name: hellaswag

export:
  wandb:
    entity: ${run.wandb.entity}
    project: ${run.wandb.project}

| Key | Used By | Purpose |
| --- | --- | --- |
| dry_run | Nemotron runtime | Passed to NeMo Evaluator Launcher as run_eval(..., dry_run=...). |
| output_dir | Nemotron runtime | Copied into execution.output_dir before launcher dispatch. |
| task_filters | Nemotron runtime | Optional task-name subset passed to the launcher. |
| run | Nemotron runtime | Nemotron-side artifact, environment, and W&B interpolation. Removed before launcher dispatch. |
| execution | NeMo Evaluator Launcher | Where and how the launcher runs the evaluation. |
| deployment | NeMo Evaluator Launcher | How the evaluated model is deployed, or type: none for an existing endpoint. |
| target | NeMo Evaluator Launcher | Existing API endpoint metadata for hosted evaluation. |
| evaluation | NeMo Evaluator Launcher | Evaluator config, generation params, logging, caching, and adapter settings. |
| tasks | NeMo Evaluator Launcher | Exact task entries to run, each with a name. Nested under evaluation in both sample configs. |
| export | NeMo Evaluator Launcher | Optional export settings, such as W&B export. |

Hosted Endpoint Fields

Use these fields with tiny_chat.yaml or any config that sets deployment.type: none.

| Field | Purpose |
| --- | --- |
| target.api_endpoint.model_id | Exact model id advertised by the endpoint. |
| target.api_endpoint.url | Full OpenAI-compatible endpoint URL, including /v1/chat/completions or /v1/completions. |
| target.api_endpoint.api_key_name | Name of the environment variable that holds the bearer token. Never put the secret value in the config. |
| target.api_endpoint.type | Endpoint type, usually chat for hosted chat smoke tests. |

The tiny_chat.yaml file reads these values from NEMO_EVALUATOR_MODEL_ID, NEMO_EVALUATOR_MODEL_URL, NEMO_EVALUATOR_API_KEY_NAME, and NEMO_EVALUATOR_ENDPOINT_TYPE.
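A pre-flight check along the following lines can catch an unset variable before the launcher is invoked. It is only a sketch: `resolve_endpoint` and `missing_fields` are hypothetical helpers that are not part of Nemotron or NeMo Evaluator Launcher, and the defaults simply mirror the `oc.env` interpolations in tiny_chat.yaml.

```python
from collections.abc import Mapping

# Variable names and defaults mirror the oc.env interpolations in tiny_chat.yaml.
ENDPOINT_ENV = {
    "model_id": ("NEMO_EVALUATOR_MODEL_ID", ""),
    "url": ("NEMO_EVALUATOR_MODEL_URL", ""),
    "api_key_name": ("NEMO_EVALUATOR_API_KEY_NAME", "NVIDIA_API_KEY"),
    "type": ("NEMO_EVALUATOR_ENDPOINT_TYPE", "chat"),
}


def resolve_endpoint(env: Mapping[str, str]) -> dict[str, str]:
    """Resolve the target.api_endpoint fields from env, applying the YAML defaults."""
    return {
        field: env.get(var, default)
        for field, (var, default) in ENDPOINT_ENV.items()
    }


def missing_fields(endpoint: Mapping[str, str], env: Mapping[str, str]) -> list[str]:
    """List settings that would leave the hosted endpoint unusable."""
    problems = [field for field in ("model_id", "url") if not endpoint[field]]
    # api_key_name is the NAME of a variable; the secret lives in that variable.
    if not env.get(endpoint["api_key_name"]):
        problems.append(f"${endpoint['api_key_name']}")
    return problems


# Example environment with the URL left unset (values are placeholders).
env = {"NEMO_EVALUATOR_MODEL_ID": "example/model", "NVIDIA_API_KEY": "placeholder"}
endpoint = resolve_endpoint(env)
problems = missing_fields(endpoint, env)
```

Passing `os.environ` instead of the example dictionary would check the real shell environment. The secret itself is only tested for presence and never copied into the resolved fields.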

Evaluation Params

Generation and evaluator controls live under:

evaluation.nemo_evaluator_config.config.params

Common fields are:

| Field | Purpose |
| --- | --- |
| temperature | Sampling temperature for generation tasks. |
| top_p | Top-p nucleus sampling. |
| max_new_tokens | Maximum generated tokens for chat/instruction tasks. |
| max_retries | Request retry count. |
| parallelism | Request concurrency where supported. |
| request_timeout | Per-request timeout in seconds. |
| limit_samples | Optional per-task sample cap. Use 1 for smoke tests. |
| extra.tokenizer | Tokenizer path or Hugging Face id required by log-probability tasks. |
| extra.tokenizer_backend | Tokenizer backend, usually huggingface. |
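In default.yaml, the adlr_mmlu task carries its own `nemo_evaluator_config.config.params` block next to the global one. The sketch below shows one plausible way such a block could combine with the global params, assuming task-level values take precedence. That precedence is an assumption this page does not confirm, and `merge_params` is a hypothetical helper, not a launcher function.

```python
from copy import deepcopy


def merge_params(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_params(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Global params, abbreviated from default.yaml.
global_params = {
    "max_retries": 5,
    "parallelism": 4,
    "request_timeout": 6000,
    "limit_samples": None,
    "extra": {"tokenizer_backend": "huggingface"},
}
# Task-level params of adlr_mmlu in default.yaml.
task_params = {"top_p": 0.0}

effective = merge_params(global_params, task_params)
```

Under this assumption, adlr_mmlu would run with `top_p` set to 0.0 and every other value inherited from the global block, while hellaswag would use the global block unchanged.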

Tasks

Tasks are NeMo Evaluator Launcher task entries. Use the exact task IDs reported by the installed launcher, which you can look up with:

nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher ls task mmlu_instruct

The sample configs define these starting points.

| Config | Tasks |
| --- | --- |
| tiny_chat.yaml | mmlu_instruct |
| default.yaml | adlr_mmlu, hellaswag |

Do not prepend a harness name unless the launcher lists that exact dotted task id.
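The naming rule can be illustrated with a small check of configured names against the ids the launcher reports. This is a sketch only: `check_task_names` is a hypothetical helper, the `available` set stands in for output you would copy from the launcher's task listing, and `some_harness` is a made-up prefix.

```python
def check_task_names(configured: list[str], available: set[str]) -> dict[str, str]:
    """Map each unknown task name to a hint; an empty result means all names match."""
    hints = {}
    for name in configured:
        if name in available:
            continue
        # A dotted name is only valid when the launcher lists that exact id.
        bare = name.rsplit(".", 1)[-1]
        if bare in available:
            hints[name] = f"use '{bare}' (listed without a harness prefix)"
        else:
            hints[name] = "not listed by the launcher"
    return hints


# Placeholder for ids copied from the launcher's task listing.
available = {"mmlu_instruct", "adlr_mmlu", "hellaswag"}
hints = check_task_names(
    ["mmlu_instruct", "some_harness.hellaswag", "mmlu"], available
)
```

Here `mmlu_instruct` passes, `some_harness.hellaswag` is flagged because only the bare id is listed, and `mmlu` is flagged as unknown.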

Checkpoint Deployment Fields

The default.yaml config uses launcher-managed deployment for a Megatron Bridge checkpoint. The most common override is:

deployment.checkpoint_path=/path/to/iter_0001000

Use the concrete iter_* checkpoint directory, not just the parent training output directory. For log-probability tasks, keep the tokenizer aligned with the deployed checkpoint through evaluation.nemo_evaluator_config.config.params.extra.tokenizer.
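A quick sanity check on the override value can catch the common mistake of passing the parent directory. The helpers below are hypothetical and purely illustrative. They inspect only the path string and do not touch the filesystem, and `tokenizer_path` mirrors the `${deployment.checkpoint_path}/tokenizer` interpolation used in default.yaml.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_iter_checkpoint(path: str) -> bool:
    """True when the final path component looks like an iter_* directory."""
    return fnmatchcase(PurePosixPath(path).name, "iter_*")


def tokenizer_path(checkpoint_path: str) -> str:
    """Mirror default.yaml: ${deployment.checkpoint_path}/tokenizer."""
    return str(PurePosixPath(checkpoint_path) / "tokenizer")


checkpoint = "/path/to/iter_0001000"
ok = is_iter_checkpoint(checkpoint)          # concrete checkpoint directory
parent_ok = is_iter_checkpoint("/path/to")   # parent training output directory
tokenizer = tokenizer_path(checkpoint)
```

Deriving the tokenizer location from the same checkpoint path keeps the two aligned, which is what the interpolation in default.yaml does.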

Validation Behavior

Nemotron does not implement a separate benchmark loop for this step. It validates only enough to build the launcher config and import NeMo Evaluator Launcher. Endpoint checks, task validation, result writing, and launcher invocation state are owned by NeMo Evaluator Launcher.