Configuration Reference

This page documents the YAML schema consumed by `nemotron steps run eval/model_eval`. The step is a thin wrapper around NeMo Evaluator Launcher: it loads a YAML config, applies Hydra-style overrides, removes Nemotron-only keys, saves the resulting launcher config, and calls `nemo_evaluator_launcher.api.functional.run_eval`.
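
The override and key-removal steps described above can be pictured with plain dictionaries. This is a minimal sketch, not the step's actual implementation: `apply_override` and `build_launcher_config` are hypothetical helpers, override values are kept as strings (real Hydra-style parsing also handles types and lists), and only the `run` key is treated as Nemotron-only here.

```python
import copy


def apply_override(config: dict, override: str) -> None:
    """Apply one `dotted.key=value` override to a nested dict in place."""
    dotted_key, _, raw_value = override.partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(part, {})
    node[leaf] = raw_value  # values stay strings in this sketch


def build_launcher_config(config: dict, overrides: list, nemotron_only: set) -> dict:
    """Return a copy with overrides applied and Nemotron-only keys removed."""
    merged = copy.deepcopy(config)  # leave the caller's config untouched
    for override in overrides:
        apply_override(merged, override)
    return {key: value for key, value in merged.items() if key not in nemotron_only}


config = {
    "run": {"model": "model:latest"},
    "deployment": {"type": "generic", "checkpoint_path": "${art:model,path}"},
}
launcher_config = build_launcher_config(
    config,
    overrides=["deployment.checkpoint_path=/path/to/iter_0001000"],
    nemotron_only={"run"},  # assumption: only `run` is stripped in this sketch
)
print(launcher_config)
```

The sketch stops before the save and `run_eval` steps, since their exact arguments are not documented on this page.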

Sample Configs

Config	Purpose
`tiny_chat.yaml`	Hosted chat endpoint smoke test. Uses `deployment.type: none`, `target.api_endpoint.*`, and one configured task, `mmlu_instruct`.
`default.yaml`	Megatron Bridge checkpoint evaluation through NeMo Evaluator Launcher. Uses launcher-managed `execution`, `deployment`, and `evaluation` sections, with tasks listed under `evaluation.tasks`.

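Top-Level Keys

The two sample configs are shown in full below, `tiny_chat.yaml` first and `default.yaml` second. The table after them summarizes each top-level key.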

# Tiny hosted chat endpoint smoke-test config.
#
# Export endpoint settings before running:
#   export NEMO_EVALUATOR_MODEL_ID=<exact model id>
#   export NEMO_EVALUATOR_MODEL_URL=<OpenAI-compatible chat completions endpoint URL>
#   export NEMO_EVALUATOR_API_KEY_NAME=NVIDIA_API_KEY
#   export NEMO_EVALUATOR_ENDPOINT_TYPE=chat

dry_run: false
output_dir: ./results-tiny-chat
task_filters: null

execution:
  type: local
  mode: sequential
  output_dir: ${output_dir}

deployment:
  type: none

target:
  api_endpoint:
    model_id: ${oc.env:NEMO_EVALUATOR_MODEL_ID,''}
    url: ${oc.env:NEMO_EVALUATOR_MODEL_URL,''}
    api_key_name: ${oc.env:NEMO_EVALUATOR_API_KEY_NAME,NVIDIA_API_KEY}
    type: ${oc.env:NEMO_EVALUATOR_ENDPOINT_TYPE,chat}

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0
        top_p: 1.0
        max_new_tokens: 1024
        max_retries: 5
        parallelism: 1
        request_timeout: 3600
        limit_samples: 1
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 5
          use_request_logging: true
          max_logged_requests: 5
  tasks:
    - name: mmlu_instruct

# Standard NeMo Evaluator Launcher config for Megatron checkpoint evaluation.
#
# This mirrors the Nano3/Super3 eval shape: the `run` section is used by
# Nemotron for env/profile/artifact interpolation, then removed before handing
# the config to NeMo Evaluator Launcher.

dry_run: false
output_dir: ./results

run:
  # Use a concrete Megatron Bridge iter_* checkpoint via
  # `deployment.checkpoint_path=...`, or keep this as a W&B artifact reference
  # consumed by `${art:model,path}`.
  model: model:latest
  env:
    executor: local
    container_image: nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
    host: ${oc.env:HOSTNAME,localhost}
    user: ${oc.env:USER,''}
    account: null
    partition: null
    remote_job_dir: ${oc.env:PWD}/.nemotron
    time: "04:00:00"
  wandb:
    entity: null
    project: null

execution:
  type: ${run.env.executor}
  hostname: ${run.env.host}
  username: ${run.env.user}
  account: ${run.env.account}
  partition: ${run.env.partition}
  output_dir: ${output_dir}
  walltime: ${run.env.time}
  num_nodes: ${oc.select:run.env.nodes,1}
  deployment:
    n_tasks: ${execution.num_nodes}
  auto_export:
    destinations:
      - wandb
  env_vars:
    deployment:
      HF_HOME: ${run.env.remote_job_dir}/hf
      HF_TOKEN: HF_TOKEN
      NIM_CACHE_PATH: ${run.env.remote_job_dir}/nim
      VLLM_CACHE_ROOT: ${run.env.remote_job_dir}/vllm
    evaluation:
      HF_HOME: ${run.env.remote_job_dir}/hf
      HF_TOKEN: HF_TOKEN
  mounts:
    deployment: {}
    evaluation: {}
    mount_home: false

deployment:
  type: generic
  image: ${run.env.container_image}
  checkpoint_path: ${art:model,path}
  multiple_instances: false
  port: 1235
  served_model_name: nemo-model
  health_check_path: /v1/health
  command: >-
    bash -c 'export TRITON_CACHE_DIR=/tmp/triton_cache;
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py
    --megatron_checkpoint /checkpoint/
    --num_gpus ${oc.select:run.env.gpus_per_node,1}
    --tensor_model_parallel_size 1
    --expert_model_parallel_size 1
    --port 1235
    --num_replicas 1'
  endpoints:
    chat: /v1/chat/completions/
    completions: /v1/completions/
    health: /v1/health

evaluation:
  nemo_evaluator_config:
    config:
      params:
        max_retries: 5
        parallelism: 4
        request_timeout: 6000
        limit_samples: null
        extra:
          tokenizer: ${deployment.checkpoint_path}/tokenizer
          tokenizer_backend: huggingface
    target:
      api_endpoint:
        adapter_config:
          output_dir: /results
          use_progress_tracking: false
          use_caching: true
          caching_dir: /results/cache
          use_response_logging: true
          max_logged_responses: 10
          use_request_logging: true
          max_logged_requests: 10
  tasks:
    - name: adlr_mmlu
      nemo_evaluator_config:
        config:
          params:
            top_p: 0.0
    - name: hellaswag

export:
  wandb:
    entity: ${run.wandb.entity}
    project: ${run.wandb.project}

Key	Used By	Purpose
`dry_run`	Nemotron runtime	Passed to NeMo Evaluator Launcher as `run_eval(..., dry_run=...)`.
`output_dir`	Nemotron runtime	Copied into `execution.output_dir` before launcher dispatch.
`task_filters`	Nemotron runtime	Optional task-name subset passed to launcher.
`run`	Nemotron runtime	Nemotron-side artifact, environment, and W&B interpolation. Removed before launcher dispatch.
`execution`	NeMo Evaluator Launcher	Where and how launcher execution runs.
`deployment`	NeMo Evaluator Launcher	How the evaluated model is deployed, or `type: none` for an existing endpoint.
`target`	NeMo Evaluator Launcher	Existing API endpoint metadata for hosted evaluation.
`evaluation`	NeMo Evaluator Launcher	Evaluator config, generation params, logging, caching, and adapter settings.
`evaluation.tasks`	NeMo Evaluator Launcher	Exact task entries to run, nested under `evaluation`. Each entry has a `name` and may carry its own `nemo_evaluator_config` overrides.
`export`	NeMo Evaluator Launcher	Optional export settings, such as W&B export.
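
As a rough illustration of the ownership split in the table, the sketch below sorts a config's top-level keys into the two groups and flags anything it does not recognize, which can catch a misspelled section name early. The key sets are copied from the sample configs on this page; `split_by_owner` is a hypothetical helper, not part of Nemotron or the launcher.

```python
# Top-level keys as they appear in the sample configs on this page.
NEMOTRON_KEYS = {"dry_run", "output_dir", "task_filters", "run"}
LAUNCHER_KEYS = {"execution", "deployment", "target", "evaluation", "export"}


def split_by_owner(config: dict):
    """Group top-level keys by consumer and report unrecognized ones."""
    runtime = {k: v for k, v in config.items() if k in NEMOTRON_KEYS}
    launcher = {k: v for k, v in config.items() if k in LAUNCHER_KEYS}
    unknown = sorted(set(config) - NEMOTRON_KEYS - LAUNCHER_KEYS)
    return runtime, launcher, unknown


sample = {
    "dry_run": False,
    "output_dir": "./results-tiny-chat",
    "deployment": {"type": "none"},
    "evalution": {},  # deliberate typo to show the unknown-key report
}
runtime, launcher, unknown = split_by_owner(sample)
print(sorted(runtime), sorted(launcher), unknown)
```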

Hosted Endpoint Fields

Use these fields with `tiny_chat.yaml` or any config that sets `deployment.type: none`.

Field	Purpose
`target.api_endpoint.model_id`	Exact model id advertised by the endpoint.
`target.api_endpoint.url`	Full OpenAI-compatible endpoint URL, including `/v1/chat/completions` or `/v1/completions`.
`target.api_endpoint.api_key_name`	Environment variable name that holds the bearer token. Never put the secret value in config.
`target.api_endpoint.type`	Endpoint type, usually `chat` for hosted chat smoke tests.

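The `tiny_chat.yaml` file reads these values from `NEMO_EVALUATOR_MODEL_ID`, `NEMO_EVALUATOR_MODEL_URL`, `NEMO_EVALUATOR_API_KEY_NAME`, and `NEMO_EVALUATOR_ENDPOINT_TYPE`. The first two default to an empty string when unset; the last two default to `NVIDIA_API_KEY` and `chat`.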
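
The environment lookups in `tiny_chat.yaml` can be mirrored in a few lines to check, before a run, which endpoint values would end up empty. This is a sketch under stated assumptions: `resolve_endpoint` and `missing_fields` are hypothetical helpers, the variable names and defaults are copied from the sample config, and the model id and URL in the example are placeholders.

```python
import os

# field -> (environment variable, default), copied from tiny_chat.yaml
ENDPOINT_ENV = {
    "model_id": ("NEMO_EVALUATOR_MODEL_ID", ""),
    "url": ("NEMO_EVALUATOR_MODEL_URL", ""),
    "api_key_name": ("NEMO_EVALUATOR_API_KEY_NAME", "NVIDIA_API_KEY"),
    "type": ("NEMO_EVALUATOR_ENDPOINT_TYPE", "chat"),
}


def resolve_endpoint(environ=None) -> dict:
    """Resolve each endpoint field from the environment, else its default."""
    env = os.environ if environ is None else environ
    return {field: env.get(var, default) for field, (var, default) in ENDPOINT_ENV.items()}


def missing_fields(endpoint: dict) -> list:
    """List the fields that have no usable default and are still empty."""
    return [field for field in ("model_id", "url") if not endpoint[field]]


# Placeholder values for illustration only.
example_env = {
    "NEMO_EVALUATOR_MODEL_ID": "example/model",
    "NEMO_EVALUATOR_MODEL_URL": "https://example.invalid/v1/chat/completions",
}
endpoint = resolve_endpoint(example_env)
print(endpoint)
print(missing_fields(resolve_endpoint({})))
```

Note that `api_key_name` resolves to the name of an environment variable, never to the secret itself.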

Evaluation Params

Generation and evaluator controls live under:

evaluation.nemo_evaluator_config.config.params

Common fields are:

Field	Purpose
`temperature`	Sampling temperature for generation tasks.
`top_p`	Nucleus (top-p) sampling threshold.
`max_new_tokens`	Maximum generated tokens for chat/instruction tasks.
`max_retries`	Request retry count.
`parallelism`	Request concurrency where supported.
`request_timeout`	Per-request timeout in seconds.
`limit_samples`	Optional per-task sample cap. Use `1` for smoke tests.
`extra.tokenizer`	Tokenizer path or Hugging Face id required by log-probability tasks.
`extra.tokenizer_backend`	Tokenizer backend, usually `huggingface`.
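
In `default.yaml`, shared params sit under `evaluation.nemo_evaluator_config` while the `adlr_mmlu` task entry carries its own `top_p`. One plausible reading is that task-level values are layered over the shared ones. The sketch below shows that layering with a recursive merge; the precedence rule is an assumption about the launcher, and `deep_merge` is a hypothetical helper.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override layered on top, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared params and the per-task override, as written in default.yaml.
global_params = {
    "max_retries": 5,
    "parallelism": 4,
    "request_timeout": 6000,
    "limit_samples": None,
}
task_params = {"top_p": 0.0}

effective = deep_merge(global_params, task_params)
print(effective)
```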

Tasks

Tasks are NeMo Evaluator Launcher task entries. Use the exact task IDs reported by the installed launcher, which you can look up with:

nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher ls task mmlu_instruct

The sample configs start from these tasks:

Config	Tasks
`tiny_chat.yaml`	`mmlu_instruct`
`default.yaml`	`adlr_mmlu`, `hellaswag`

Do not prepend a harness name unless the launcher lists that exact dotted task id.
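
Because task names must match exactly, a `task_filters` subset is only useful if every name in it appears verbatim in the configured tasks. The sketch below shows one way to check that locally. It assumes exact, case-sensitive matching, which is an assumption about how the launcher applies the filter, and `select_tasks` is a hypothetical helper.

```python
def select_tasks(tasks: list, task_filters=None) -> list:
    """Keep only task entries whose name is listed in task_filters."""
    if task_filters is None:
        return list(tasks)  # no filter: run everything configured
    known = {task["name"] for task in tasks}
    unknown = [name for name in task_filters if name not in known]
    if unknown:
        raise ValueError(f"task_filters names not in config: {unknown}")
    wanted = set(task_filters)
    return [task for task in tasks if task["name"] in wanted]


# Task entries as listed in default.yaml.
tasks = [{"name": "adlr_mmlu"}, {"name": "hellaswag"}]
print(select_tasks(tasks, ["hellaswag"]))
```

A near miss such as `mmlu` raises `ValueError` here instead of silently selecting nothing.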

Checkpoint Deployment Fields

The `default.yaml` config uses launcher-managed deployment for a Megatron Bridge checkpoint. The most common override is:

deployment.checkpoint_path=/path/to/iter_0001000

Use the concrete `iter_*` checkpoint directory, not just the parent training output directory. For log-probability tasks, keep the tokenizer aligned with the deployed checkpoint through `evaluation.nemo_evaluator_config.config.params.extra.tokenizer`.
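
A quick local check can catch the common mistake of passing the parent output directory. The sketch below assumes checkpoint directories are named `iter_` followed by digits, as in the `iter_0001000` example. That pattern is an assumption, and `looks_like_iter_checkpoint` is a hypothetical helper that only inspects the path string, not the filesystem.

```python
import re
from pathlib import PurePosixPath

# Assumption: checkpoint directories are named iter_<digits>.
ITER_DIR = re.compile(r"iter_\d+")


def looks_like_iter_checkpoint(path: str) -> bool:
    """True when the final path component is named like iter_0001000."""
    return ITER_DIR.fullmatch(PurePosixPath(path).name) is not None


print(looks_like_iter_checkpoint("/path/to/iter_0001000"))
print(looks_like_iter_checkpoint("/path/to"))
```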

Validation Behavior

Nemotron does not implement a separate benchmark loop for this step. It validates only enough to build the launcher config and import NeMo Evaluator Launcher. Endpoint checks, task validation, result writing, and launcher invocation state are owned by NeMo Evaluator Launcher.
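
The kind of lightweight pre-flight check described above might look like the following sketch. It is not Nemotron's actual validation: `launcher_available` and `missing_sections` are hypothetical helpers, and the list of required sections is an assumption based on the sample configs.

```python
import importlib.util


def launcher_available(module_name: str = "nemo_evaluator_launcher") -> bool:
    """Report whether the top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_sections(config: dict, required=("execution", "deployment", "evaluation")) -> list:
    """List the launcher sections absent from the config (assumed required set)."""
    return [section for section in required if section not in config]


config = {"execution": {"type": "local"}, "deployment": {"type": "none"}}
print(missing_sections(config))
print(launcher_available())  # depends on the local environment
```

Everything beyond this point, including endpoint and task checks, stays with NeMo Evaluator Launcher.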