CLI Reference#

This page documents the CLI surface for nemotron steps run eval/model_eval. The flags are shared by every Nemotron step. The override examples are specific to the eval/model_eval YAML schema.

Syntax#

uv run nemotron steps run eval/model_eval [FLAGS] [HYDRA_OVERRIDES...]

Run the command from the repository root after uv sync --extra evaluator. Pass the configuration name with -c, per-step overrides as key=value dotlists, and optional execution flags.
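To illustrate how the pieces of the syntax line are ordered, the following sketch prints the argument vector one item per line instead of running anything. The helper name print_eval_cmd is hypothetical and not part of the Nemotron CLI; only the command, -c, and the key=value override form come from this page.

```shell
# Hypothetical helper: print the eval/model_eval invocation, one argument per
# line, without executing it. $1 is the config name; any remaining arguments
# are key=value overrides passed through unchanged.
print_eval_cmd() {
  config="$1"
  shift
  printf '%s\n' uv run nemotron steps run eval/model_eval -c "$config" "$@"
}

# Preview a tiny_chat run with a single override.
print_eval_cmd tiny_chat output_dir=./output/eval-tiny-chat
```

Each printed line is a single argument, which makes it easy to confirm that the shell did not split or expand an override before it reached the CLI.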

Flags#

| Flag | Long form | Purpose |
| --- | --- | --- |
| `-c` | `--config` | Config name inside src/nemotron/steps/eval/model_eval/config/, such as default or tiny_chat. Also accepts a path to a YAML file. |
| `-r` | `--run` | Attached execution, using an environment profile defined in env.toml. |
| `-b` | `--batch` | Detached execution, using an environment profile defined in env.toml. |
| `-d` | `--dry-run` | Compile the Nemotron job config and exit without dispatching. |
| | `--force-squash` | Force re-squash of the container image when the selected backend builds one. |

Invoking the command without -c resolves the runspec default, default.yaml.

Common Overrides#

| Override | Purpose |
| --- | --- |
| `output_dir=<path>` | Base output directory. The runtime also writes this into execution.output_dir before calling NeMo Evaluator Launcher. |
| `dry_run=true` | Pass dry-run mode to NeMo Evaluator Launcher. This differs from the CLI --dry-run flag, which only compiles the Nemotron job. |
| `task_filters=[<task>,...]` | Optional subset of configured task names passed to NeMo Evaluator Launcher. |
| `target.api_endpoint.url=<url>` | OpenAI-compatible endpoint URL for hosted evaluation when deployment.type=none. |
| `target.api_endpoint.model_id=<id>` | Exact model id advertised by the hosted endpoint. |
| `target.api_endpoint.api_key_name=<env-var-name>` | Name of the environment variable holding the bearer token. This is the variable name, not the secret. |
| `target.api_endpoint.type=<chat\|completions>` | Endpoint type, either chat or completions. |
| `evaluation.nemo_evaluator_config.config.params.limit_samples=<int>` | Per-task sample cap for smoke tests. |
| `evaluation.nemo_evaluator_config.config.params.parallelism=<int>` | Concurrent requests issued by the evaluator where supported. |
| `evaluation.nemo_evaluator_config.config.params.request_timeout=<int>` | Per-request timeout in seconds. |
| `evaluation.nemo_evaluator_config.config.params.extra.tokenizer=<path-or-id>` | Tokenizer used by log-probability tasks such as HellaSwag. |
| `deployment.checkpoint_path=<iter_* path>` | Megatron Bridge checkpoint path used by the launcher deployment in default.yaml. |
| `deployment.image=<container>` | Container image used by the launcher deployment in default.yaml. |
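Overrides that contain brackets, such as task_filters, generally need quoting because [ and ] are glob characters in the shell. The following is a minimal sketch, assuming a configured task named hellaswag and illustrative values for parallelism and request_timeout; none of these values come from this page.

```shell
# Collect the overrides as positional parameters so each one stays a single
# argument. The task name and the numeric values are assumptions for
# illustration only.
set -- \
  'task_filters=[hellaswag]' \
  evaluation.nemo_evaluator_config.config.params.parallelism=4 \
  evaluation.nemo_evaluator_config.config.params.request_timeout=120

# Show what would be passed, one override per line.
printf '%s\n' "$@"

# To run for real, append "$@" (quoted) to the command:
# uv run --no-sync nemotron steps run eval/model_eval -c tiny_chat "$@"
```

Quoting "$@" when forwarding keeps the bracketed list intact even if a file in the working directory happens to match the glob pattern.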

Discovery Commands#

uv run --no-sync nemotron steps list --category eval --json
uv run --no-sync nemotron steps show eval/model_eval --json

nemotron steps show eval/model_eval --json prints the full step contract, including consumes, produces, parameters, strategies, and errors.

Examples#

Hosted Chat Smoke Test#

: "${NVIDIA_API_KEY:?Set NVIDIA_API_KEY}"
: "${NEMO_EVALUATOR_MODEL_URL:?Set the chat-completions endpoint URL}"
: "${NEMO_EVALUATOR_MODEL_ID:?Set the endpoint model id}"

uv run --no-sync nemotron steps run eval/model_eval \
  -c tiny_chat \
  output_dir=./output/eval-tiny-chat \
  target.api_endpoint.url="$NEMO_EVALUATOR_MODEL_URL" \
  target.api_endpoint.model_id="$NEMO_EVALUATOR_MODEL_ID" \
  target.api_endpoint.api_key_name=NVIDIA_API_KEY \
  target.api_endpoint.type=chat \
  evaluation.nemo_evaluator_config.config.params.limit_samples=1
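Because target.api_endpoint.api_key_name takes the name of an environment variable rather than its value, a preflight check can look the variable up by name before launching. The sketch below shows one way to do that in POSIX shell; require_env_by_name is a hypothetical helper and not part of the Nemotron CLI.

```shell
# Hypothetical helper: succeed only if the variable whose NAME is given in $1
# is set and non-empty. The secret itself is never printed.
require_env_by_name() {
  # Reject anything that is not a plain identifier before handing it to eval.
  case "$1" in
    ''|[0-9]*|*[!A-Za-z0-9_]*)
      echo "invalid variable name: $1" >&2
      return 2
      ;;
  esac
  # Indirect lookup: test the value of the variable named by $1.
  if ! eval "[ -n \"\${$1:-}\" ]"; then
    echo "environment variable $1 is not set" >&2
    return 1
  fi
}

# The same name that is passed as target.api_endpoint.api_key_name.
api_key_name=NVIDIA_API_KEY
if require_env_by_name "$api_key_name"; then
  echo "token variable $api_key_name is available"
fi
```

Validating the name first keeps the eval limited to a simple variable reference, so an unexpected string cannot be executed as shell code.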

Megatron Checkpoint Evaluation Config#

Use default.yaml when NeMo Evaluator Launcher should deploy a Megatron Bridge checkpoint and then run the configured tasks.

uv run --no-sync nemotron steps run eval/model_eval \
  -c default \
  output_dir=./output/eval-megatron \
  deployment.checkpoint_path=/path/to/checkpoint/iter_0001000 \
  evaluation.nemo_evaluator_config.config.params.limit_samples=1

Compile Without Dispatching#

uv run --no-sync nemotron steps run eval/model_eval -d -c tiny_chat \
  target.api_endpoint.url="$NEMO_EVALUATOR_MODEL_URL" \
  target.api_endpoint.model_id="$NEMO_EVALUATOR_MODEL_ID"

Launcher Dry Run#

uv run --no-sync nemotron steps run eval/model_eval -c tiny_chat dry_run=true
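The two dry-run mechanisms on this page act at different layers: -d stops after compiling the Nemotron job config, while dry_run=true is passed on to NeMo Evaluator Launcher. The following sketch maps each layer to its documented argument; the function name and the labels compile and launcher are illustrative only.

```shell
# Hypothetical helper: print the argument that selects a given dry-run layer.
dry_run_arg() {
  case "$1" in
    compile)
      # CLI flag: compile the Nemotron job config and exit without dispatching.
      printf '%s\n' '-d'
      ;;
    launcher)
      # Override: dry-run mode handled by NeMo Evaluator Launcher.
      printf '%s\n' 'dry_run=true'
      ;;
    *)
      echo "unknown dry-run layer: $1" >&2
      return 1
      ;;
  esac
}

dry_run_arg compile
dry_run_arg launcher
```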