Frequently Asked Questions#

What benchmarks and harnesses are supported?#

The docs list hundreds of benchmarks across multiple harnesses, available via curated NGC evaluation containers and the unified Launcher.

Reference: About Selecting Benchmarks

Tip

Discover available tasks with

nemo-evaluator-launcher ls tasks

How do I set log dir and verbose logging?#

Set these environment variables for logging configuration:

# Set log level (INFO, DEBUG, WARNING, ERROR, CRITICAL)
export LOG_LEVEL=DEBUG
# or (legacy, still supported)
export NEMO_EVALUATOR_LOG_LEVEL=DEBUG

Reference: Logging Configuration.


Can I run distributed or on a scheduler?#

Yes. Launcher supports multiple executors. For optimal performance, the SLURM executor is recommended. It schedules and executes jobs across cluster nodes, enabling parallel, large‑scale evaluation runs while preserving reproducibility via containerized benchmarks.

See Slurm Executor for details.


Can I point Evaluator at my own endpoint?#

Yes. Provide your OpenAI‑compatible endpoint. The “none” deployment option means no model deployment is performed as part of the evaluation job. Instead, you provide an existing OpenAI-compatible endpoint. The launcher handles running evaluation tasks while connecting to your existing endpoint.

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct    # Model identifier (required)
    url: https://your-endpoint.com/v1/chat/completions  # Endpoint URL (required)
    api_key_name: API_KEY                    # Environment variable name (recommended)

Reference: None Deployment.


Can I test my endpoint for OpenAI compatibility?

Yes. Preview the full resolved configuration without executing using --dry-run :

nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_basic.yaml --dry-run

Reference: Dry Run.


Can I store and retrieve per-sample results, not just the summary?#

Yes. Capture full request/response artifacts and retrieve them from the run’s artifacts folder.

Enable detailed logging with nemo_evaluator_config:

evaluation:
  # Request + response logging (example at 1k each)
  nemo_evaluator_config:
    target:
      api_endpoint:
        adapter_config:
          use_request_logging: True
          max_saved_requests: 1000
          use_response_logging: True
          max_saved_responses: 1000

These enable the RequestLoggingInterceptor and ResponseLoggingInterceptor so each prompt/response pair is saved alongside the evaluation job.

Retrieve artifacts after the run:

nemo-evaluator-launcher export <invocation_id> --dest local --output-dir ./artifacts --copy-logs

Look under ./artifacts/ for results.yml, reports, logs, and saved request/response files.

Reference: Request Logging Interceptor.


Where do I find evaluation results?#

After a run completes, copy artifacts locally:

nemo-evaluator-launcher info <invocation_id> --copy-artifacts ./artifacts

Inside ./artifacts/ you’ll see the run config, results.yaml (main output file), HTML/JSON reports, logs, and cached request/response files, if caching was used.

Where the output is structured:

  <output_dir>/
     ├── eval_factory_metrics.json
     ├── report.html
     ├── report.json
     ├── results.yml
     ├── run_config.yml
     └── <Task specific arifacts>/

Reference: Evaluation Output.


Can I export a consolidated JSON of scores?#

Yes. JSON is included in the standard output exporter, along with automatic exporters for MLflow, Weights & Biases, and Google Sheets.

nemo-evaluator-launcher export <invocation_id> --dest local --format json

This creates processed_results.json (you can also pass multiple invocation IDs to merge).

Exporter docs: Local files, W&B, MLflow, GSheets are listed under Launcher → Exporters in the docs.

Reference: Exporters.


What’s the difference between Launcher and Core?#

  • Launcher (nemo-evaluator-launcher): Unified CLI with config/exec backends (local/Slurm/Lepton), container orchestration, and exporters. Best for most users. See NeMo Evaluator Launcher.

  • Core (nemo-evaluator): Direct access to the evaluation engine and adapters—useful for custom programmatic pipelines and advanced interceptor use. See NeMo Evaluator.


Can I add a new benchmark?#

Yes. Use a Framework Definition File (FDF)—a YAML that declares framework metadata, default commands/params, and one or more evaluation tasks. Minimal flow:

  1. Create an FDF with framework, defaults, and evaluations sections.

  2. Point the launcher/Core at your FDF and run.

  3. (Recommended) Package as a container for reproducibility and shareability. See Extending NeMo Evaluator.

Skeleton FDF (excerpt):

framework:
  name: my-custom-eval
  pkg_name: my_custom_eval
defaults:
  command: >-
    my-eval-cli --model {{target.api_endpoint.model_id}}
                --task {{config.params.task}}
                --output {{config.output_dir}}
evaluations:
  - name: my_task_1
    defaults:
      config:
        params:
          task: my_task_1

See the “Framework Definition File (FDF)” page for the full example and field reference.

Reference: Framework Definition File (FDF).


Why aren’t exporters included in the main wheel?#

Exporters target external systems (e.g., W&B, MLflow, Google Sheets). Each of those adds heavy/optional dependencies and auth integrations. To keep the base install lightweight and avoid forcing unused deps on every user, exporters ship as optional extras:

# Only what you need
pip install "nemo-evaluator-launcher[wandb]"
pip install "nemo-evaluator-launcher[mlflow]"
pip install "nemo-evaluator-launcher[gsheets]"

# Or everything
pip install "nemo-evaluator-launcher[all]"

Exporter docs: Local files, W&B, MLflow, GSheets are listed under Exporters.


How is input configuration managed?#

NeMo Evaluator uses Hydra for configuration management, allowing flexible composition, inheritance, and command-line overrides.

Each evaluation is defined by a YAML configuration file that includes four primary sections:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: results

target:
  api_endpoint:
    model_id: meta/llama-3.2-3b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  - name: gpqa_diamond
  - name: ifeval

This structure defines where to run, how to serve the model, which model or endpoint to evaluate, and what benchmarks to execute.

You can start from a provided example config or compose your own using Hydra’s defaults list to combine deployment, execution, and benchmark modules.

Reference: Configuration.


Can I customize or override configuration values?#

Yes. You can override any field in the YAML file directly from the command line using the -o flag:

# Override output directory
nemo-evaluator-launcher run --config your_config.yaml \
  -o execution.output_dir=my_results

# Override multiple fields
nemo-evaluator-launcher run --config your_config.yaml \
  -o target.api_endpoint.url="https://new-endpoint.com/v1/chat/completions" \
  -o target.api_endpoint.model_id=openai/gpt-4o

Overrides are merged dynamically at runtime—ideal for testing new endpoints, swapping models, or changing output destinations without editing your base config.

Tip

Always start with a dry run to validate your configuration before launching a full evaluation:

nemo-evaluator-launcher run --config your_config.yaml --dry-run

Reference: Configuration.


How do I choose the right deployment and execution configuration?#

NeMo Evaluator separates deployment (how your model is served) from execution (where your evaluations are run). These are configured in the defaults section of your YAML file:

defaults:
  - execution: local      # Where to run: local, lepton, or slurm
  - deployment: none      # How to serve the model: none, vllm, sglang, nim, trtllm, generic

Deployment Options — How your model is served

Option

Description

Best for

none

Uses an existing API endpoint (e.g., NVIDIA API Catalog, OpenAI, Anthropic). No deployment needed.

External APIs or already-deployed services

vllm

High-performance inference server for LLMs with tensor parallelism and caching.

Fast local/cluster inference, production workloads

sglang

Lightweight structured generation server optimized for throughput.

Evaluating structured or long-form text generation

nim

NVIDIA Inference Microservice (NIM) – optimized for enterprise-grade serving with autoscaling and telemetry.

Enterprise, production, and reproducible benchmarks

trtllm

TensorRT-LLM backend using GPU-optimized kernels.

Lowest latency and highest GPU efficiency

generic

Use a custom serving stack of your choice.

Custom frameworks or experimental endpoints

Execution Platforms — Where evaluations run

Platform

Description

Use case

local

Runs Docker-based evaluation locally.

Development, testing, or small-scale benchmarking

lepton

Runs on NVIDIA Lepton for on-demand GPU execution.

Scalable, production-grade evaluations

slurm

Uses your HPC cluster’s job scheduler.

Research clusters or large batch evaluations

Example:

defaults:
  - execution: lepton
  - deployment: vllm

This configuration launches the model with vLLM serving and runs benchmarks remotely on Lepton GPUs.

When in doubt:

  • Use deployment: none + execution: local for your first run (quickest setup).

  • Use vllm or nim once you need scalability and speed.

Always test first:

nemo-evaluator-launcher run --config your_config.yaml --dry-run

Reference: Configuration.


Can I use Evaluator without internet access?#

Yes. NeMo Evaluator uses datasets and model checkpoints from Hugging Face Hub. If a requested dataset or model is not available locally, it is downloaded from the Hub at runtime.

When working in an environment without internet access, configure a cache directory and pre-populate it with all required data before launching the evaluation.

See the example configuration with HF caching:

defaults:
  - execution: slurm/default
  - deployment: vllm
  - _self_

# set required execution arguments
execution:
  hostname: ??? # SLURM headnode (login) hostname (required)
  username: ${oc.env:USER} # Cluster username; defaults to $USER.
  account: ??? # SLURM account allocation (required)
  output_dir: ??? # ABSOLUTE path accessible to SLURM compute nodes (required)

  # override default execution arguments
  walltime: 00:30:00
  partition: backfill

  # use mounts and env vars to load the model and datasets from cache
  mounts:
    # replace /path/to/hf_home with the absolute path to the HF home directory on the cluster
    deployment:
      /path/to/hf_home: /root/.cache/huggingface
    evaluation:
      /path/to/hf_home: /root/.cache/huggingface
    mount_home: false   # don't mount home directory

  env_vars:
    # if the checkpoint and datasets are already cached, you can set
    # HF_DATASETS_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 for offline mode
    deployment:
      HF_DATASETS_OFFLINE: 0
      TRANSFORMERS_OFFLINE: 0
    evaluation:
      HF_DATASETS_OFFLINE: 0
      TRANSFORMERS_OFFLINE: 0

# set up deployment as usual
# the model will be downloaded from Hugging Face during deployment
# and stored in the HF home directory for later use
deployment:
  checkpoint_path: null
  hf_model_handle: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: meta-llama/Llama-3.1-8B-Instruct
  tensor_parallel_size: 1
  data_parallel_size: 8
  extra_args: "--max-model-len 32768"

evaluation:
  tasks:
    - name: lm-evaluation-harness.ifeval  # chat benchmark will automatically use v1/chat/completions endpoint
    - name: gsm8k   # completions benchmark will automatically use v1/completions endpoint

Modify the example with actual paths for the mounts and run:

nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_vllm_advanced_hf_caching.yaml