Quickstart#

Get up and running with NeMo Evaluator in minutes. Choose your preferred approach based on your needs and experience level.

Prerequisites#

All paths require:

OpenAI-compatible endpoint (hosted or self-deployed)
Valid API key for your chosen endpoint

Quick Reference#

Task	Command
List benchmarks	`nemo-evaluator-launcher ls tasks`
Run evaluation	`nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/<config>.yaml`
Check status	`nemo-evaluator-launcher status <invocation_id>`
Job info	`nemo-evaluator-launcher info <invocation_id>`
Export results	`nemo-evaluator-launcher export <invocation_id> --dest local --format json`
Dry run	Add `--dry-run` to any run command
Test with limited samples	Add `-o +config.params.limit_samples=3`

Choose Your Path#

Select the approach that best matches your workflow and technical requirements:

NeMo Evaluator Launcher

Recommended for most users

Unified CLI experience with automated container management, built-in orchestration, and result export capabilities.

NeMo Evaluator Launcher

NeMo Evaluator Core

For Python developers

Programmatic control with full adapter features, custom configurations, and direct API access for integration into existing workflows.

NeMo Evaluator Core

NeMo Framework Container

For NeMo Framework Users

End-to-end training and evaluation of large language models (LLMs).

Evaluate checkpoints trained by NeMo Framework

Container Direct

For container workflows

Direct container execution with volume mounting, environment control, and integration into Docker-based CI/CD pipelines.

Container Direct

Model Endpoints#

NeMo Evaluator works with any OpenAI-compatible endpoint. You have several options:

Hosted Endpoints (Recommended)#

NVIDIA Build: build.nvidia.com - Ready-to-use hosted models
OpenAI: Standard OpenAI API endpoints
Other providers: Anthropic, Cohere, or any OpenAI-compatible API

Self-Hosted Options#

If you prefer to host your own models, verify OpenAI compatibility using our Testing Endpoint Compatibility guide.

If you are deploying the model locally with Docker, you can use a dedicated docker network. This will provide a secure connetion between deployment and evaluation docker containers.

# create a dedicated docker network
docker network create my-custom-network

# launch deployment
docker run --gpus all --network my-custom-network --name my-phi-container vllm/vllm-openai:latest \
    --model microsoft/Phi-4-mini-instruct --max-model-len 8192

# Or use other serving frameworks
# TRT-LLM, NeMo Framework, etc.

Create an evaluation config:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: my_phi_test
  extra_docker_args: "--network my-custom-network"  # same network as used for deployment

target:
  api_endpoint:
    model_id: microsoft/Phi-4-mini-instruct
    url: http://my-phi-container:8000/v1/chat/completions
    api_key_name: null

evaluation:
  tasks:
    - name: simple_evals.mmlu_pro
      nemo_evaluator_config:
        config:
          params:
            limit_samples: 10 # TEST ONLY: Limits to 10 samples for quick testing
            parallelism: 1

Save the config to a file (e.g. phi-eval.yaml) and launch the evaluation:

nemo-evaluator-launcher run \
    --config ./phi-eval.yaml \
    -o execution.output_dir=./phi-results


<!-- See {ref}`deployment-overview` for detailed deployment options. -->

## Validation and Troubleshooting

### Quick Validation Steps

Before running full evaluations, verify your setup:

```bash
# 1. Test your endpoint connectivity
export NGC_API_KEY=nvapi-...
curl -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
    -H "Authorization: Bearer $NGC_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 10
    }'

# 2. Run a dry-run to validate configuration
nemo-evaluator-launcher run \
    --config packages/nemo-evaluator-launcher/examples/local_llama_3_1_8b_instruct.yaml \
    --dry-run

# 3. Run a minimal test with very few samples
nemo-evaluator-launcher run \
    --config packages/nemo-evaluator-launcher/examples/local_llama_3_1_8b_instruct.yaml \
    -o +config.params.limit_samples=1 \
    -o execution.output_dir=./test_results

Common Issues and Solutions#

API Key Issues

# Verify your API key is set correctly
echo $NGC_API_KEY

# Test with a simple curl request (see above)

Container Issues

# Check Docker is running and has GPU access
docker run --rm --gpus all nvidia/cuda:11.8-base-ubuntu20.04 nvidia-smi

# Pull the latest container if you have issues
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.10

Configuration Issues

# Enable debug logging
export LOG_LEVEL=DEBUG

# Check available evaluation types
nemo-evaluator-launcher ls tasks

Result Validation

# Check if results were generated
find ./results -name "*.yml" -type f

# View task results
cat ./results/<invocation_id>/<task_name>/artifacts/results.yml

# Or export and view processed results
nemo-evaluator-launcher export <invocation_id> --dest local --format json
cat ./results/<invocation_id>/processed_results.json

Next Steps#

After completing your quickstart:

Explore More Benchmarks

# List all available tasks
nemo-evaluator-launcher ls tasks

# Run with limited samples for quick testing
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_limit_samples.yaml

Export Results

# Export to MLflow
nemo-evaluator-launcher export <invocation_id> --dest mlflow

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb

# Export to Google Sheets
nemo-evaluator-launcher export <invocation_id> --dest gsheets

# Export to local files
nemo-evaluator-launcher export <invocation_id> --dest local --format json

Scale to Clusters

cd packages/nemo-evaluator-launcher
# Run on Slurm cluster
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_llama_3_1_8b_instruct.yaml

# Run on Lepton AI
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_vllm_llama_3_1_8b_instruct.yaml