Quickstart#
Get up and running with NeMo Evaluator in minutes. Choose your preferred approach based on your needs and experience level.
Prerequisites#
All paths require:
OpenAI-compatible endpoint (hosted or self-deployed)
Valid API key for your chosen endpoint
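Most commands in this guide read the API key from an environment variable. As a minimal example (the variable name depends on your provider; NGC_API_KEY is the name used for the NVIDIA Build endpoint later in this guide):
# Export your API key so the launcher and curl examples below can use it
export NGC_API_KEY=nvapi-...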
Quick Reference#
| Task | Command |
|---|---|
| List benchmarks | `nemo-evaluator-launcher ls tasks` |
| Run evaluation | `nemo-evaluator-launcher run --config <config.yaml>` |
| Check status | `nemo-evaluator-launcher status <invocation_id>` |
| Job info | `nemo-evaluator-launcher status <job_id>` |
| Export results | `nemo-evaluator-launcher export <invocation_id> --dest local` |
| Dry run | Add `--dry-run` to a `run` command |
| Test with limited samples | Add `-o +config.params.limit_samples=1` to a `run` command |
Choose Your Path#
Select the approach that best matches your workflow and technical requirements:
Recommended for most users: a unified CLI experience with automated container management, built-in orchestration, and result export capabilities.
For Python developers: programmatic control with full adapter features, custom configurations, and direct API access for integration into existing workflows.
For NeMo Framework users: end-to-end training and evaluation of large language models (LLMs).
For container workflows: direct container execution with volume mounting, environment control, and integration into Docker-based CI/CD pipelines.
Model Endpoints#
NeMo Evaluator works with any OpenAI-compatible endpoint. You have several options:
Hosted Endpoints (Recommended)#
NVIDIA Build: build.nvidia.com - Ready-to-use hosted models
OpenAI: Standard OpenAI API endpoints
Other providers: Anthropic, Cohere, or any OpenAI-compatible API
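You can point a run at a hosted endpoint without editing a config file by using the same -o overrides shown later in this guide. The following is a hypothetical sketch, assuming the bundled example config and the NVIDIA Build model used in the validation section below; the api_key_name override naming the environment variable that holds your key is an assumption, and the URL is quoted for the override parser:
# Hypothetical example: run an example config against a hosted NVIDIA Build endpoint
export NGC_API_KEY=nvapi-...
nemo-evaluator-launcher run \
  --config packages/nemo-evaluator-launcher/examples/local_llama_3_1_8b_instruct.yaml \
  -o 'target.api_endpoint.url="https://integrate.api.nvidia.com/v1/chat/completions"' \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.api_key_name=NGC_API_KEY \
  -o execution.output_dir=./hosted_results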
Self-Hosted Options#
If you prefer to host your own models, verify OpenAI compatibility using our Testing Endpoint Compatibility guide.
If you are deploying the model locally with Docker, you can use a dedicated Docker network. This provides a secure connection between the deployment and evaluation containers.
# create a dedicated docker network
docker network create my-custom-network
# launch deployment
docker run --gpus all --network my-custom-network --name my-phi-container vllm/vllm-openai:latest \
--model microsoft/Phi-4-mini-instruct --max-model-len 8192
# Or use other serving frameworks
# TRT-LLM, NeMo Framework, etc.
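Before launching an evaluation, it can help to confirm the deployment is reachable on the shared network. Below is a quick check, assuming the container and network names from the command above; the vLLM OpenAI-compatible server exposes a /v1/models listing, and the server may need a few minutes to download the model and start:
# Verify the endpoint responds from inside the same Docker network
docker run --rm --network my-custom-network curlimages/curl:latest \
  -s http://my-phi-container:8000/v1/models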
Create an evaluation config:
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: my_phi_test
  extra_docker_args: "--network my-custom-network"  # same network as used for deployment

target:
  api_endpoint:
    model_id: microsoft/Phi-4-mini-instruct
    url: http://my-phi-container:8000/v1/chat/completions
    api_key_name: null

evaluation:
  tasks:
    - name: simple_evals.mmlu_pro
      nemo_evaluator_config:
        config:
          params:
            limit_samples: 10  # TEST ONLY: limits to 10 samples for quick testing
            parallelism: 1
Save the config to a file (e.g. phi-eval.yaml) and launch the evaluation:
nemo-evaluator-launcher run \
--config ./phi-eval.yaml \
-o execution.output_dir=./phi-results
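Once the job is submitted, you can keep an eye on it from the same CLI. A quick sketch, assuming the status command listed in the Quick Reference above and the invocation ID reported when the run starts:
# Check progress of the submitted evaluation
nemo-evaluator-launcher status <invocation_id>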
Validation and Troubleshooting#
Quick Validation Steps#
Before running full evaluations, verify your setup:
# 1. Test your endpoint connectivity
export NGC_API_KEY=nvapi-...
curl -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
-H "Authorization: Bearer $NGC_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 10
}'
# 2. Run a dry-run to validate configuration
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_llama_3_1_8b_instruct.yaml \
--dry-run
# 3. Run a minimal test with very few samples
nemo-evaluator-launcher run \
--config packages/nemo-evaluator-launcher/examples/local_llama_3_1_8b_instruct.yaml \
-o +config.params.limit_samples=1 \
-o execution.output_dir=./test_results
Common Issues and Solutions#
# Verify your API key is set correctly
echo $NGC_API_KEY
# Test with a simple curl request (see above)
# Check Docker is running and has GPU access
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
# Pull the latest container if you have issues
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.10
# Enable debug logging
export LOG_LEVEL=DEBUG
# Check available evaluation types
nemo-evaluator-launcher ls tasks
# Check if results were generated
find ./results -name "*.yml" -type f
# View task results
cat ./results/<invocation_id>/<task_name>/artifacts/results.yml
# Or export and view processed results
nemo-evaluator-launcher export <invocation_id> --dest local --format json
cat ./results/<invocation_id>/processed_results.json
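The exported file is plain JSON, so any JSON tool gives a quick look at the scores. A minimal sketch using Python's built-in pretty-printer, assuming the local export shown above:
# Pretty-print the exported results
python -m json.tool ./results/<invocation_id>/processed_results.json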
Next Steps#
After completing your quickstart:
# List all available tasks
nemo-evaluator-launcher ls tasks
# Run with limited samples for quick testing
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/local_limit_samples.yaml
# Export to MLflow
nemo-evaluator-launcher export <invocation_id> --dest mlflow
# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb
# Export to Google Sheets
nemo-evaluator-launcher export <invocation_id> --dest gsheets
# Export to local files
nemo-evaluator-launcher export <invocation_id> --dest local --format json
# Run on a Slurm cluster
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/slurm_llama_3_1_8b_instruct.yaml
# Run on Lepton AI
nemo-evaluator-launcher run --config packages/nemo-evaluator-launcher/examples/lepton_vllm_llama_3_1_8b_instruct.yaml