NeMo Evaluator Launcher#
Best for: Most users who want a unified CLI experience
The NeMo Evaluator Launcher provides the simplest way to run evaluations with automated container management, built-in orchestration, and comprehensive result export capabilities.
Prerequisites#
- An OpenAI-compatible endpoint (hosted or self-deployed) and an API key if the endpoint is gated, referred to below as NGC_API_KEY when using models hosted on NVIDIA’s serving platform
- Docker installed (for local execution)
- The NeMo Evaluator repository cloned (for access to the examples):
  git clone https://github.com/NVIDIA-NeMo/Evaluator.git
- A Hugging Face token with access to the gated GPQA-Diamond dataset (request access at https://huggingface.co/datasets/Idavidrein/gpqa), referred to below as HF_TOKEN_FOR_GPQA_DIAMOND
Quick Start#
# 1. Install the launcher
pip install nemo-evaluator-launcher
# Optional: Install with specific exporters
pip install "nemo-evaluator-launcher[all]" # All exporters
pip install "nemo-evaluator-launcher[mlflow]" # MLflow only
pip install "nemo-evaluator-launcher[wandb]" # W&B only
pip install "nemo-evaluator-launcher[gsheets]" # Google Sheets only
# 2. List available benchmarks
nemo-evaluator-launcher ls tasks
# 3. Run evaluation against a hosted endpoint
# Prerequisites: Set your API key and HF token. Visit https://huggingface.co/datasets/Idavidrein/gpqa
# to get access to the gated GPQA dataset for this task.
export NGC_API_KEY=nvapi-...
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_...
# Move into the cloned directory (see above).
cd Evaluator
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct \
-o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o target.api_endpoint.api_key_name=NGC_API_KEY \
-o execution.output_dir=./results
# 4. Check status
nemo-evaluator-launcher status <invocation_id> --json # use the ID printed by the run command
# 5. Find all the recent runs you launched
nemo-evaluator-launcher ls runs --since 2h # list runs from last 2 hours
Note
You can use shortened IDs in the status command, for example abcd instead of the full abcdef0123456, or ab.0 instead of abcdef0123456.0, as long as there are no collisions. This is syntactic sugar that makes the commands slightly easier to type.
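For example, if the run command printed the invocation ID abcdef0123456, the following calls are equivalent ways to check it (the IDs here are the placeholders from the note):
nemo-evaluator-launcher status abcdef0123456 --json   # full invocation ID
nemo-evaluator-launcher status abcd --json            # shortened prefix, valid while unambiguous
nemo-evaluator-launcher status ab.0                   # job 0 of the invocation, shortened form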
# 6a. Check the results
cat <job_output_dir>/artifacts/results.yml # use the output_dir printed by the run command
# 6b. Check the running logs
tail -f <job_output_dir>/*/logs/stdout.log # use the output_dir printed by the run command
# 7. Export your results (JSON/CSV)
nemo-evaluator-launcher export <invocation_id> --dest local --format json
# 8. Kill the running job(s)
nemo-evaluator-launcher kill <invocation_id> # use the ID printed by the run command
Complete Working Example#
Here’s a complete example using NVIDIA Build (build.nvidia.com):
# Prerequisites: Set your API key and HF token
export NGC_API_KEY=nvapi-...
export HF_TOKEN_FOR_GPQA_DIAMOND=hf_...
# Run a quick test evaluation with limited samples
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct_limit_samples \
-o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions \
-o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
-o target.api_endpoint.api_key_name=NGC_API_KEY \
-o execution.output_dir=./results
What happens:
- Pulls the appropriate evaluation container
- Runs the benchmark against your endpoint
- Saves results to the specified directory
- Provides monitoring and status updates
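After the run completes, you can verify it with the same commands from the Quick Start, substituting the invocation ID and output directory printed by the run command:
# Check job status
nemo-evaluator-launcher status <invocation_id> --json
# Inspect the aggregated results
cat <job_output_dir>/artifacts/results.yml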
Key Features#
Automated Container Management#
- Automatically pulls and manages evaluation containers
- Handles volume mounting for results
- No manual Docker commands required
Built-in Orchestration#
- Job queuing and parallel execution
- Progress monitoring and status tracking
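The Quick Start commands cover the typical monitoring workflow, for example:
nemo-evaluator-launcher ls runs --since 2h              # list recent runs
nemo-evaluator-launcher status <invocation_id> --json   # track progress of one invocation
nemo-evaluator-launcher kill <invocation_id>            # stop jobs that are no longer needed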
Result Export#
- Export to MLflow, Weights & Biases, or local formats
- Structured result formatting
- Integration with experiment tracking platforms
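The local export shown in the Quick Start writes JSON or CSV. As a sketch, assuming the other --dest values mirror the optional exporter extras (mlflow, wandb, gsheets) and that the matching extra is installed:
nemo-evaluator-launcher export <invocation_id> --dest local --format csv
# Assumed destination name; requires pip install "nemo-evaluator-launcher[mlflow]"
nemo-evaluator-launcher export <invocation_id> --dest mlflow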
Configuration Management#
- YAML-based configuration system
- Override parameters via command line
- Template configurations for common scenarios
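Any key in the example YAML configurations can be overridden from the command line with -o and a dotted path, as in the Quick Start; a minimal sketch (the override values here are illustrative):
nemo-evaluator-launcher run \
    --config-dir packages/nemo-evaluator-launcher/examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
    -o execution.output_dir=./my-results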
Next Steps#
- Explore different evaluation types: nemo-evaluator-launcher ls tasks
- Try advanced configurations in the packages/nemo-evaluator-launcher/examples/ directory
- Export results to your preferred tracking platform
- Scale to cluster execution with Slurm or cloud providers
For more advanced control, consider the NeMo Evaluator Core Python API or Container Direct approaches.