Local Execution#
Run evaluations on your local machine using Docker containers. The local executor connects to existing model endpoints and orchestrates evaluation tasks locally.
Important
The local executor does not deploy models. You must have an existing model endpoint running before starting evaluation. For launcher-orchestrated model deployment, use Slurm Deployment via Launcher or Lepton AI Deployment via Launcher.
Overview#
Local execution:
Runs evaluation containers locally using Docker
Connects to existing model endpoints (local or remote)
Suitable for development, testing, and small-scale evaluations
Supports parallel or sequential task execution
Quick Start#
# Run evaluation against existing endpoint
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct
Configuration#
Basic Configuration#
# examples/local_llama_3_1_8b_instruct.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: llama_3_1_8b_instruct_results
  # mode: sequential  # Optional: run tasks sequentially instead of in parallel
target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY
evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
Required fields (a minimal configuration using only these fields is sketched after the list):
execution.output_dir: Directory for evaluation results
target.api_endpoint.url: Model endpoint URL
evaluation.tasks: List of evaluation tasks
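Put together, a minimal configuration covering just these fields might look like the sketch below; the model name, endpoint URL, and output directory are placeholders, and the defaults block follows the example above:
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: my_results                              # placeholder output directory
target:
  api_endpoint:
    model_id: my-model                                # placeholder model name
    url: http://localhost:8080/v1/chat/completions    # placeholder endpoint URL
    api_key_name: null                                # set if your endpoint requires a key
evaluation:
  tasks:
    - name: ifeval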
Execution Modes#
execution:
  output_dir: ./results
  mode: parallel     # Default: run tasks in parallel
  # mode: sequential # Run tasks one at a time
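You can also switch the mode from the command line without editing the file, using the same -o override syntax shown under Command-Line Usage; this sketch assumes the example configuration above:
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct \
-o execution.mode=sequential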
Multi-Task Evaluation#
evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 200
    - name: gsm8k
      overrides:
        config.params.limit_samples: 100
    - name: humaneval
      overrides:
        config.params.limit_samples: 50
Task-Specific Configuration#
evaluation:
  tasks:
    - name: gpqa_diamond
      overrides:
        config.params.temperature: 0.6
        config.params.top_p: 0.95
        config.params.max_new_tokens: 8192
        config.params.parallelism: 4
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND
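In this example, the HF_TOKEN variable inside the evaluation container is filled from a variable in your shell environment named HF_TOKEN_FOR_GPQA_DIAMOND (the same pattern as api_key_name), so export it before launching; the token value below is a placeholder:
export HF_TOKEN_FOR_GPQA_DIAMOND=<your Hugging Face token>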
With Adapter Configuration#
Configure adapters using evaluation overrides:
target:
  api_endpoint:
    url: http://localhost:8080/v1/chat/completions
    model_id: my-model
evaluation:
  overrides:
    target.api_endpoint.adapter_config.use_reasoning: true
    target.api_endpoint.adapter_config.use_system_prompt: true
    target.api_endpoint.adapter_config.custom_system_prompt: "Think step by step."
For detailed adapter configuration options, refer to Evaluation Adapters.
Advanced settings#
If you are deploying the model locally with Docker, you can use a dedicated Docker network. This provides a secure connection between the deployment and evaluation containers.
docker network create my-custom-network
docker run --gpus all --network my-custom-network --name my-phi-container vllm/vllm-openai:latest \
--model microsoft/Phi-4-mini-instruct
Then use the same network in the evaluator config:
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: my_phi_test
  extra_docker_args: "--network my-custom-network"
target:
  api_endpoint:
    model_id: microsoft/Phi-4-mini-instruct
    url: http://my-phi-container:8000/v1/chat/completions
    api_key_name: null
evaluation:
  tasks:
    - name: simple_evals.mmlu_pro
      overrides:
        config.params.limit_samples: 10  # TEST ONLY: limits to 10 samples for quick testing
        config.params.parallelism: 1
Alternatively, you can publish the port and use the host network:
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
--model microsoft/Phi-4-mini-instruct
execution:
  extra_docker_args: "--network host"
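Because the evaluation containers then share the host's network stack, the endpoint is reachable on localhost. A sketch of the matching target section, assuming the default vLLM port 8000 published above:
target:
  api_endpoint:
    model_id: microsoft/Phi-4-mini-instruct
    url: http://localhost:8000/v1/chat/completions
    api_key_name: null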
Command-Line Usage#
Basic Commands#
# Run evaluation
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct
# Dry run to preview configuration
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct \
--dry-run
# Override endpoint URL
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct \
-o target.api_endpoint.url=http://localhost:8080/v1/chat/completions
Job Management#
# Check job status
nemo-evaluator-launcher status <job_id>
# Check entire invocation
nemo-evaluator-launcher status <invocation_id>
# Kill running job
nemo-evaluator-launcher kill <job_id>
# List available tasks
nemo-evaluator-launcher ls tasks
# List recent runs
nemo-evaluator-launcher ls runs
Requirements#
System Requirements#
Docker: Docker Engine installed and running
Storage: Adequate space for evaluation containers and results
Network: Internet access to pull Docker images
Model Endpoint#
You must have a model endpoint running and accessible before starting evaluation; a quick connectivity check is sketched after the list below. Options include:
Manual Deployment using vLLM, TensorRT-LLM, or other frameworks
Hosted Services like NVIDIA API Catalog or OpenAI
Custom deployment solutions
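Whichever option you choose, you can confirm the endpoint is reachable before launching, for example by listing models on an OpenAI-compatible server (assuming a local endpoint on port 8000; adjust the URL to your deployment):
curl http://localhost:8000/v1/models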
Troubleshooting#
Docker Issues#
Docker not running:
# Check Docker status
docker ps
# Start Docker daemon (varies by platform)
sudo systemctl start docker # Linux
# Or open Docker Desktop on macOS/Windows
Permission denied:
# Add user to docker group (Linux)
sudo usermod -aG docker $USER
# Log out and back in for changes to take effect
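Alternatively, to pick up the new group membership without logging out, you can start a subshell with the docker group (standard Linux behavior, unrelated to the launcher itself):
newgrp docker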
Endpoint Connectivity#
Cannot connect to endpoint:
# Test endpoint availability
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "test", "messages": [{"role": "user", "content": "Hi"}]}'
API authentication errors:
Verify that api_key_name matches the name of your environment variable
Check that the environment variable has a value: echo $NGC_API_KEY
Check that the API key has the proper permissions
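If the environment variable is empty or missing, export it in the same shell that runs the launcher; the key value below is a placeholder:
export NGC_API_KEY=<your API key>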
Evaluation Issues#
Job hangs or shows no progress:
Check logs in the output directory:
# Track logs in real-time
tail -f <output_dir>/<task_name>/logs/stdout.log
# Kill and restart if needed
nemo-evaluator-launcher kill <job_id>
Tasks fail with errors:
Check logs in <output_dir>/<task_name>/logs/stdout.log
Verify model endpoint supports required request format
Ensure adequate disk space for results
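A quick way to surface failures in a task's log is a case-insensitive search; the path below uses the same placeholder pattern as above:
grep -in "error" <output_dir>/<task_name>/logs/stdout.log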
Configuration Validation#
# Validate configuration before running
nemo-evaluator-launcher run \
--config-dir packages/nemo-evaluator-launcher/examples \
--config-name local_llama_3_1_8b_instruct \
--dry-run
Next Steps#
Deploy your own model: See Manual Deployment for local model serving
Scale to HPC: Use Slurm Deployment via Launcher for cluster deployments
Cloud execution: Try Lepton AI Deployment via Launcher for cloud-based evaluation
Configure adapters: Add interceptors with Evaluation Adapters