Local Execution#
Run evaluations on your local machine using Docker containers. The local executor connects to existing model endpoints and orchestrates evaluation tasks locally.
Important
The local executor does not deploy models. You must have an existing model endpoint running before starting evaluation. For launcher-orchestrated model deployment, use Slurm Deployment via Launcher or Lepton AI Deployment via Launcher.
Overview#
Local execution:
Runs evaluation containers locally using Docker
Connects to existing model endpoints (local or remote)
Suitable for development, testing, and small-scale evaluations
Supports parallel or sequential task execution
Quick Start#
# Run evaluation against existing endpoint
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct
Configuration#
Basic Configuration#
# examples/local_llama_3_1_8b_instruct.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: llama_3_1_8b_instruct_results
  # mode: sequential # Optional: run tasks sequentially instead of parallel

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: API_KEY

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
Required fields:
execution.output_dir: Directory for results
target.api_endpoint.url: Model endpoint URL
evaluation.tasks: List of evaluation tasks
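The api_key_name field names an environment variable rather than holding the key itself, so export that variable before launching. A minimal sketch, with a placeholder value:
# Export the API key referenced by api_key_name (placeholder value)
export API_KEY="<your-api-key>"
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct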
Execution Modes#
execution:
  output_dir: ./results
  mode: parallel # Default: run tasks in parallel
  # mode: sequential # Run tasks one at a time
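The same setting can also be changed at launch time with the -o override flag described under Command-Line Usage; a small sketch, assuming the example configuration above:
# Run tasks one at a time without editing the YAML
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o execution.mode=sequential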
Multi-Task Evaluation#
evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 200
    - name: gsm8k
      overrides:
        config.params.limit_samples: 100
    - name: humaneval
      overrides:
        config.params.limit_samples: 50
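When several tasks share the same setting, repeating it per task gets verbose. A sketch of a shared override, assuming that keys under evaluation.overrides (used in the adapter example below) apply to every listed task:
evaluation:
  overrides:
    config.params.limit_samples: 100 # assumed to apply to all listed tasks
  tasks:
    - name: mmlu_pro
    - name: gsm8k
    - name: humaneval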
Task-Specific Configuration#
evaluation:
  tasks:
    - name: gpqa_diamond
      overrides:
        config.params.temperature: 0.6
        config.params.top_p: 0.95
        config.params.max_new_tokens: 8192
        config.params.parallelism: 4
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND
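The env_vars entry above passes a value from the host into the evaluation task; the reading here, that the key (HF_TOKEN) is the variable name seen by the task and the value names the host-side variable, is an assumption based on this example. Export the host-side variable before running:
# Export the host variable referenced in env_vars (placeholder value)
export HF_TOKEN_FOR_GPQA_DIAMOND="<your-hugging-face-token>"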
With Adapter Configuration#
Configure adapters using evaluation overrides:
target:
  api_endpoint:
    url: http://localhost:8080/v1/chat/completions
    model_id: my-model

evaluation:
  overrides:
    target.api_endpoint.adapter_config.use_reasoning: true
    target.api_endpoint.adapter_config.use_system_prompt: true
    target.api_endpoint.adapter_config.custom_system_prompt: "Think step by step."
For detailed adapter configuration options, refer to Evaluation Adapters.
Command-Line Usage#
Basic Commands#
# Run evaluation
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct
# Dry run to preview configuration
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
--dry-run
# Override endpoint URL
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
-o target.api_endpoint.url=http://localhost:8080/v1/chat/completions
Job Management#
# Check job status
nv-eval status <job_id>
# Check entire invocation
nv-eval status <invocation_id>
# Kill running job
nv-eval kill <job_id>
# List available tasks
nv-eval ls tasks
# List recent runs
nv-eval ls runs
Requirements#
System Requirements#
Docker: Docker Engine installed and running
Storage: Adequate space for evaluation containers and results
Network: Internet access to pull Docker images
Model Endpoint#
You must have a model endpoint running and accessible before starting evaluation. Options include:
Manual Deployment using vLLM, TensorRT-LLM, or other frameworks (see the example below)
Hosted Services like NVIDIA API Catalog or OpenAI
Custom deployment solutions
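For the first option, a minimal sketch of serving an OpenAI-compatible endpoint with vLLM; the model name and port are examples to adapt to your setup:
# Serve an OpenAI-compatible chat endpoint locally with vLLM (example model and port)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8080
# The endpoint is then reachable at http://localhost:8080/v1/chat/completions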
Troubleshooting#
Docker Issues#
Docker not running:
# Check Docker status
docker ps
# Start Docker daemon (varies by platform)
sudo systemctl start docker # Linux
# Or open Docker Desktop on macOS/Windows
Permission denied:
# Add user to docker group (Linux)
sudo usermod -aG docker $USER
# Log out and back in for changes to take effect
Endpoint Connectivity#
Cannot connect to endpoint:
# Test endpoint availability
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "test", "messages": [{"role": "user", "content": "Hi"}]}'
API authentication errors:
Verify that api_key_name matches your environment variable
Check that the environment variable has a value: echo $API_KEY
Check that the API key has the proper permissions
Evaluation Issues#
Job hangs or shows no progress:
Check logs in the output directory:
# Track logs in real-time
tail -f <output_dir>/<task_name>/logs/stdout.log
# Kill and restart if needed
nv-eval kill <job_id>
Tasks fail with errors:
Check logs in <output_dir>/<task_name>/logs/stdout.log
Verify that the model endpoint supports the required request format
Ensure adequate disk space for results (see the check below)
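A quick way to confirm the last point, assuming a Linux or macOS shell:
# Check free space on the filesystem that holds the output directory
df -h <output_dir>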
Configuration Validation#
# Validate configuration before running
nv-eval run \
--config-dir examples \
--config-name local_llama_3_1_8b_instruct \
--dry-run
Next Steps#
Deploy your own model: See Manual Deployment for local model serving
Scale to HPC: Use Slurm Deployment via Launcher for cluster deployments
Cloud execution: Try Lepton AI Deployment via Launcher for cloud-based evaluation
Configure adapters: Add interceptors with Evaluation Adapters