Local Execution#

Run evaluations on your local machine using Docker containers. The local executor connects to existing model endpoints and orchestrates evaluation tasks locally.

Important

The local executor does not deploy models. You must have an existing model endpoint running before starting evaluation. For launcher-orchestrated model deployment, use Slurm Deployment via Launcher or Lepton AI Deployment via Launcher.

Overview#

Local execution:

  • Runs evaluation containers locally using Docker

  • Connects to existing model endpoints (local or remote)

  • Suitable for development, testing, and small-scale evaluations

  • Supports parallel or sequential task execution

Quick Start#

# Run evaluation against existing endpoint
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct
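
Before running, export the API key that the example config references through target.api_endpoint.api_key_name (API_KEY in this case). The value below is a placeholder:

# Make the key available under the name the config expects
export API_KEY=<your-api-key>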

Configuration#

Basic Configuration#

# examples/local_llama_3_1_8b_instruct.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: llama_3_1_8b_instruct_results
  # mode: sequential  # Optional: run tasks sequentially instead of in parallel

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: API_KEY

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond

Required fields (a minimal example follows this list):

  • execution.output_dir: Directory for results

  • target.api_endpoint.url: Model endpoint URL

  • evaluation.tasks: List of evaluation tasks
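
A minimal configuration containing only these required fields (plus the defaults block from the example above) might look like this; the endpoint URL and task name are illustrative:

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    url: http://localhost:8080/v1/chat/completions

evaluation:
  tasks:
    - name: ifeval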

Execution Modes#

execution:
  output_dir: ./results
  mode: parallel  # Default: run tasks in parallel
  # mode: sequential  # Run tasks one at a time
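
The mode can also be switched per run with the -o override flag shown under Command-Line Usage, assuming execution.mode accepts the same dotted-path syntax as the other overrides on this page:

# Run tasks one at a time without editing the YAML
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o execution.mode=sequential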

Multi-Task Evaluation#

evaluation:
  tasks:
    - name: mmlu_pro
      overrides:
        config.params.limit_samples: 200
    - name: gsm8k
      overrides:
        config.params.limit_samples: 100
    - name: humaneval
      overrides:
        config.params.limit_samples: 50
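
Each task writes its output to its own subdirectory under execution.output_dir; the logs path <output_dir>/<task_name>/logs/stdout.log referenced under Troubleshooting below follows this layout. For the three tasks above, the results directory would look roughly like this (files other than the logs are omitted):

<output_dir>/
├── mmlu_pro/
│   └── logs/
│       └── stdout.log
├── gsm8k/
│   └── logs/
│       └── stdout.log
└── humaneval/
    └── logs/
        └── stdout.log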

Task-Specific Configuration#

evaluation:
  tasks:
    - name: gpqa_diamond
      overrides:
        config.params.temperature: 0.6
        config.params.top_p: 0.95
        config.params.max_new_tokens: 8192
        config.params.parallelism: 4
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND
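
Assuming env_vars follows the same pattern as api_key_name (the value names an environment variable on your machine), set the host-side variable before the run:

# Host-side variable that HF_TOKEN inside the container reads from
export HF_TOKEN_FOR_GPQA_DIAMOND=<your-hugging-face-token>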

With Adapter Configuration#

Configure adapters using evaluation overrides:

target:
  api_endpoint:
    url: http://localhost:8080/v1/chat/completions
    model_id: my-model

evaluation:
  overrides:
    target.api_endpoint.adapter_config.use_reasoning: true
    target.api_endpoint.adapter_config.use_system_prompt: true
    target.api_endpoint.adapter_config.custom_system_prompt: "Think step by step."

For detailed adapter configuration options, refer to Evaluation Adapters.
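
The same dotted keys should also work as one-off -o overrides on the command line, assuming -o accepts the paths used in evaluation.overrides:

# Enable reasoning handling for a single run without editing the YAML
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.adapter_config.use_reasoning=true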

Command-Line Usage#

Basic Commands#

# Run evaluation
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct

# Dry run to preview configuration
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    --dry-run

# Override endpoint URL
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    -o target.api_endpoint.url=http://localhost:8080/v1/chat/completions

Job Management#

# Check job status
nv-eval status <job_id>

# Check entire invocation
nv-eval status <invocation_id>

# Kill running job
nv-eval kill <job_id>

# List available tasks
nv-eval ls tasks

# List recent runs
nv-eval ls runs

Requirements#

System Requirements#

  • Docker: Docker Engine installed and running

  • Storage: Adequate space for evaluation containers and results

  • Network: Internet access to pull Docker images

Model Endpoint#

You must have a model endpoint running and accessible before starting evaluation. Based on the examples in this page, options include:

  • A locally hosted OpenAI-compatible server (for example, the http://localhost:8080/v1/chat/completions endpoint used above)

  • A hosted API endpoint such as https://integrate.api.nvidia.com/v1/chat/completions
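
As a sketch of the first option, a local server can be started with vLLM; this assumes vLLM is installed and you can download the model weights, and is independent of nv-eval itself:

# Serve an OpenAI-compatible endpoint on port 8080 (vLLM example)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8080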

Troubleshooting#

Docker Issues#

Docker not running:

# Check Docker status
docker ps

# Start Docker daemon (varies by platform)
sudo systemctl start docker  # Linux
# Or open Docker Desktop on macOS/Windows

Permission denied:

# Add user to docker group (Linux)
sudo usermod -aG docker $USER
# Log out and back in for changes to take effect

Endpoint Connectivity#

Cannot connect to endpoint:

# Test endpoint availability
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "test", "messages": [{"role": "user", "content": "Hi"}]}'

API authentication errors:

  • Verify api_key_name matches your environment variable

  • Check that the environment variable has a value: echo $API_KEY

  • Check API key has proper permissions

Evaluation Issues#

Job hangs or shows no progress:

Check logs in the output directory:

# Track logs in real-time
tail -f <output_dir>/<task_name>/logs/stdout.log

# Kill and restart if needed
nv-eval kill <job_id>

Tasks fail with errors:

  • Check logs in <output_dir>/<task_name>/logs/stdout.log

  • Verify model endpoint supports required request format

  • Ensure adequate disk space for results

Configuration Validation#

# Validate configuration before running
nv-eval run \
    --config-dir examples \
    --config-name local_llama_3_1_8b_instruct \
    --dry-run

Next Steps#

To have the launcher deploy the model for you instead of connecting to an existing endpoint, see Slurm Deployment via Launcher or Lepton AI Deployment via Launcher.