Launcher Issues
Troubleshooting guide for problems specific to the NeMo Evaluator Launcher, including configuration validation, job management, and multi-backend execution issues.
Configuration Issues
Configuration Validation Errors
Problem: Configuration fails validation before execution
Solution: Use dry-run to validate configuration:
# Validate configuration without running
nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name local_llama_3_1_8b_instruct --dry-run
Common Issues:
Missing Required Fields
Error: Missing required field 'execution.output_dir'
Fix: Set the output directory in the config file or override it on the command line:
nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name local_llama_3_1_8b_instruct \
-o execution.output_dir=./results
Invalid Task Names
Error: Unknown task 'invalid_task'. Available tasks: hellaswag, arc_challenge, ...
Fix: List available tasks and use correct names:
nemo-evaluator-launcher ls tasks
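To check a single name against the listing rather than reading it by eye, the output can be piped through a small filter. This is a minimal sketch: `task_listed` is a hypothetical helper, and it assumes the task name appears as a whole word somewhere in the `ls tasks` output, whose exact layout is not documented here.

```shell
# Hypothetical helper: succeed if the given task name appears as a whole
# word in the text read from standard input.
task_listed() {
  grep -qwF -- "$1"
}

# Example (requires the launcher to be installed):
# nemo-evaluator-launcher ls tasks | task_listed hellaswag && echo "task available"
```

Because the match is word-based, a short name can still match inside a hyphenated or dotted longer name, so treat a positive result as a hint and confirm against the full listing.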
Configuration Conflicts
Error: Cannot specify both 'api_key' and 'api_key_name' in target.api_endpoint
Fix: Use only one authentication method in configuration.
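A quick text scan can catch this conflict before the launcher does. The sketch below is an illustration only: `check_auth_conflict` is a hypothetical helper, and it looks for the two keys anywhere in the file rather than specifically under `target.api_endpoint`, so it can report a conflict when the keys live in unrelated sections.

```shell
# Hypothetical helper: fail if a YAML config file sets both api_key and
# api_key_name. Assumes each key starts a (possibly indented) line.
check_auth_conflict() {
  if grep -Eq '^[[:space:]]*api_key:' "$1" &&
     grep -Eq '^[[:space:]]*api_key_name:' "$1"; then
    echo "conflict: both api_key and api_key_name are set in $1"
    return 1
  fi
  echo "ok: at most one authentication key is set in $1"
}

# Example:
# check_auth_conflict /absolute/path/to/configs/my_config.yaml
```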
Hydra Configuration Errors
Problem: Hydra fails to resolve configuration composition
Common Errors:
MissingConfigException: Cannot find primary config 'missing_config'
Solutions:
Verify Config Directory:
# List available configs
ls packages/nemo-evaluator-launcher/examples/
# Ensure config file exists
ls packages/nemo-evaluator-launcher/examples/local_llama_3_1_8b_instruct.yaml
Check Config Composition:
# Verify defaults section in config file
defaults:
  - execution: local
  - deployment: none
  - _self_
Use Absolute Paths:
nemo-evaluator-launcher run --config-dir /absolute/path/to/configs --config-name my_config
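The directory and file checks above can be combined into one guard that runs before the launcher. This is a minimal sketch, assuming the primary config is a file named `<config_name>.yaml` directly inside the config directory; `check_config` is a hypothetical helper, not part of the launcher.

```shell
# Hypothetical helper: confirm that <config_dir>/<config_name>.yaml exists.
# $1 = config directory, $2 = config name (without the .yaml extension)
check_config() {
  if [ ! -d "$1" ]; then
    echo "config dir not found: $1"
    return 1
  fi
  if [ ! -f "$1/$2.yaml" ]; then
    echo "config file not found: $1/$2.yaml"
    return 1
  fi
  echo "found: $1/$2.yaml"
}

# Example (requires the launcher to be installed):
# check_config /absolute/path/to/configs my_config &&
#   nemo-evaluator-launcher run --config-dir /absolute/path/to/configs --config-name my_config --dry-run
```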
Job Management Issues
Job Status Problems
Problem: Cannot check job status or jobs appear stuck
Diagnosis:
# Check the status of every job in an invocation
nemo-evaluator-launcher status <invocation_id>
# List all runs
nemo-evaluator-launcher ls runs
# Check a specific job
nemo-evaluator-launcher status <job_id>
Common Issues:
Invalid Invocation ID:
Error: Invocation 'abc123' not found
Fix: Use correct invocation ID from run output or list recent runs:
nemo-evaluator-launcher ls runs
Stale Job Database:
Fix: Check the execution database location and permissions:
# Database location
ls -la ~/.nemo-evaluator/exec-db/exec.v1.jsonl
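The same check can be made explicit about what is wrong. The sketch below is a hypothetical helper (`check_exec_db`) that defaults to the database path shown above. It assumes the launcher needs both read and write access to that file, which is a guess rather than documented behavior.

```shell
# Hypothetical helper: report whether the execution database file exists
# and is readable and writable by the current user.
# $1 = optional path; defaults to the location shown above.
check_exec_db() {
  set -- "${1:-$HOME/.nemo-evaluator/exec-db/exec.v1.jsonl}"
  [ -f "$1" ] || { echo "missing: $1"; return 1; }
  [ -r "$1" ] || { echo "not readable: $1"; return 1; }
  [ -w "$1" ] || { echo "not writable: $1"; return 1; }
  echo "ok: $1 is readable and writable"
}

# Example:
# check_exec_db
```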
Job Termination Issues
Problem: Cannot kill running jobs
Solutions:
# Kill entire invocation
nemo-evaluator-launcher kill <invocation_id>
# Kill specific job
nemo-evaluator-launcher kill <job_id>
Executor-Specific Issues:
Local: Jobs run in Docker containers, so ensure the Docker daemon is running
Slurm: Check Slurm queue status with squeue
Lepton: Verify Lepton workspace connectivity
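The per-executor checks above can be sketched as a single function that verifies the relevant command-line tool is present. The helper names `require_cmd` and `check_executor` are hypothetical. For the local executor the sketch also runs `docker info`, which exits with a non-zero status when the daemon cannot be reached.

```shell
# Hypothetical helper: succeed if a command is available on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Hypothetical helper: run the basic tool check for one executor type.
check_executor() {
  case "$1" in
    local)  require_cmd docker && docker info >/dev/null 2>&1 ;;
    slurm)  require_cmd squeue ;;
    lepton) require_cmd lep ;;
    *)      echo "unknown executor: $1"; return 1 ;;
  esac
}

# Example:
# check_executor local || echo "Docker is not installed or the daemon is not running"
```

A passing check only shows that the tool responds. It does not confirm that a particular job can be killed, so the `kill` commands above are still the next step.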
Multi-Backend Execution Issues
Local Executor Problems
Problem: Docker-related execution failures
Common Issues:
Docker Not Running:
Error: Cannot connect to Docker daemon
Fix: Start Docker daemon:
# macOS/Windows: Start Docker Desktop
# Linux:
sudo systemctl start docker
Container Pull Failures:
Error: Failed to pull container image
Fix: Check network connectivity and container registry access.
Slurm Executor Problems
Problem: Jobs fail to submit to Slurm cluster
Diagnosis:
# Check Slurm cluster status
sinfo
squeue -u $USER
# Check partition availability
sinfo -p <partition_name>
Common Issues:
Invalid Partition:
Error: Invalid partition name 'gpu'
Fix: Use correct partition name:
# List available partitions
sinfo -s
Resource Unavailable:
Error: Insufficient resources for job
Fix: Adjust resource requirements:
execution:
  num_nodes: 1
  gpus_per_node: 2
  walltime: "2:00:00"
Lepton Executor Problems
Problem: Lepton deployment or execution failures
Diagnosis:
# Check Lepton authentication
lep workspace list
# Test connection
lep deployment list
Common Issues:
Authentication Failure:
Error: Invalid Lepton credentials
Fix: Re-authenticate with Lepton:
lep login -c <workspace_name>:<your_token>
Deployment Timeout:
Error: Deployment failed to reach Ready state
Fix: Check Lepton workspace capacity and deployment status.
Export Issues
Export Failures
Problem: Results export fails to destination
Diagnosis:
# List completed runs
nemo-evaluator-launcher ls runs
# Try export
nemo-evaluator-launcher export <invocation_id> --dest local --format json
Common Issues:
Missing Dependencies:
Error: MLflow not installed
Fix: Install required exporter dependencies:
pip install "nemo-evaluator-launcher[mlflow]"
Authentication Issues:
Error: Invalid W&B credentials
Fix: Configure authentication for export destination:
# W&B
wandb login
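In non-interactive environments such as CI jobs, credentials are often supplied through environment variables instead of an interactive login. The sketch below checks whether a variable is set without printing its value. `env_is_set` is a hypothetical helper. `WANDB_API_KEY` and `MLFLOW_TRACKING_URI` are variables read by the W&B and MLflow client libraries themselves, and whether the launcher's exporters rely on them is an assumption.

```shell
# Hypothetical helper: report whether an environment variable is set to a
# non-empty value, without echoing the value (it may be a secret).
env_is_set() {
  if [ -n "$(printenv "$1")" ]; then
    echo "set: $1"
  else
    echo "unset: $1"
    return 1
  fi
}

# Examples:
# env_is_set WANDB_API_KEY
# env_is_set MLFLOW_TRACKING_URI
```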
Getting Help
Debug Information Collection
When reporting launcher issues, include:
Configuration Details:
# Show resolved configuration
nemo-evaluator-launcher run --config-dir packages/nemo-evaluator-launcher/examples --config-name <config> --dry-run
System Information:
# Launcher version
nemo-evaluator-launcher --version
# System info
python --version
docker --version # For local executor
sinfo # For Slurm executor
lep workspace list # For Lepton executor
Job Information:
# Job status
nemo-evaluator-launcher status <invocation_id>
# Recent runs
nemo-evaluator-launcher ls runs
Log Files:
Local executor: Check <output_dir>/<task_name>/logs/stdout.log
Slurm executor: Check the job output files in the output directory
Lepton executor: Check the Lepton job logs via the Lepton CLI
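The version and run-listing commands above can be gathered into one report to attach to an issue. This is a minimal sketch: `run_section` and `collect_debug_info` are hypothetical helpers, and the report only includes commands already shown in this section. Tools that are not installed are noted instead of causing the script to stop.

```shell
# Hypothetical helper: print a heading, then the command's output, or a
# note when the command is not installed or exits with an error.
run_section() {
  echo "== $* =="
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(exited with status $?)"
  else
    echo "(not installed: $1)"
  fi
  echo
}

# Hypothetical helper: collect the system and job information listed above.
collect_debug_info() {
  run_section nemo-evaluator-launcher --version
  run_section python --version
  run_section docker --version
  run_section nemo-evaluator-launcher ls runs
}

# Example:
# collect_debug_info > launcher-debug.txt
```

Review the resulting file before sharing it, since run listings may include paths or names you prefer to keep private.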
For complex issues, see the Python API documentation.