Troubleshooting

Comprehensive troubleshooting guide for NeMo Evaluator evaluations, organized by problem type and complexity level.

This section provides systematic approaches to diagnose and resolve evaluation issues. Start with the quick diagnostics below to verify your basic setup, then navigate to the appropriate troubleshooting category based on where your issue occurs in the evaluation workflow.

Quick Start

Before diving into specific problem areas, run these basic checks to verify your evaluation environment:

Launcher Quick Check

# Verify launcher installation and basic functionality
nv-eval --version

# List available tasks
nv-eval ls tasks

# Validate configuration without running
nv-eval run --config-dir examples --config-name local_llama_3_1_8b_instruct --dry-run

# Check recent runs
nv-eval ls runs
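
The launcher checks above can also be scripted, which helps when the same verification has to run in several environments. The following is a minimal sketch, not part of NeMo Evaluator: `run_checks` and `LAUNCHER_CHECKS` are hypothetical names, the commands are the ones from the quick check (the `--dry-run` step is left out because it depends on a specific example config), and it assumes a zero exit code means the command succeeded.

```python
import shutil
import subprocess

# Commands copied from the launcher quick check above.
LAUNCHER_CHECKS = (
    ("nv-eval", "--version"),
    ("nv-eval", "ls", "tasks"),
    ("nv-eval", "ls", "runs"),
)


def run_checks(commands=LAUNCHER_CHECKS, timeout=60):
    """Run each command and return a list of (command string, outcome) pairs."""
    results = []
    for args in commands:
        label = " ".join(args)
        # Report a missing executable instead of raising FileNotFoundError.
        if shutil.which(args[0]) is None:
            results.append((label, "not found on PATH"))
            continue
        try:
            proc = subprocess.run(
                list(args), capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            results.append((label, f"timed out after {timeout}s"))
            continue
        # Assumption: exit code 0 means the check passed.
        outcome = "ok" if proc.returncode == 0 else f"exit code {proc.returncode}"
        results.append((label, outcome))
    return results


if __name__ == "__main__":
    for label, outcome in run_checks():
        print(f"{label}: {outcome}")
```

A "not found on PATH" outcome for every command suggests the launcher is not installed in the active Python environment.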

Model Endpoint Check

import requests

# Check health endpoint (adjust based on your deployment)
# vLLM/SGLang/NIM: use /health
# NeMo/Triton: use /v1/triton_health
health_response = requests.get("http://0.0.0.0:8080/health", timeout=5)
print(f"Health Status: {health_response.status_code}")

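# Test completions endpoint
test_payload = {
    "prompt": "Hello",
    "model": "megatron_model",  # replace with the model name your endpoint serves
    "max_tokens": 5
}
# Set a timeout so a hung endpoint cannot block the check indefinitely
response = requests.post("http://0.0.0.0:8080/v1/completions/", json=test_payload, timeout=30)
print(f"Completions Status: {response.status_code}")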
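
Because the health path differs by deployment, it can be convenient to probe both candidates and see which one answers. The sketch below is illustrative only: `probe_health` and `HEALTH_PATHS` are hypothetical names, and it uses the standard-library `urllib` rather than `requests` so that it runs without extra dependencies.

```python
import urllib.error
import urllib.request

# Health paths named in the check above; which one applies depends on the deployment.
HEALTH_PATHS = ("/health", "/v1/triton_health")


def probe_health(base_url, paths=HEALTH_PATHS, timeout=5.0):
    """Return {path: HTTP status code, or an error string if nothing answered}."""
    results = {}
    for path in paths:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as exc:
            # The server answered, but with a 4xx/5xx status.
            results[path] = exc.code
        except (urllib.error.URLError, OSError) as exc:
            # No answer at all: refused connection, DNS failure, or timeout.
            results[path] = f"unreachable: {exc}"
    return results
```

Calling `probe_health("http://0.0.0.0:8080")` (the address assumed in the check above) returns one entry per path. A 200 on one path and a 404 on the other indicates which health route the deployment exposes, while "unreachable" on both points to the server not running or listening on a different host or port.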

Core API Check

try:
    # Import inside the try block so a missing package is caught below
    from nemo_evaluator import show_available_tasks

    print("Available frameworks and tasks:")
    show_available_tasks()
except ImportError as e:
    print(f"Missing dependency: {e}")

Troubleshooting Categories

Choose the category that best matches your issue for targeted solutions and debugging steps.

Setup & Installation

Installation problems, authentication setup, and model deployment issues to get NeMo Evaluator running.

Setup and Installation Issues

Runtime & Execution

Configuration validation and launcher management during evaluation execution.

Runtime and Execution Issues

Getting Help

Log Collection

When reporting issues, include:

System Information:

python --version
pip list | grep nvidia
nvidia-smi

Configuration Details:

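# eval_cfg and target_cfg are the evaluation config and target objects
# defined in your own evaluation script
print(f"Task: {eval_cfg.type}")
print(f"Endpoint: {target_cfg.api_endpoint.url}")
print(f"Model: {target_cfg.api_endpoint.model_id}")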

Error Messages: Full stack traces and error logs
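
The system information listed above can be gathered into a single text block that is easy to paste into an issue. This is a minimal sketch under stated assumptions: `run_command`, `collect_system_info`, and `format_report` are hypothetical helper names, not part of NeMo Evaluator, and a missing tool such as `nvidia-smi` is reported as a note rather than treated as an error.

```python
import platform
import shutil
import subprocess
import sys


def run_command(args, timeout=30):
    """Return a command's output, or a short note if it cannot be run."""
    if shutil.which(args[0]) is None:
        return f"{args[0]}: not found on PATH"
    try:
        proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{args[0]}: could not run ({exc})"
    # Fall back to stderr when the command printed nothing to stdout.
    return (proc.stdout or proc.stderr).strip()


def collect_system_info():
    """Gather the system information items listed above into a dictionary."""
    pip_output = run_command([sys.executable, "-m", "pip", "list"])
    # Keep only lines that mention nvidia, mirroring `pip list | grep nvidia`.
    nvidia_lines = [line for line in pip_output.splitlines() if "nvidia" in line]
    return {
        "python": platform.python_version(),
        "nvidia_packages": "\n".join(nvidia_lines) or "(no matching packages)",
        "nvidia_smi": run_command(["nvidia-smi"]),
    }


def format_report(info):
    """Render the dictionary as plain text suitable for pasting into an issue."""
    sections = [f"== {name} ==\n{value}" for name, value in info.items()]
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(format_report(collect_system_info()))
```

Attach the printed report together with the configuration details and the full stack trace.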

Community Resources

GitHub Issues: NeMo Evaluator Issues
Discussions: GitHub Discussions
Documentation: NeMo Evaluator Documentation

Professional Support

For enterprise support, contact: nemo-toolkit@nvidia.com