Troubleshooting NeMo Evaluator#

Use this documentation to troubleshoot issues that can arise when you work with NVIDIA NeMo Evaluator.

Tip

You can get logs for COMPLETED or FAILED jobs and use them to help troubleshoot. To retrieve them, see Evaluation Job Logs later on this page.


Dataset {dataset} is not in the expected format; it needs to have the files_url property set#

This means that either the files_url is not provided as part of the dataset specification in the config, or that it is not provided in the expected format. The dataset must be a JSON object with the files_url property set, pointing to the path of the file in the NeMo Data Store in the format: hf://datasets/<dataset-namespace>/<dataset-name>/<file-path>.
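
For illustration, the following sketch shows how the pieces combine into a valid files_url. The namespace, dataset name, and file path are hypothetical placeholders; substitute your own values.

DATASET_NAMESPACE="my-organization"   # hypothetical namespace
DATASET_NAME="my-eval-dataset"        # hypothetical dataset name
FILE_PATH="input.jsonl"               # hypothetical file path inside the dataset

# Print the files_url in the format that the dataset specification expects.
echo "hf://datasets/${DATASET_NAMESPACE}/${DATASET_NAME}/${FILE_PATH}"
# hf://datasets/my-organization/my-eval-dataset/input.jsonl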

Error connecting to inference server#

This means that, for a custom evaluation, the evaluator is unable to connect to the target LLM endpoint.
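
As a first check, verify that the target endpoint is reachable from the environment where Evaluator runs. The following sketch assumes an OpenAI-compatible endpoint and uses a hypothetical URL; substitute the URL configured in your evaluation target.

# Hypothetical target endpoint URL.
TARGET_URL="https://nim.internal.your-company.com"

# For an OpenAI-compatible endpoint, listing the available models is a quick connectivity check.
curl -s "${TARGET_URL}/v1/models"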

Error occurred while checking the existence of file {file_ref} on NeMo Data Store#

This could mean that the dataset is not specified correctly, or that the NeMo Data Store itself is unresponsive.

  • Verify that the files_url is correct and that the dataset and file exist in the NeMo Data Store.

  • Verify that the NeMo Data Store is responsive and reachable.

If the error contains the string Dataset {file_ref} is not present on datastore, it means that the datastore is responsive, but the file reference does not exist.
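
To check the file manually, you can query NeMo Data Store directly. The following sketch assumes that the Data Store exposes its Hugging Face-compatible API under /v1/hf and that the file is on the default main revision; the URL and placeholders are illustrative.

# Hypothetical NeMo Data Store URL.
DATA_STORE_URL="http://nemo-data-store.internal.your-company.com"

# A HEAD request for the file referenced by files_url; a 200 response means the file exists.
curl -I "${DATA_STORE_URL}/v1/hf/datasets/<dataset-namespace>/<dataset-name>/resolve/main/<file-path>"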

Evaluation Job Takes a Long Time#

The time that an evaluation job takes can vary from a few minutes to many hours, depending on the target model, the config, and other factors. As long as the status is running, your job is still in progress. If there is a problem with your job, the status changes to unavailable or failed. For more information, see Expected Evaluation Duration.
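
To check the current status of a job, query the job details endpoint described in Get Evaluation Job Details. The path below follows the same pattern as the download-results endpoint shown later on this page.

curl -X 'GET' \
  "https://${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/<job-id>" \
  -H 'accept: application/json'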

Invalid parameters specified for filter#

This error means that an invalid parameter was supplied in a GET request that uses a filter. For the parameters that you can use to filter queries, refer to Filter and Sort Responses from the NVIDIA NeMo Evaluator API.
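
As an illustration only, a filtered listing might look like the following sketch. The filter[status] parameter name here is an assumption; refer to the linked documentation for the exact parameter names and syntax.

curl -X 'GET' \
  "https://${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs?filter[status]=failed" \
  -H 'accept: application/json'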

Job cannot be launched#

This means that one of the pre-launch validations has failed. The error contains the details about the checks that failed.

Missing required environment variable#

This error means that a required environment variable is not set correctly. For example, DATA_STORE_URL is not set in the NeMo Evaluator deployment.
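
To confirm which environment variables the running container actually has, you can inspect the NeMo Evaluator deployment. The namespace and deployment name below are hypothetical; use the names from your own deployment.

# Print the container environment and look for the variable reported as missing.
kubectl -n nemo-evaluation exec deploy/myrelease-nemo-evaluator -- env | grep DATA_STORE_URL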

Unable to launch the evaluation job because there is a problem in the target, config, or environment#

This error appears in the status message of the job. It means that the evaluation config does not contain all the required parameters. Refer to the evaluation config documentation for examples.

It might also mean that the target is not correctly set. Either the target is not reachable, or its parameters, for example, model_id, are not correctly set. Refer to the evaluation target documentation for examples.
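
To review the target's parameters, you can fetch the target from the Evaluator API. This sketch assumes that the targets endpoint follows the same namespace/name URL pattern as the jobs endpoint; the namespace and name are placeholders.

curl -X 'GET' \
  "https://${EVALUATOR_SERVICE_URL}/v1/evaluation/targets/<my-target-namespace>/<my-target-name>" \
  -H 'accept: application/json'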

Unavailable Job#

An evaluation job can be marked as unavailable for many reasons, including the following:

  • Argo Workflows is not available

  • NeMo Data Store is not available

  • An unexpected termination of the evaluation job pod

The status detail of an unavailable job contains Unable to get job status because the infrastructure is unresponsive. To troubleshoot the root cause, query the evaluation job details, and check the detailed status message returned in the status JSON object. For more information, see Get Evaluation Job Details.
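
To check whether the supporting infrastructure is healthy, you can also list the pods in the deployment namespace. The namespace and the name patterns in the grep are hypothetical; adjust them to your deployment.

# Look for Argo Workflows, NeMo Data Store, and evaluation job pods that are not Running.
kubectl get pods -n nemo-evaluation | grep -Ei 'argo|data-store|eval'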

Unsupported metric type#

An unsupported metric was provided for custom evaluation. Check the custom evaluation documentation for supported metrics.

What is EVALUATOR_SERVICE_URL?#

EVALUATOR_SERVICE_URL is a placeholder for the URL of the evaluator API that is used in the examples in this documentation. The URL of the evaluator API depends on where you deploy Evaluator and how you configure it. For example, your evaluator API URL might look like evaluator.internal.your-company.com.

If you are running Evaluator in a local Kubernetes minikube, be sure to enable ingress by using the following code.

minikube addons enable ingress

You can also port-forward the evaluator service endpoint by using the following code. After you port-forward, the EVALUATOR_SERVICE_URL would be localhost:7331.

kubectl port-forward svc/nemo-evaluator 7331:7331

nemo-evaluator is the default service name. You can verify the service name by using the following code in the namespace where NeMo Evaluator is deployed.

kubectl get svc  
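
After port-forwarding, you can set the placeholder and make a quick test request. The jobs listing endpoint used here is an assumption based on the job endpoints shown elsewhere on this page.

export EVALUATOR_SERVICE_URL="localhost:7331"

# Smoke test through the forwarded port; note the http scheme for a local port-forward.
curl -s "http://${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs" -H 'accept: application/json'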

Advanced Troubleshooting#

Evaluation jobs (all config types except custom) use Argo Workflows to orchestrate the workloads. To troubleshoot an evaluation job that has failed, you can check the evaluation job logs and pod logs.

Warning

These are advanced troubleshooting steps that should only be done after all other troubleshooting fails.

Prerequisites#

Before you can use these steps, you need the following:

  • All other troubleshooting steps in this documentation already attempted

  • Basic knowledge of Kubernetes

  • Access to the Kubernetes cluster where the service is deployed

  • kubectl installed and configured to access that cluster

  • Argo CLI installed

Evaluation Job Logs#

To download the log files, use the download-results endpoint. This endpoint downloads the result directory containing configuration files, logs, and evaluation results for a specific evaluation run. The result directory is packaged and provided as a downloadable archive.

To download the evaluation results directory, use the following code.

curl -X 'GET' \
  '<BASE_URL>/v1/evaluation/jobs/<job-id>/download-results' \
  -H 'accept: application/json' \
  -o result.zip

After the download is complete, the log files are available inside the result.zip file. Log files can be found in the results folder with the file extension *.log.
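
For example, you can extract the archive and list the log files as follows.

unzip -q result.zip -d result

# Log files are in the results folder and use the .log extension.
find result -name '*.log'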

Kubernetes Pod Logs#

Use this procedure to check the Kubernetes pod logs.

  1. To find the workflow ID for the evaluation job, query the logs of the NeMo Evaluator deployment pod. For example, consider the following deployment:

    • release name: myrelease

    • pod name: myrelease-nemo-evaluator-68b4bydh-eidjf

    • namespace: nemo-evaluation

    Run the following code.

    kubectl logs -n nemo-evaluation myrelease-nemo-evaluator-68b4bydh-eidjf
    

    Look for log lines that start with Eval workflow submitted:eval-commands-, where eval-commands- followed by a five-character ID is the name of the workflow submitted to Argo Workflows.

  2. Check the workflow logs.

    • (Option A) Check the workflow logs by using the Argo UI.

      1. Find the Argo service by running the following code. The service whose name contains argo-workflows-server is the Argo service. For example, for release myrelease, the Argo service is myrelease-argo-workflows-server.

        ```bash
        kubectl get service -n nemo-evaluation
        ```
        
      2. Port forward the UI by running the following code.

        ```bash
        kubectl port-forward -n nemo-evaluation svc/{Argo Service} 2746:2746
        ```
        
      3. Open https://localhost:2746 and click on the Argo Workflow ID that was identified in the first step. This provides a visual representation of the workflow.

    • (Option B) Check the workflow logs by using the Argo CLI. Run the following code. From these logs, you can identify the pod that failed and its error message.

      argo logs -n nemo-evaluation {Argo Workflow ID}
      
  3. (Optional) If workflow logs do not resolve your issue, you can check the pod logs.

    • (Option A) Check the pod logs by using the Argo UI. Click on an individual pod in the visual representation of the workflow from the previous step. Then, click LOGS to view detailed logs of the pod.

    • (Option B) Check the pod logs by using the Argo CLI. Run the following command for the failed pod to view its detailed logs.

      kubectl logs -n nemo-evaluation {Argo Workflow Pod}
      

Skip validation checks#

When you launch an evaluation job, NeMo Evaluator performs availability checks (for example, checking whether the dataset and its files exist in NeMo Data Store). To speed up job launch, or to bypass validation checks that are too strict for your use case, you can pass the query parameter skip_validation_checks when you launch the job.

Use the following code to create an evaluation job that skips validation checks.

curl -X 'POST' \
  "https://${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs?skip_validation_checks=True" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "namespace": "my-organization",
    "target": "<my-target-namespace/my-target-name>",
    "config": "<my-config-namespace/my-config-name>"
}'

The same request using Python and the requests library:

import requests

data = {
   "namespace": "my-organization",
   "target": "<my-target-namespace/my-target-name>",
   "config": "<my-config-namespace/my-config-name>"
}

# EVALUATOR_SERVICE_URL is the placeholder described earlier on this page. When you define it
# as a Python variable, include the scheme, for example "http://localhost:7331".
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs?skip_validation_checks=True"

response = requests.post(endpoint, json=data).json()