Troubleshooting NeMo Evaluator#
Use this documentation to troubleshoot issues that can arise when you work with NVIDIA NeMo Evaluator.
Tip
You can get logs for COMPLETED or FAILED jobs and use them to help troubleshoot.
Dataset {dataset} is not in the expected format; it needs to have the files_url property set#
This means that either the files_url is not provided as part of the dataset specification in the config, or that the files_url is not provided in the expected format.
The dataset must be a JSON object with the files_url property set, pointing to the path of the file in the NeMo Data Store in the format hf://datasets/<dataset-namespace>/<dataset-name>/<file-path>.
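For example, a dataset specification that satisfies this requirement might look like the following sketch; the namespace, dataset name, and file path are placeholders:
{
  "files_url": "hf://datasets/my-organization/my-dataset/testing/input.jsonl"
}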
Error connecting to inference server#
This means that, for a custom evaluation, the evaluator cannot connect to the target LLM endpoint.
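As a first check, you can verify that the target endpoint is reachable from the environment where Evaluator runs. For example, if your target exposes an OpenAI-compatible API (as NIM endpoints typically do), you can list the served models; the URL below is a placeholder for your target LLM endpoint:
curl -s "<target-llm-endpoint-url>/v1/models"
If Evaluator runs inside a Kubernetes cluster, run the same check from inside the cluster, because an endpoint that is reachable from your workstation might not be reachable from the evaluator pods.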
Error occurred while checking the existence of file {file_ref} on NeMo Data Store#
This could mean that the dataset is not specified correctly, or that the NeMo Data Store itself is unresponsive.
Verify that the files URL is correct and that the dataset and file exist in the NeMo Data Store.
Verify that the NeMo Data Store is responsive and reachable.
If the error contains the string Dataset {file_ref} is not present on datastore, it means that the data store is responsive, but the file reference does not exist.
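As a quick check that the data store responds at all, you can send a request to the URL that is configured as DATA_STORE_URL in your deployment; the URL below is a placeholder:
curl -s -o /dev/null -w "%{http_code}\n" "<data-store-url>"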
Evaluation Job Takes a Long Time#
The time that an evaluation job takes can vary from a few minutes to many hours, depending on the target model, config, and other factors.
As long as the status is running, your job is still running.
If there is a problem with your job, you will see unavailable or failed.
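To check the current status of a job, you can query the job endpoint; <job-id> is a placeholder for your job ID, and the exact response fields depend on your Evaluator version:
curl -X 'GET' "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs/<job-id>" -H 'accept: application/json'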
For more information, see Expected Evaluation Duration.
Invalid parameters specified for filter#
This error means that an invalid parameter was specified in a GET request that uses a filter.
For the parameters that you can use to filter queries, refer to Filter and Sort Responses from the NVIDIA NeMo Evaluator API.
Job cannot be launched#
This means that one of the pre-launch validations has failed. The error contains the details about the checks that failed.
Missing required environment variable#
This error means that a required environment variable is not set correctly.
For example, DATA_STORE_URL is not set in the NeMo Evaluator deployment.
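To see which environment variables the NeMo Evaluator container was actually started with, you can inspect the deployment spec; the deployment and namespace names are placeholders for your own deployment:
kubectl get deployment <evaluator-deployment> -n <namespace> -o jsonpath='{.spec.template.spec.containers[0].env}'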
Unable to launch the evaluation job because there is a problem in the target, config, or environment#
This error appears in the status message of the job. It means that the evaluation config does not contain all the required parameters. Refer to the evaluation config documentation for examples.
It might also mean that the target is not correctly set. Either the target is not reachable, or its parameters, for example, model_id, are not correctly set. Refer to the evaluation target documentation for examples.
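To review what the job is actually using, you can retrieve the target and config objects and check their parameters, assuming your deployment exposes the standard targets and configs endpoints; the namespaces and names are placeholders:
curl -X 'GET' "${EVALUATOR_SERVICE_URL}/v1/evaluation/targets/<my-target-namespace>/<my-target-name>" -H 'accept: application/json'
curl -X 'GET' "${EVALUATOR_SERVICE_URL}/v1/evaluation/configs/<my-config-namespace>/<my-config-name>" -H 'accept: application/json'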
Unsupported metric type#
An unsupported metric was provided for custom evaluation. Check the custom evaluation documentation for supported metrics.
What is EVALUATOR_SERVICE_URL?#
EVALUATOR_SERVICE_URL is a placeholder for the URL of the evaluator API that is used in the examples in this documentation.
The URL of the evaluator API depends on where you deploy Evaluator and how you configure it.
For example, your evaluator API URL might look like evaluator.internal.your-company.com.
To install Evaluator in a Kubernetes minikube environment, see Demo Cluster Setup on Minikube.
To install Evaluator in a deployment environment, see Deploy the NeMo Evaluator Microservice.
If you are running Evaluator in a local Kubernetes minikube, be sure to enable ingress by using the following code.
minikube addons enable ingress
You can also port-forward the evaluator service endpoint by using the following code.
After you port-forward, the EVALUATOR_SERVICE_URL would be localhost:7331.
kubectl port-forward svc/nemo-evaluator 7331:7331
nemo-evaluator is the default service name.
You can verify the service name by using the following code in the namespace where the deployment is located.
kubectl get svc
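After the port-forward is in place, you can set the placeholder and confirm that the API responds. This sketch assumes the jobs listing endpoint is available in your deployment:
export EVALUATOR_SERVICE_URL=localhost:7331
curl -X 'GET' "${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs" -H 'accept: application/json'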
Advanced Troubleshooting#
Evaluation jobs (all config types except custom) use Argo Workflows to orchestrate the workloads.
To troubleshoot an evaluation job that has failed, you can check the evaluation job logs and pod logs.
Warning
These are advanced troubleshooting steps that should only be done after all other troubleshooting fails.
Prerequisites#
Before you can use these steps, you need the following:
You have already tried all other troubleshooting steps in this documentation
Basic knowledge of Kubernetes
Access to the Kubernetes cluster where the service is deployed
kubectl installed and configured to access the Kubernetes cluster where the service is deployed
Argo CLI installed (you can verify both CLIs with the commands after this list)
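You can confirm that both CLIs are installed and can reach the cluster by checking their versions:
kubectl version
argo version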
Evaluation Job Logs#
To download the log files, use the download-results endpoint.
This endpoint downloads the result directory containing configuration files, logs, and evaluation results for a specific evaluation run.
The result directory is packaged and provided as a downloadable archive.
To download the evaluation results directory, use the following code.
curl -X 'GET' \
'<BASE_URL>/v1/evaluation/jobs/<job-id>/download-results' \
-H 'accept: application/json' \
-o result.zip
After the download is complete, the log files are available inside the result.zip file.
Log files can be found in the results folder with the file extension *.log.
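For example, you can extract the archive and list the log files as follows.
unzip result.zip
find . -name "*.log"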
Kubernetes Pod Logs#
Use this procedure to check the Kubernetes pod logs.
To find the workflow ID for the evaluation job, you can query the logs of the NeMo Evaluator deployment pod. The following example assumes this deployment:
Release name: myrelease
Pod name: myrelease-nemo-evaluator-68b4bydh-eidjf
Namespace: nemo-evaluation
Run the following code.
kubectl logs -n nemo-evaluation myrelease-nemo-evaluator-68b4bydh-eidjf
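If the pod produces a lot of output, you can filter for the relevant line, for example:
kubectl logs -n nemo-evaluation myrelease-nemo-evaluator-68b4bydh-eidjf | grep "Eval workflow submitted"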
Look for logs starting with Eval workflow submitted:eval-commands-, where eval-commands-{5 length id} is the workflow name submitted to Argo Workflows.
Check the workflow logs.
(Option A) Check the workflow logs by using the Argo UI.
Find the Argo service by running the following code. The service with argo-workflows-server in its name is the Argo service. For example, for release myrelease, the Argo service will be myrelease-argo-workflows-server.
kubectl get service -n nemo-evaluation
Port forward the UI by running the following code.
kubectl port-forward -n nemo-evaluation svc/{Argo Service} 2746:2746
Open https://localhost:2746 and click on the Argo Workflow ID that was identified in the first step. This provides a visual representation of the workflow.
(Option B) Check the workflow logs by using the Argo CLI. Run the following code. You can identify the pod that failed, along with the error message from these logs.
argo logs -n nemo-evaluation {Argo Workflow ID}
(Optional) If workflow logs do not resolve your issue, you can check the pod logs.
(Option A) Check the pod logs by using the Argo UI. Click on an individual pod in the visual representation of the workflow from the previous step. Then, click LOGS to view detailed logs of the pod.
(Option B) Check the pod logs by using the Argo CLI. Run the following command on the failed pod to view detailed logs.
kubectl logs -n nemo-evaluation {Argo Workflow Pod}
Skip validation checks#
When you launch an evaluation job, NeMo Evaluator performs availability checks (for example, checking if the dataset and files exist in NeMo Data Store).
To speed up job launch, or to bypass validation checks that are too strict for your use case, you can pass the query parameter skip_validation_checks when you launch the job.
Use the following code to create an evaluation job that skips validation checks.
curl -X 'POST' \
  "https://${EVALUATOR_SERVICE_URL}/v1/evaluation/jobs?skip_validation_checks=True" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"namespace": "my-organization",
"target": "<my-target-namespace/my-target-name>",
"config": "<my-config-namespace/my-config-name>"
}'
import requests

# Placeholder for your evaluator API URL, for example "https://evaluator.internal.your-company.com"
EVALUATOR_SERVICE_URL = "https://<your-evaluator-service-url>"

data = {
    "namespace": "my-organization",
    "target": "<my-target-namespace/my-target-name>",
    "config": "<my-config-namespace/my-config-name>"
}
endpoint = f"{EVALUATOR_SERVICE_URL}/v1/evaluation/jobs?skip_validation_checks=True"
response = requests.post(endpoint, json=data).json()