Troubleshooting NeMo Customizer#

Use this documentation to troubleshoot issues that can arise when you work with NVIDIA NeMo Customizer.

Status Details#

View the status details of the job with the following endpoint. The response field status_details.status_logs contains events for the job.

curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
    -H 'Accept: application/json' | jq
Example Response
{
  "id": "cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "created_at": "2025-03-18T17:03:56.789974",
  "updated_at": "2025-03-18T17:03:56.789983",
  "config": {},
  "dataset": "namespace/dataset-name",
  "hyperparameters": {},
  "output_model": "namespace/model-name@cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "status": "completed",
  "status_details": {
    "created_at": "2025-03-18T17:03:56.789304",
    "updated_at": "2025-03-18T17:11:28.812268",
    "steps_completed": 242,
    "epochs_completed": 1,
    "percentage_done": 100,
    "status_logs": [
      {
        "updated_at": "2025-03-18T17:03:56",
        "message": "created",
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "PVCCreated"
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "EntityHandler_0_Created"
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "EntityHandler_0_Pending"
      },
      {
        "updated_at": "2025-03-18T17:04:04",
        "message": "EntityHandler_0_Running"
      },
      {
        "updated_at": "2025-03-18T17:04:49",
        "message": "EntityHandler_0_Completed"
      },
      {
        "updated_at": "2025-03-18T17:04:50",
        "message": "TrainingJobCreated"
      },
      {
        "updated_at": "2025-03-18T17:04:50",
        "message": "TrainingJobPending"
      },
      {
        "updated_at": "2025-03-18T17:06:57",
        "message": "TrainingJobRunning"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "TrainingJobCompleted"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "EntityHandler_1_Created"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "EntityHandler_1_Running"
      },
      {
        "updated_at": "2025-03-18T17:12:48",
        "message": "EntityHandler_1_Pending"
      },
      {
        "updated_at": "2025-03-18T17:12:48",
        "message": "EntityHandler_1_Completed"
      }
    ]
  }
}

A training job comprises three stages, which are captured in the status logs.

  1. The download dataset stage is logged with messages prefixed with EntityHandler_0.

  2. The train custom model stage is logged with messages prefixed with TrainingJob.

  3. The upload custom model stage is logged with messages prefixed with EntityHandler_1.

The stages are queued one after another, and each stage starts as the previous one completes. The job status running indicates that the job has been created and scheduled, and that the first stage is running or has already completed.

The second stage, which trains the custom model, requires GPUs and may wait in a queue until resources are available to be scheduled. This is indicated by TrainingJobPending. When the training stage is scheduled, the status logs update with a TrainingJobRunning entry.

The last stage uploads the trained model to NeMo Data Store. Once the overall job status is marked completed, the custom model is available in Entity Store for NIM to load.
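
To follow stage progression at a glance, you can reduce the response to the status log timeline with jq. This is a minimal sketch that assumes the response shape shown above.

# Print each status log entry as "<timestamp>  <message>".
curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
    -H 'Accept: application/json' \
    | jq -r '.status_details.status_logs[] | "\(.updated_at)  \(.message)"'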

Failed Job#

You can troubleshoot a failed job by viewing the status_logs for errors reported during training. Some errors can be resolved by modifying your job parameters, while others should be reported to the administrator managing the NeMo Customizer microservice. For errors that cannot be resolved, report a bug.

View error(s) reported for a failed job using the /v1/customization/jobs/{id} endpoint.

curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
    -H 'Accept: application/json' | jq
Example Response
{
  "id": "cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "status": "failed",
  "status_details": {
    "created_at": "2025-03-18T17:03:56.789304",
    "updated_at": "2025-03-18T17:11:28.812268",
    "steps_completed": 242,
    "epochs_completed": 1,
    "percentage_done": 25,
    "status_logs": [
      {
        "updated_at": "2025-03-18T17:53:40",
        "message": "created"
      },
      {
        "updated_at": "2025-03-18T17:53:42",
        "message": "PVCCreated"
      },
      {
        "updated_at": "2025-03-18T17:53:42",
        "message": "EntityHandler_0_Created"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "EntityHandler_0_Pending"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "EntityHandler_0_Completed"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "TrainingJobCreated"
      },
      {
        "updated_at": "2025-03-18T17:54:18",
        "message": "TrainingJobRunning"
      },
      {
        "updated_at": "2025-03-18T19:03:41.757466",
        "message": "DataLoader worker (pid 2266) is killed by signal: Terminated.",
        "detail": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py\", line 46, in _call_and_handle_interrupt\""
      },
      {
        "updated_at": "2025-03-18T19:08:42.358929",
        "message": "failed"
      }
    ]
  }
}
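
To surface the error from a longer status log, you can filter for entries that include a detail traceback. This is a sketch that assumes the failed-job response shape shown above; the root cause may also appear in the final message entries.

# Show only the status log entries that carry an error detail.
curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
    -H 'Accept: application/json' \
    | jq '.status_details.status_logs[] | select(.detail != null)'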

Container Logs#

Container logs for the job are available after the job ends through the API endpoint /v1/customization/jobs/{id}/container-logs. The response includes events and logs for all stages of the job that have run. Container logs are useful for troubleshooting unexpected behaviors and errors, or for inspecting inconsistencies with training.

curl -X GET \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}/container-logs" \
  -H 'Accept: application/json' | jq
Example Response
{
  "error": null,
  "entries": {
    "nvidia.com/v1alpha1:NemoTrainingJob:cust-75fzuqp4bkvbzzgob7oelw": {},
    "batch/v1:job:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-0": {
      "object": {},
      "events": []
    },
    "core/v1:pod:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-0-xmhp5": {
      "object": {},
      "events": [],
      "logs": {
        "main": "\rFetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]Downloading"
      },
      "status": {
        "conditions": [],
        "container_statuses": []
      }
    },
   "batch/v1:job:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-1": {
      "logs": {
        "main": "Start hashing 16 files.\nFinished hashing 16 files.\nUploading adapter_config.json"
      }
    },
    "batch.volcano.sh/v1alpha1:Job:cust-75fzuqp4bkvbzzgob7oelw-training-job": {},
    "core/v1:pod:cust-75fzuqp4bkvbzzgob7oelw-training-job-worker-0": {
      "logs": {
        "main": "<training logs>"
      }
    },
    "core/v1:pvc:cust-75fzuqp4bkvbzzgob7oelw": {
      "events": [
        {
          "count": 1,
          "first_seen": "2025-03-19T20:33:10+00:00",
          "last_seen": "2025-03-19T20:33:10+00:00",
          "message": "Successfully provisioned volume csi-fss-f1dc1844-a59c-4490-b6d2-d2ad43b0d80f",
          "reason": "ProvisioningSucceeded",
          "type": "Normal"
        },
        {
          "count": 2,
          "first_seen": "2025-03-19T20:32:53+00:00",
          "last_seen": "2025-03-19T20:33:03+00:00",
          "message": "Waiting for a volume to be created either by the external provisioner 'fss.csi.oraclecloud.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.",
          "reason": "ExternalProvisioning",
          "type": "Normal"
        },
        {
          "count": 1,
          "first_seen": "2025-03-19T20:32:53+00:00",
          "last_seen": "2025-03-19T20:32:53+00:00",
          "message": "External provisioner is provisioning volume for claim \"cust-75fzuqp4bkvbzzgob7oelw\"",
          "reason": "Provisioning",
          "type": "Normal"
        }
      ]
    }
  }
}
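
The response groups entries by Kubernetes resource. For example, to pull out just the training worker's log stream, you can filter the entries by key. This is a sketch that assumes the entry keys follow the naming pattern shown above.

# Print the main container logs for the training worker pod(s).
curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}/container-logs" \
    -H 'Accept: application/json' \
    | jq -r '.entries | to_entries[] | select(.key | contains("training-job-worker")) | .value.logs.main // empty'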

Common Errors#

The following sections capture common errors and how to troubleshoot them for different components of the job and platform.

Dataset#

Error Status: 422
Error Message: {dataset-namespace}/{dataset-name} is not found or not available for customization.
Reason: The dataset specified for the training job is not found in Entity Store.
Troubleshoot: Verify the dataset exists in Entity Store, in addition to NeMo Data Store.

Error Status: 422
Error Message: hf://datasets/{dataset-namespace}/{dataset-name} is not found or not available for customization.
Reason: The dataset specified for the training job in Entity Store points to a files_url that is not found in NeMo Data Store, the backing data storage.
Troubleshoot: Verify the dataset exists in NeMo Data Store at the files_url location. Visit Create Dataset for instructions to upload dataset files, or update the files_url in Entity Store to the correct path in NeMo Data Store.

Error Status: 422
Error Message: Error validating dataset, unsupported data store expected 'hf://datasets/': {files_url}
Reason: The backend storage of files_url for the dataset in Entity Store is not supported.
Troubleshoot: Update the files_url to a valid NeMo Data Store dataset. The files_url must match the format hf://{namespace}/{name}@{version}.

Error Status: Job failed
Error Message: {filepath} has entry which is not valid json
Reason: The dataset contains invalid JSON line(s).
Troubleshoot: Fix the dataset to contain valid JSON lines, reupload the file, and create a new job.

Error Status: Job failed
Error Message: Parsed jsonl line does not conform to schema
Reason: The dataset does not conform to a supported schema for training.
Troubleshoot: Reformat the dataset to a supported schema, reupload the file, and create a new job. Visit the Format Training Dataset tutorial for more information.

Error Status: Job failed
Error Message: Batch size {batch_size} cannot be larger than number of samples for validation {num_samples}
Reason: The validation dataset is too small to evaluate validation loss with the specified batch size for the job.
Troubleshoot: Reduce the batch size according to the validation dataset size or increase the number of samples in the validation dataset.
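
Before re-uploading a dataset, you can catch invalid JSON lines locally. This is a minimal sketch using jq against a hypothetical training.jsonl file; jq parses the file as a stream of JSON values and exits non-zero on the first parse error.

# Report whether every line of the dataset file parses as JSON.
if jq -c . training.jsonl > /dev/null; then
    echo "All lines parsed as valid JSON"
else
    echo "Found an invalid JSON line; fix the file and reupload it before creating a new job"
fi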

Output Model#

Error Status: 409
Error Message: Failed to create model repo in Nemo Data Store: Error: model with name and version, {namespace}/{name}@{version} already exists
Reason: The specified output_model in the job request already exists.
Troubleshoot: Update the requested output_model to a new model name or update the version when using the same model name. If the version is omitted, the job ID is the default version used for the output model.
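
For example, a job request can name an explicitly versioned output model so that a re-run does not collide with an existing model repo. The following is a sketch only; the request fields mirror those echoed back in the job responses above, and all values are placeholders for your own config, dataset, and model names.

# Create a customization job with an explicit output_model version to avoid a 409 conflict.
curl -X POST \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs" \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
          "config": "namespace/config-name",
          "dataset": "namespace/dataset-name",
          "output_model": "namespace/model-name@v2",
          "hyperparameters": {}
        }' | jq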

Base Model#

Error Status: 422
Error Message: {model-name} is not configured for training
Reason: The base model is not available in Customizer to train a custom model with.
Troubleshoot: View the list of available models from the endpoint /v1/customization/configs and update the config value in the job request. If the model is expected to be available in Customizer, contact your administrator to enable the model for training from the deployment configuration.

Error Status: 409
Error Message: Model {model-name} is downloading to cache, try again later
Reason: The base model to train the custom model is downloading to the model cache and is not ready to train with.
Troubleshoot: Retry the job request at a later time. Retry after a couple of minutes for smaller models. It may take up to an hour or longer for large models.

Error Status: 500
Error Message: Model {model-name} errored during download, contact your administrator
Reason: The base model to train the custom model doesn't exist in the models cache due to a download error.
Troubleshoot: Check application logs for download errors and view the downloader pod logs. Verify model_uri is correct. For NGC authentication issues, ensure NGC_API_KEY is set for the NVIDIA Customizer deployment and ngcAPISecret/ngcAPISecretKey Helm values point to a valid NGC API key with model download permissions. Verify the downloader has write permissions to the model cache PVC. Restart the application to trigger a new download. For storage issues, increase your models cache capacity (default is 20 GB) to accommodate your enabled models.
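
To confirm which base models are enabled for training, list the customization configs with the endpoint referenced above.

# List the training configurations that Customizer currently exposes.
curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/configs" \
    -H 'Accept: application/json' | jq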

Note

Currently, the only way to trigger new model download jobs (after they’ve failed) is to restart the NVIDIA Customizer deployment. Your administrator can run kubectl rollout restart or delete the NVIDIA Customizer API pod.
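
For example, assuming the Customizer API runs as a Kubernetes Deployment, the restart might look like the following. The deployment name, label selector, and namespace are placeholders for your installation.

# Restart the Customizer API deployment to trigger a new model download job.
kubectl rollout restart deployment/nemo-customizer -n nemo
# Alternatively, delete the API pod and let the Deployment recreate it.
kubectl delete pod -l app=nemo-customizer -n nemo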

Out Of Memory#

CUDA out of memory. Tried to allocate XXX GiB. GPU 0 has a total capacity of XXX GiB of which XXX GiB is free.

A job can fail with a CUDA out-of-memory error when the resources allocated for the job are insufficient for the model and training parameters. Consider reducing the micro batch size, which is a multiplier that affects the memory needed per GPU. If this occurs across a variety of jobs with reasonable hyperparameters and dataset sizes, the administrator managing the NeMo Customizer microservice can modify the configured resources, such as the number of GPUs and nodes.

Training Job Pending#

A job may be stuck in TrainingJobPending when the job specification cannot be satisfied by cluster resources. Container events and logs can help you troubleshoot, or you can troubleshoot directly in Kubernetes if you have access to the cluster.

PVC#

A job that fails to schedule with the message 0 nodes are available: pod has unbound immediate PersistentVolumeClaims can be caused by cluster resource constraints on Persistent Volume Claims (PVCs). Check the resource quota for PVCs in the namespace, which may limit the disk space per PVC or the number of PVCs that can be created.

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  8m51s (x9 over 49m)    default-scheduler  0/10 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling.
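
If you have access to the cluster, you can check whether a quota is constraining PVC creation. The namespace below is a placeholder for the namespace where Customizer jobs run.

# List job PVCs and any resource quotas that cap storage requests or PVC counts.
kubectl get pvc -n nemo
kubectl describe resourcequota -n nemo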

Steps to resolve resource constraints for jobs can include:

  • Modify the PVC storage size configured for each job to be within namespace limits with the Helm value customizerConfig.training.pvc.size.

  • Delete zombie jobs that have stopped, but are holding onto PVC, in order to release resources for new jobs.

    kubectl get nemotrainingjob
    kubectl delete nemotrainingjob <job_id(s)>
    

GPUs#

A job that fails to schedule with the message all nodes are unavailable: resource fit failed can indicate that the requested GPUs exceed what can be scheduled on the cluster.

Node-Selectors:  <none>
Tolerations:     app=customizer:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  37s   volcano  all nodes are unavailable: 16 node(s) resource fit failed.
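
If you have access to the cluster, you can compare the job's GPU request against what each node can allocate. This sketch assumes the nodes expose GPUs through the standard nvidia.com/gpu resource.

# Show allocatable GPUs per node to compare against the num_gpus requested by a training option.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'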

Troubleshooting options for GPU constraints include:

  • Verify that the configured num_gpus and num_nodes for each model in the Helm value customizerConfig.models.{model_name}.training_options[] align with your cluster.

    • For example, a training option requires 8 GPUs but the nodes in your cluster have a maximum of 4 GPUs available. Consider modifying the option to request 4 GPUs and 2 nodes so jobs can be scheduled.

    • If your cluster can never satisfy the requested GPUs, consider reducing the number of GPUs for a training option, or disable the option or model. Be mindful when reducing the number of GPUs for a training option, as it may cause jobs to fail with CUDA out of memory errors.

  • Consider adding a taint to a set of GPU nodes to reserve resources for Customizer jobs. Configure the matching toleration with the Helm value customizerConfig.training.tolerations.

NIM#

An Unsupported model error with status code 400 when chatting with or prompting the model occurs when NIM does not have a reference to the custom model.

curl https://${NIM_HOSTNAME}/v1/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "model-namespace/custom-model@version",
        "prompt": "hello world"
    }' | jq
{
  "message": "Unsupported model",
  "request_id": "44552dd4-1a51-9495-8ccc-967942068bee"
}

NIM automatically deploys LoRA models trained by Customizer. This error can occur if the model has not loaded yet; retry after a few seconds to allow the LoRA model to become ready. Use the /v1/models endpoint to check whether NIM has loaded the model.
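
For example, list the model IDs that NIM currently serves; the custom model should appear once the LoRA adapter is loaded. The jq filter assumes the OpenAI-compatible response shape with a data array of model entries.

# List the model IDs loaded by NIM.
curl -X GET \
    "https://${NIM_HOSTNAME}/v1/models" \
    -H 'Accept: application/json' | jq -r '.data[].id'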

If the Unsupported model error persists:

  • Verify that the base model specified for the custom model in Entity Store has a matching NIM deployed.

  • Verify that the model name in the inference request to NIM includes the namespace and that it matches the namespace used to create the model.