Troubleshooting NeMo Customizer#
Use this documentation to troubleshoot issues that can arise when you work with NVIDIA NeMo Customizer.
Status Details#
View the status details of the job with the following endpoint. The response field status_details.status_logs
contains events for the job.
curl -X GET \
"https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
-H 'Accept: application/json' | jq
Example Response
{
  "id": "cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "created_at": "2025-03-18T17:03:56.789974",
  "updated_at": "2025-03-18T17:03:56.789983",
  "config": {},
  "dataset": "namespace/dataset-name",
  "hyperparameters": {},
  "output_model": "namespace/model-name@cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "status": "completed",
  "status_details": {
    "created_at": "2025-03-18T17:03:56.789304",
    "updated_at": "2025-03-18T17:11:28.812268",
    "steps_completed": 242,
    "epochs_completed": 1,
    "percentage_done": 100,
    "status_logs": [
      {
        "updated_at": "2025-03-18T17:03:56",
        "message": "created"
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "PVCCreated"
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "EntityHandler_0_Created"
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "EntityHandler_0_Pending"
      },
      {
        "updated_at": "2025-03-18T17:04:04",
        "message": "EntityHandler_0_Running"
      },
      {
        "updated_at": "2025-03-18T17:04:49",
        "message": "EntityHandler_0_Completed"
      },
      {
        "updated_at": "2025-03-18T17:04:50",
        "message": "TrainingJobCreated"
      },
      {
        "updated_at": "2025-03-18T17:04:50",
        "message": "TrainingJobPending"
      },
      {
        "updated_at": "2025-03-18T17:06:57",
        "message": "TrainingJobRunning"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "TrainingJobCompleted"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "EntityHandler_1_Created"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "EntityHandler_1_Running"
      },
      {
        "updated_at": "2025-03-18T17:12:48",
        "message": "EntityHandler_1_Pending"
      },
      {
        "updated_at": "2025-03-18T17:12:48",
        "message": "EntityHandler_1_Completed"
      }
    ]
  }
}
A training job comprises three stages, and they are captured in the status logs:

- The download dataset stage is logged with messages prefixed with EntityHandler_0.
- The train custom model stage is logged with messages prefixed with TrainingJob.
- The upload custom model stage is logged with messages prefixed with EntityHandler_1.
A training job is broken up into stages that are queued one after another as each stage completes. The job status running indicates that the job has been created and scheduled, and that the first stage is running or has already completed.
The second stage, which trains the custom model, requires GPUs and may wait in a queue until resources are available to be scheduled. This is indicated with TrainingJobPending. When the training stage is scheduled, the status logs update with a TrainingJobRunning entry.
The last stage uploads the trained model to NeMo Data Store. The custom model is available in Entity Store for NIM to load once the overall job status is marked completed.
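To follow the stage transitions without reading the full response, you can extract just the status log entries with jq. This is a minimal sketch based on the response shape shown above; replace {id} with your job ID.
curl -X GET \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
  -H 'Accept: application/json' | jq -r '.status_details.status_logs[] | "\(.updated_at)  \(.message)"'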
Failed Job#
You can troubleshoot a failed job by viewing the status_logs
for reported errors that occurred during training. Some errors can be resolved by modifying your job parameters; others should be reported to the administrator managing the NeMo Customizer microservice. For errors that cannot be resolved, report a bug.
View error(s) reported for a failed job using the /v1/customization/jobs/{id}
endpoint.
curl -X GET \
"https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
-H 'Accept: application/json' | jq
Example Response
{
  "id": "cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "status": "failed",
  "status_details": {
    "created_at": "2025-03-18T17:03:56.789304",
    "updated_at": "2025-03-18T17:11:28.812268",
    "steps_completed": 242,
    "epochs_completed": 1,
    "percentage_done": 25,
    "status_logs": [
      {
        "updated_at": "2025-03-18T17:53:40",
        "message": "created"
      },
      {
        "updated_at": "2025-03-18T17:53:42",
        "message": "PVCCreated"
      },
      {
        "updated_at": "2025-03-18T17:53:42",
        "message": "EntityHandler_0_Created"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "EntityHandler_0_Pending"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "EntityHandler_0_Completed"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "TrainingJobCreated"
      },
      {
        "updated_at": "2025-03-18T17:54:18",
        "message": "TrainingJobRunning"
      },
      {
        "updated_at": "2025-03-18T19:03:41.757466",
        "message": "DataLoader worker (pid 2266) is killed by signal: Terminated.",
        "detail": "Traceback (most recent call last):\n File \"/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py\", line 46, in _call_and_handle_interrupt\""
      },
      {
        "updated_at": "2025-03-18T19:08:42.358929",
        "message": "failed"
      }
    ]
  }
}
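To surface only the error entries for a failed job, filter for status log entries that include a detail field. This is a minimal sketch based on the response shape shown above; replace {id} with your job ID.
curl -X GET \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
  -H 'Accept: application/json' | jq '.status_details.status_logs[] | select(.detail != null)'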
Container Logs#
Container logs for the job are available after the job ends through the API endpoint /v1/customization/jobs/{id}/container-logs
. The response includes events and logs for all stages of the job that have run. Container logs are useful for troubleshooting unexpected behaviors and errors, or for inspecting inconsistencies with training.
curl -X GET \
"https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}/container-logs" \
-H 'Accept: application/json' | jq
Example Response
{
  "error": null,
  "entries": {
    "nvidia.com/v1alpha1:NemoTrainingJob:cust-75fzuqp4bkvbzzgob7oelw": {},
    "batch/v1:job:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-0": {
      "object": {},
      "events": []
    },
    "core/v1:pod:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-0-xmhp5": {
      "object": {},
      "events": [],
      "logs": {
        "main": "\rFetching 4 files: 0%| | 0/4 [00:00<?, ?it/s]Downloading"
      },
      "status": {
        "conditions": [],
        "container_statuses": []
      }
    },
    "batch/v1:job:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-1": {
      "logs": {
        "main": "Start hashing 16 files.\nFinished hashing 16 files.\nUploading adapter_config.json"
      }
    },
    "batch.volcano.sh/v1alpha1:Job:cust-75fzuqp4bkvbzzgob7oelw-training-job": {},
    "core/v1:pod:cust-75fzuqp4bkvbzzgob7oelw-training-job-worker-0": {
      "logs": {
        "main": "<training logs>"
      }
    },
    "core/v1:pvc:cust-75fzuqp4bkvbzzgob7oelw": {
      "events": [
        {
          "count": 1,
          "first_seen": "2025-03-19T20:33:10+00:00",
          "last_seen": "2025-03-19T20:33:10+00:00",
          "message": "Successfully provisioned volume csi-fss-f1dc1844-a59c-4490-b6d2-d2ad43b0d80f",
          "reason": "ProvisioningSucceeded",
          "type": "Normal"
        },
        {
          "count": 2,
          "first_seen": "2025-03-19T20:32:53+00:00",
          "last_seen": "2025-03-19T20:33:03+00:00",
          "message": "Waiting for a volume to be created either by the external provisioner 'fss.csi.oraclecloud.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.",
          "reason": "ExternalProvisioning",
          "type": "Normal"
        },
        {
          "count": 1,
          "first_seen": "2025-03-19T20:32:53+00:00",
          "last_seen": "2025-03-19T20:32:53+00:00",
          "message": "External provisioner is provisioning volume for claim \"cust-75fzuqp4bkvbzzgob7oelw\"",
          "reason": "Provisioning",
          "type": "Normal"
        }
      ]
    }
  }
}
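To pull only the training worker logs out of the response, filter the entries map by key. This is a minimal sketch based on the entry keys shown in the example above; replace {id} with your job ID.
curl -X GET \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}/container-logs" \
  -H 'Accept: application/json' | jq -r '.entries | to_entries[] | select(.key | contains("training-job-worker")) | .value.logs.main'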
Common Errors#
The following tables capture common errors and how to troubleshoot them for different components of the job and platform.
Dataset#
| Error Status | Error Message | Reason | Troubleshoot |
|---|---|---|---|
| 422 | | The dataset specified for the training job is not found in Entity Store. | Verify the dataset exists in Entity Store, in addition to NeMo Data Store. |
| 422 | | The dataset specified for the training job in Entity Store points to | Verify the dataset exists in NeMo Data Store at the |
| 422 | | The backend storage of | Update the |
| Job failed | | The dataset contains invalid JSON line(s). | Fix the dataset to contain valid JSON lines, reupload the file, and create a new job. |
| Job failed | | The dataset does not conform to a supported schema for training. | Reformat the dataset to a supported schema, reupload the file, and create a new job. Visit the Format Training Dataset tutorial for more information. |
| Job failed | | The validation dataset is too small to evaluate validation loss with the specified batch size for the job. | Reduce the batch size according to the validation dataset size or increase the number of samples in the validation dataset. |
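For the invalid JSON line error, you can check a dataset file locally before re-uploading it. This is a minimal sketch using jq; the file path is a placeholder.
# jq exits non-zero and reports the offending line when it hits invalid JSON; a clean run prints nothing.
jq -c . path/to/training.jsonl > /dev/null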
Output Model#
| Error Status | Error Message | Reason | Troubleshoot |
|---|---|---|---|
| 409 | | The specified | Update the requested |
Base Model#
| Error Status | Error Message | Reason | Troubleshoot |
|---|---|---|---|
| 422 | | The base model is not available in Customizer to train a custom model with. | View the list of available models from the endpoint |
| 409 | | The base model to train the custom model is downloading to the model cache and is not ready to train with. | Retry the job request at a later time. Retry after a couple of minutes for smaller models. It may take up to an hour or longer for large models. |
| 500 | | The base model to train the custom model doesn’t exist in the models cache due to a download error. | Check application logs for download errors and view the downloader pod logs. Verify |
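To see which base models and training options your deployment supports, you can query the customization configurations. The exact endpoint path can vary by release, so treat /v1/customization/configs as an assumption and confirm it against the API reference for your version.
curl -X GET \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/configs" \
  -H 'Accept: application/json' | jq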
Note
Currently, the only way to trigger new model download jobs after they have failed is to restart the NeMo Customizer deployment. Your administrator can run kubectl rollout restart
or delete the NeMo Customizer API pod.
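A minimal sketch of the restart; the deployment name and namespace are placeholders for your installation.
kubectl rollout restart deployment/<customizer-api-deployment> -n <customizer-namespace>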
Out Of Memory#
CUDA out of memory. Tried to allocate XXX GiB. GPU 0 has a total capacity of XXX GiB of which XXX GiB is free.
A job can fail with a CUDA out of memory error when the resources allocated for the job are insufficient for the model and training parameters. If this occurs across a variety of jobs with reasonable hyperparameters and dataset sizes, the administrator managing the NeMo Customizer microservice can modify the configured resources, such as the number of GPUs and nodes. Also consider reducing the micro batch size, which is a multiplier that affects the memory needed per GPU.
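One way to reduce memory pressure for a single job is to lower the batch size when creating it. This is a hedged sketch of a job creation request: the config value and the batch_size hyperparameter name are shown for illustration, so confirm the exact fields against the API reference for your version.
curl -X POST \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs" \
  -H 'Content-Type: application/json' \
  -d '{
    "config": "namespace/base-model-config",
    "dataset": "namespace/dataset-name",
    "hyperparameters": {
      "batch_size": 8
    }
  }' | jq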
Training Job Pending#
A job may be stuck in TrainingJobPending when the job specification cannot be satisfied by cluster resources. Container events and logs can help you troubleshoot, or you can troubleshoot directly with Kubernetes if you have access to the Kubernetes cluster.
PVC#
A job failing to schedule with the message 0 nodes are available: pod has unbound immediate PersistentVolumeClaims can be caused by cluster resource constraints for Persistent Volume Claims (PVCs). Check the resource quota for PVCs in the namespace, which may limit the disk space per PVC or the number of PVCs that can be created.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 8m51s (x9 over 49m) default-scheduler 0/10 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling.
Steps to resolve resource constraints for jobs can include:
- Modify the PVC storage size configured for each job to be within namespace limits with the Helm value customizerConfig.training.pvc.size.
- Delete zombie jobs that have stopped but are holding onto a PVC, to release resources for new jobs:

kubectl get nemotrainingjob
kubectl delete nemotrainingjob <job_id(s)>
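To check whether namespace quota or the claim itself is blocking scheduling, you can inspect both directly in Kubernetes. The namespace is a placeholder; the PVC is named after the job ID, as shown in the container logs example above.
kubectl get resourcequota -n <customizer-namespace>
kubectl describe pvc cust-<job-id> -n <customizer-namespace>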
GPUs#
A job failing to schedule with the message all nodes are unavailable: resource fit failed
can indicate that the GPUs requested exceed what can be scheduled on the cluster.
Node-Selectors: <none>
Tolerations: app=customizer:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 37s volcano all nodes are unavailable: 16 node(s) resource fit failed.
Troubleshooting options for GPU constraints include:

- Inspect that the num_gpus and num_nodes configured for each model within the Helm value customizerConfig.models.{model_name}.training_options[] align with your cluster (see the node check after this list). For example, if a training option requires 8 GPUs but the nodes in your cluster have a maximum of 4 GPUs available, consider modifying the option to request 4 GPUs and 2 nodes so jobs can be scheduled.
- If your cluster can never satisfy the requested GPUs, consider reducing the number of GPUs for a training option, or disable the option or model. Be mindful when reducing the number of GPUs for a training option: it may cause jobs to error with CUDA out of memory.
- Consider adding a taint to a set of GPU nodes to reserve resources for Customizer jobs. Configure the matching toleration with the Helm value customizerConfig.training.tolerations.
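A quick way to confirm how many GPUs each node exposes for scheduling, using standard kubectl output:
# Shows nvidia.com/gpu capacity, allocatable, and allocated amounts per node; compare against num_gpus in the training options.
kubectl describe nodes | grep -E '^Name:|nvidia.com/gpu'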
NIM#
An Unsupported model error with status code 400 when chatting with or prompting the model occurs when NIM does not have a reference to the custom model.
curl https://${NIM_HOSTNAME}/v1/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "model-namespace/custom-model@version",
"prompt": "hello world"
}' | jq
{
  "message": "Unsupported model",
  "request_id": "44552dd4-1a51-9495-8ccc-967942068bee"
}
NIM automatically deploys LoRA models trained by Customizer. This error can occur if the model has not loaded yet; retry after a couple of seconds for the LoRA model to be ready. Use the /v1/models endpoint to check whether NIM has loaded the model.
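For example, you can list the models NIM currently serves and check that your custom model appears. This is a minimal sketch; the response follows the OpenAI-compatible models listing.
curl -X GET \
  "https://${NIM_HOSTNAME}/v1/models" \
  -H 'Accept: application/json' | jq -r '.data[].id'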
If the Unsupported model error persists:

- Verify that the base model specified for the custom model in Entity Store has a matching NIM deployed (see the sketch after this list).
- Verify that the model name in the inference request to NIM includes the namespace and that it corresponds to the namespace used to create the model.
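To check the first point, you can read the custom model's entry from Entity Store and confirm its base model. This is a hedged sketch: the ENTITY_STORE_HOSTNAME variable, the /v1/models/{namespace}/{model-name} path, and the base_model field are assumptions about your deployment and version.
curl -X GET \
  "https://${ENTITY_STORE_HOSTNAME}/v1/models/model-namespace/custom-model" \
  -H 'Accept: application/json' | jq '.base_model'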