Troubleshooting NeMo Customizer#

Use this documentation to troubleshoot issues that can arise when you work with NVIDIA NeMo Customizer.

Status Details#

View the status details of the job with the following endpoint. The response field status_details.status_logs contains events for the job.

curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
    -H 'Accept: application/json' | jq
Example Response
{
  "id": "cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "created_at": "2025-03-18T17:03:56.789974",
  "updated_at": "2025-03-18T17:03:56.789983",
  "config": {},
  "dataset": "namespace/dataset-name",
  "hyperparameters": {},
  "output_model": "namespace/model-name@cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "status": "completed",
  "status_details": {
    "created_at": "2025-03-18T17:03:56.789304",
    "updated_at": "2025-03-18T17:11:28.812268",
    "steps_completed": 242,
    "epochs_completed": 1,
    "percentage_done": 100,
    "status_logs": [
      {
        "updated_at": "2025-03-18T17:03:56",
        "message": "created",
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "PVCCreated"
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "EntityHandler_0_Created"
      },
      {
        "updated_at": "2025-03-18T17:04:03",
        "message": "EntityHandler_0_Pending"
      },
      {
        "updated_at": "2025-03-18T17:04:04",
        "message": "EntityHandler_0_Running"
      },
      {
        "updated_at": "2025-03-18T17:04:49",
        "message": "EntityHandler_0_Completed"
      },
      {
        "updated_at": "2025-03-18T17:04:50",
        "message": "TrainingJobCreated"
      },
      {
        "updated_at": "2025-03-18T17:04:50",
        "message": "TrainingJobPending"
      },
      {
        "updated_at": "2025-03-18T17:06:57",
        "message": "TrainingJobRunning"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "TrainingJobCompleted"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "EntityHandler_1_Created"
      },
      {
        "updated_at": "2025-03-18T17:12:01",
        "message": "EntityHandler_1_Running"
      },
      {
        "updated_at": "2025-03-18T17:12:48",
        "message": "EntityHandler_1_Pending"
      },
      {
        "updated_at": "2025-03-18T17:12:48",
        "message": "EntityHandler_1_Completed"
      }
    ]
  }
}

A training job comprises three stages, which are captured in the status logs.

  1. The download dataset stage is logged with messages prefixed with EntityHandler_0.

  2. The train custom model stage is logged with messages prefixed with TrainingJob.

  3. The upload custom model stage is logged with messages prefixed with EntityHandler_1.

The stages are queued one after another, and each stage starts as the previous one completes. The job status running indicates that the job has been created and scheduled, and that the first stage is running or has already completed.

The second stage, which trains the custom model, requires GPUs and may wait in a queue until resources are available to be scheduled. This is indicated by TrainingJobPending. When the training stage is scheduled, the status logs update with a TrainingJobRunning entry.

The last stage uploads the trained model to NeMo Data Store. Once the overall job status is marked completed, the custom model is available in Entity Store for NIM to load.
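
To follow stage progression at a glance, you can reduce the response to the status log timeline with jq. This is a minimal sketch that assumes the response shape shown above.

# Print each status log entry as "<timestamp>  <message>".
curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
    -H 'Accept: application/json' \
    | jq -r '.status_details.status_logs[] | "\(.updated_at)  \(.message)"'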

Failed Job#

You can troubleshoot a failed job by viewing the status_logs for errors reported during training. Some errors can be resolved by modifying your job parameters, while others should be reported to the administrator managing the NeMo Customizer microservice. For errors that cannot be resolved, report a bug.

View error(s) reported for a failed job using the /v1/customization/jobs/{id} endpoint.

curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
    -H 'Accept: application/json' | jq
Example Response
{
  "id": "cust-YbmGLDpnZUPMGjqKZ2MaUy",
  "status": "failed",
  "status_details": {
    "created_at": "2025-03-18T17:03:56.789304",
    "updated_at": "2025-03-18T17:11:28.812268",
    "steps_completed": 242,
    "epochs_completed": 1,
    "percentage_done": 25,
    "status_logs": [
      {
        "updated_at": "2025-03-18T17:53:40",
        "message": "created"
      },
      {
        "updated_at": "2025-03-18T17:53:42",
        "message": "PVCCreated"
      },
      {
        "updated_at": "2025-03-18T17:53:42",
        "message": "EntityHandler_0_Created"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "EntityHandler_0_Pending"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "EntityHandler_0_Completed"
      },
      {
        "updated_at": "2025-03-18T17:54:13",
        "message": "TrainingJobCreated"
      },
      {
        "updated_at": "2025-03-18T17:54:18",
        "message": "TrainingJobRunning"
      },
      {
        "updated_at": "2025-03-18T19:03:41.757466",
        "message": "DataLoader worker (pid 2266) is killed by signal: Terminated.",
        "detail": "Traceback (most recent call last):\n  File \"/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py\", line 46, in _call_and_handle_interrupt\""
      },
      {
        "updated_at": "2025-03-18T19:08:42.358929",
        "message": "failed"
      }
    ]
  }
}
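
To surface the error from a longer status log, you can filter for entries that include a detail traceback. This is a sketch that assumes the failed-job response shape shown above; the root cause may also appear in the final message entries.

# Show only the status log entries that carry an error detail.
curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}" \
    -H 'Accept: application/json' \
    | jq '.status_details.status_logs[] | select(.detail != null)'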

Container Logs#

Container logs for the job are available after the job ends through the API endpoint /v1/customization/jobs/{id}/container-logs. The response includes events and logs for all stages of the job that have run. Container logs are useful for troubleshooting unexpected behaviors and errors, or for inspecting inconsistencies with training.

curl -X GET \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}/container-logs" \
  -H 'Accept: application/json' | jq
Example Response
{
  "error": null,
  "entries": {
    "nvidia.com/v1alpha1:NemoTrainingJob:cust-75fzuqp4bkvbzzgob7oelw": {},
    "batch/v1:job:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-0": {
      "object": {},
      "events": []
    },
    "core/v1:pod:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-0-xmhp5": {
      "object": {},
      "events": [],
      "logs": {
        "main": "\rFetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]Downloading"
      },
      "status": {
        "conditions": [],
        "container_statuses": []
      }
    },
   "batch/v1:job:cust-75fzuqp4bkvbzzgob7oelw-entity-handler-1": {
      "logs": {
        "main": "Start hashing 16 files.\nFinished hashing 16 files.\nUploading adapter_config.json"
      }
    },
    "batch.volcano.sh/v1alpha1:Job:cust-75fzuqp4bkvbzzgob7oelw-training-job": {},
    "core/v1:pod:cust-75fzuqp4bkvbzzgob7oelw-training-job-worker-0": {
      "logs": {
        "main": "<training logs>"
      }
    },
    "core/v1:pvc:cust-75fzuqp4bkvbzzgob7oelw": {
      "events": [
        {
          "count": 1,
          "first_seen": "2025-03-19T20:33:10+00:00",
          "last_seen": "2025-03-19T20:33:10+00:00",
          "message": "Successfully provisioned volume csi-fss-f1dc1844-a59c-4490-b6d2-d2ad43b0d80f",
          "reason": "ProvisioningSucceeded",
          "type": "Normal"
        },
        {
          "count": 2,
          "first_seen": "2025-03-19T20:32:53+00:00",
          "last_seen": "2025-03-19T20:33:03+00:00",
          "message": "Waiting for a volume to be created either by the external provisioner 'fss.csi.oraclecloud.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.",
          "reason": "ExternalProvisioning",
          "type": "Normal"
        },
        {
          "count": 1,
          "first_seen": "2025-03-19T20:32:53+00:00",
          "last_seen": "2025-03-19T20:32:53+00:00",
          "message": "External provisioner is provisioning volume for claim \"cust-75fzuqp4bkvbzzgob7oelw\"",
          "reason": "Provisioning",
          "type": "Normal"
        }
      ]
    }
  }
}
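
The response groups entries by Kubernetes resource. For example, to pull out just the training worker's log stream, you can filter the entries by key. This is a sketch that assumes the entry keys follow the naming pattern shown above.

# Print the main container logs for the training worker pod(s).
curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/{id}/container-logs" \
    -H 'Accept: application/json' \
    | jq -r '.entries | to_entries[] | select(.key | contains("training-job-worker")) | .value.logs.main // empty'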

Common Errors#

The following sections capture common errors and how to troubleshoot them for different components of the job and platform.

Dataset#

Error Status: 422
Error Message: {dataset-namespace}/{dataset-name} is not found or not available for customization.
Reason: The dataset specified for the training job is not found in Entity Store.
Troubleshoot: Verify the dataset exists in Entity Store, in addition to NeMo Data Store.

Error Status: 422
Error Message: hf://datasets/{dataset-namespace}/{dataset-name} is not found or not available for customization.
Reason: The dataset specified for the training job in Entity Store points to a files_url that is not found in NeMo Data Store, the backing data storage.
Troubleshoot: Verify the dataset exists in NeMo Data Store at the files_url location. Visit Create Dataset for instructions to upload dataset files, or update the files_url in Entity Store to the correct path in NeMo Data Store.

Error Status: 422
Error Message: Error validating dataset, unsupported data store expected 'hf://datasets/': {files_url}
Reason: The backend storage of files_url for the dataset in Entity Store is not supported.
Troubleshoot: Update the files_url to a valid NeMo Data Store dataset. The files_url must match the format hf://{namespace}/{name}@{version}.

Error Status: Job failed
Error Message: {filepath} has entry which is not valid json
Reason: The dataset contains invalid JSON line(s).
Troubleshoot: Fix the dataset to contain valid JSON lines, reupload the file, and create a new job.

Error Status: Job failed
Error Message: Parsed jsonl line does not conform to schema
Reason: The dataset does not conform to a supported schema for training.
Troubleshoot: Reformat the dataset to a supported schema, reupload the file, and create a new job. Visit the Format Training Dataset tutorial for more information.

Error Status: Job failed
Error Message: Batch size {batch_size} cannot be larger than number of samples for validation {num_samples}
Reason: The validation dataset is too small to evaluate validation loss with the specified batch size for the job.
Troubleshoot: Reduce the batch size according to the validation dataset size or increase the number of samples in the validation dataset.
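
Before re-uploading a dataset, you can catch invalid JSON lines locally. This is a minimal sketch using jq against a hypothetical training.jsonl file; jq parses the file as a stream of JSON values and exits non-zero on the first parse error.

# Report whether every line of the dataset file parses as JSON.
if jq -c . training.jsonl > /dev/null; then
    echo "All lines parsed as valid JSON"
else
    echo "Found an invalid JSON line; fix the file and reupload it before creating a new job"
fi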

Output Model#

Error Status: 409
Error Message: Failed to create model repo in Nemo Data Store: Error: model with name and version, {namespace}/{name}@{version} already exists
Reason: The specified output_model in the job request already exists.
Troubleshoot: Update the requested output_model to a new model name or update the version when using the same model name. If the version is omitted, the job ID is the default version used for the output model.
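
For example, a job request can name an explicitly versioned output model so that a re-run does not collide with an existing model repo. The following is a sketch only; the request fields mirror those echoed back in the job responses above, and all values are placeholders for your own config, dataset, and model names.

# Create a customization job with an explicit output_model version to avoid a 409 conflict.
curl -X POST \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs" \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
          "config": "namespace/config-name",
          "dataset": "namespace/dataset-name",
          "output_model": "namespace/model-name@v2",
          "hyperparameters": {}
        }' | jq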

Base Model#

Error Status: 422
Error Message: {model-name} is not configured for training
Reason: The base model is not available in Customizer to train a custom model with.
Troubleshoot: View the list of available models from the endpoint /v1/customization/configs and update the config value in the job request. If the model is expected to be available in Customizer, contact your administrator to enable the model for training from the deployment configuration.

Error Status: 409
Error Message: Model {model-name} is downloading to cache, try again later
Reason: The base model to train the custom model is downloading to the model cache and is not ready to train with.
Troubleshoot: Retry the job request at a later time. Retry after a couple of minutes for smaller models. It may take up to an hour or longer for large models.

Error Status: 500
Error Message: Model {model-name} errored during download, contact your administrator
Reason: The base model to train the custom model doesn't exist in the models cache due to a download error.
Troubleshoot: Check application logs for download errors and view the downloader pod logs. Verify model_uri is correct. For NGC authentication issues, ensure NGC_API_KEY is set for the NVIDIA Customizer deployment and ngcAPISecret/ngcAPISecretKey Helm values point to a valid NGC API key with model download permissions. Verify the downloader has write permissions to the model cache PVC. Restart the application to trigger a new download. For storage issues, increase your models cache capacity (default is 20 GB) to accommodate your enabled models.
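
To confirm which base models are enabled for training, list the customization configs with the endpoint referenced above.

# List the training configurations that Customizer currently exposes.
curl -X GET \
    "https://${CUSTOMIZER_HOSTNAME}/v1/customization/configs" \
    -H 'Accept: application/json' | jq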

Note

Currently, the only way to trigger new model download jobs (after they’ve failed) is to restart the NVIDIA Customizer deployment. Your administrator can run kubectl rollout restart or delete the NVIDIA Customizer API pod.
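
For example, assuming the Customizer API runs as a Kubernetes Deployment, the restart might look like the following. The deployment name, label selector, and namespace are placeholders for your installation.

# Restart the Customizer API deployment to trigger a new model download job.
kubectl rollout restart deployment/nemo-customizer -n nemo
# Alternatively, delete the API pod and let the Deployment recreate it.
kubectl delete pod -l app=nemo-customizer -n nemo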

Out Of Memory#

CUDA out of memory. Tried to allocate XXX GiB. GPU 0 has a total capacity of XXX GiB of which XXX GiB is free.

A job can fail with a CUDA out-of-memory error when the resources allocated for the job are insufficient for the model and training parameters. Consider reducing the micro batch size, which is a multiplier that affects the memory needed per GPU. If this occurs across a variety of jobs with reasonable hyperparameters and dataset sizes, the administrator managing the NeMo Customizer microservice can modify the configured resources, such as the number of GPUs and nodes.

Training Job Pending#

A job may be stuck in TrainingJobPending when the job specification cannot be satisfied by cluster resources. Container events and logs can help you troubleshoot, or you can troubleshoot directly in Kubernetes if you have access to the cluster.

PVC#

A job that fails to schedule with the message 0 nodes are available: pod has unbound immediate PersistentVolumeClaims can be caused by cluster resource constraints on Persistent Volume Claims (PVCs). Check the resource quota for PVCs in the namespace, which may limit the disk space per PVC or the number of PVCs that can be created.

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  8m51s (x9 over 49m)    default-scheduler  0/10 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling.
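
If you have access to the cluster, you can check whether a quota is constraining PVC creation. The namespace below is a placeholder for the namespace where Customizer jobs run.

# List job PVCs and any resource quotas that cap storage requests or PVC counts.
kubectl get pvc -n nemo
kubectl describe resourcequota -n nemo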

Steps to resolve resource constraints for jobs can include:

  • Modify the PVC storage size configured for each job to be within namespace limits with the Helm value customizerConfig.training.pvc.size.

  • Delete zombie jobs that have stopped, but are holding onto PVC, in order to release resources for new jobs.

    kubectl get nemotrainingjob
    kubectl delete nemotrainingjob <job_id(s)>
    

GPUs#

A job that fails to schedule with the message all nodes are unavailable: resource fit failed can indicate that the requested GPUs exceed what can be scheduled on the cluster.

Node-Selectors:  <none>
Tolerations:     app=customizer:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  37s   volcano  all nodes are unavailable: 16 node(s) resource fit failed.
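
If you have access to the cluster, you can compare the job's GPU request against what each node can allocate. This sketch assumes the nodes expose GPUs through the standard nvidia.com/gpu resource.

# Show allocatable GPUs per node to compare against the num_gpus requested by a training option.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'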

Troubleshooting options for GPU constraints include:

  • Verify that the configured num_gpus and num_nodes for each model in the Helm value customizerConfig.models.{model_name}.training_options[] align with your cluster.

    • For example, a training option requires 8 GPUs but the nodes in your cluster have a maximum of 4 GPUs available. Consider modifying the option to request 4 GPUs and 2 nodes so jobs can be scheduled.

    • If your cluster can never satisfy the requested GPUs, consider reducing the number of GPUs for a training option, or disable the option or model. Be mindful when reducing the number of GPUs for a training option, as it may cause jobs to fail with CUDA out of memory errors.

  • Consider adding a taint to a set of GPU nodes to reserve resources for Customizer jobs. Configure the matching toleration with the Helm value customizerConfig.training.tolerations.

NIM#

An Unsupported model error with status code 400 when chatting with or prompting the model occurs when NIM does not have a reference to the custom model.

curl https://${NIM_HOSTNAME}/v1/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "model-namespace/custom-model@version",
        "prompt": "hello world"
    }' | jq
{
  "message": "Unsupported model",
  "request_id": "44552dd4-1a51-9495-8ccc-967942068bee"
}

NIM automatically deploys LoRA models trained by Customizer. This error can occur if the model has not loaded yet; retry after a few seconds to allow the LoRA model to become ready. Use the /v1/models endpoint to check whether NIM has loaded the model.
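
For example, list the model IDs that NIM currently serves; the custom model should appear once the LoRA adapter is loaded. The jq filter assumes the OpenAI-compatible response shape with a data array of model entries.

# List the model IDs loaded by NIM.
curl -X GET \
    "https://${NIM_HOSTNAME}/v1/models" \
    -H 'Accept: application/json' | jq -r '.data[].id'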

If the Unsupported model error persists:

  • Verify that the base model specified for the custom model in Entity Store has a matching NIM deployed.

  • Verify that the model name in the inference request to NIM includes the namespace and that it matches the namespace used to create the model.