Troubleshooting NeMo Customizer

Job fails during model download:

  • Verify the HuggingFace token secret is configured correctly

  • Accept the model’s license on the HuggingFace model page

  • Check job status: sdk.customization.jobs.retrieve(name=job.name, workspace="default")
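The status check above can be wrapped in a small polling loop so you see the failure as soon as it happens. A minimal sketch: the retrieval call is the `sdk.customization.jobs.retrieve` shown above, but the `status` attribute and the set of terminal status values are assumptions — adjust them to match what your job objects actually return:

```python
import time

def wait_for_job(retrieve_fn, name, workspace="default", poll_seconds=30,
                 terminal=frozenset({"completed", "failed", "cancelled"})):
    """Poll a customization job until it reaches a terminal status.

    retrieve_fn is expected to behave like sdk.customization.jobs.retrieve
    and return an object with a `status` attribute (assumed shape).
    """
    while True:
        job = retrieve_fn(name=name, workspace=workspace)
        if job.status in terminal:
            return job
        time.sleep(poll_seconds)
```

Usage would look like `final = wait_for_job(sdk.customization.jobs.retrieve, job.name)`; inspect `final` for the download error details.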

Job fails with a disk-full error, or retrieving its logs returns a 500 error:

  • The platform’s shared persistent volume is likely full. Customization jobs require significant disk space: ~3× model size for full SFT, ~1.5× for LoRA. If you are also deploying the model from a base checkpoint fileset, plan for ~2.5× model size overall.

  • Clean up completed job artifacts or increase the PVC size (default: 200Gi at /var/run/scratch/job).

  • DPO/GRPO jobs also consume ephemeral node storage under /tmp via Ray workers — check node disk in addition to the PVC.

  • See Understanding NeMo Customizer: Models, Training, and Resources for full storage requirement details.
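The sizing rules above can be captured in a quick back-of-the-envelope estimator. This is a rough sketch using only the multipliers stated here (~3× model size for full SFT, ~1.5× for LoRA, ~2.5× overall when also deploying from a base checkpoint fileset); the function name and the bf16 sizing note are illustrative:

```python
def estimate_scratch_gib(model_size_gib: float, technique: str,
                         deploy_from_checkpoint: bool = False) -> float:
    """Rough scratch-disk estimate for a customization job.

    Multipliers follow the guidance above: ~3x model size for full SFT,
    ~1.5x for LoRA, and ~2.5x overall when the model is also deployed
    from a base checkpoint fileset.
    """
    if deploy_from_checkpoint:
        return 2.5 * model_size_gib
    multiplier = {"sft": 3.0, "lora": 1.5}[technique]
    return multiplier * model_size_gib

# Example: an 8B-parameter checkpoint is roughly 16 GiB in bf16, so a
# full SFT job needs on the order of 48 GiB of scratch space.
```

Compare the result against the PVC size (default 200Gi) before launching, and leave headroom for other jobs sharing the volume.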

Job fails with OOM (Out of Memory):

  1. Reduce micro_batch_size to 1

  2. Reduce batch_size

  3. Reduce max_seq_length from 2048 to 1024 or 512
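Applied to a job's hyperparameters, those three reductions look like the following. A sketch assuming the hyperparameter names used in this guide (`micro_batch_size`, `batch_size`, `max_seq_length`); where exactly they sit in your job spec may differ:

```python
# Memory-conservative hyperparameters for retrying after an OOM failure.
hyperparameters = {
    "micro_batch_size": 1,    # step 1: smallest per-GPU micro-batch
    "batch_size": 8,          # step 2: reduce the global batch size
    "max_seq_length": 1024,   # step 3: down from 2048; try 512 if OOM persists
}
```

Apply the steps in order and retry after each change; reducing `max_seq_length` also truncates long training examples, so prefer the batch-size reductions first.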

Training loss not decreasing:

  • Increase learning_rate (try 2e-4 or 5e-4)

  • Increase epochs

  • Verify data quality – inspect a few training examples manually
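For the last point, a few lines are enough to eyeball training examples. A sketch assuming the training data is JSONL with one example per line (adjust the path and record keys to your dataset):

```python
import json

def show_examples(path: str, n: int = 3) -> list:
    """Load and print the first n JSONL records for manual inspection."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(examples) >= n:
                break
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    for ex in examples:
        print(json.dumps(ex, indent=2)[:500])  # truncate very long records
    return examples
```

Look for malformed prompts, empty completions, or mislabeled examples — a handful of bad records is often easier to spot by eye than by metrics.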

Tool calling accuracy is low after fine-tuning:

  • Increase training data size (sample more from the filtered dataset)

  • Increase epochs from 2 to 3-4

  • Check that the evaluation dataset format matches what the model expects

  • Verify the base model supports tool calling (Llama 3.2 Instruct does)
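A lightweight format check can catch the dataset mismatch described in the third point before you re-run evaluation. A sketch assuming an OpenAI-style chat schema with a `messages` list and optional assistant `tool_calls` — if your evaluation dataset uses a different schema, adapt the keys accordingly:

```python
def check_record(record: dict) -> list:
    """Return a list of format problems found in one chat record.

    Assumes an OpenAI-style shape: {"messages": [{"role": ..., ...}]},
    where assistant messages may carry a "tool_calls" list.
    """
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in {"system", "user", "assistant", "tool"}:
            problems.append(f"message {i}: unexpected role {role!r}")
        if role == "assistant":
            for call in msg.get("tool_calls", []):
                if not call.get("function", {}).get("name"):
                    problems.append(f"message {i}: tool call without a function name")
    return problems
```

Run it over every evaluation record and investigate any non-empty result; a systematic schema mismatch usually shows up on the very first record.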

Deployment fails:

  • Verify the base model and adapter exist: sdk.models.retrieve(name=MODEL_NAME, workspace="default"). Note that the LoRA adapter appears in the base model’s adapters list, not as a separate model entity

  • Check deployment logs: sdk.inference.deployments.get_logs(name=deployment.name, workspace="default")

  • Ensure sufficient GPU resources for the model size
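The first check can be scripted against the retrieved model record. A sketch: the retrieval call is the `sdk.models.retrieve` shown above, but the `adapters` collection with per-adapter `name` fields is an assumed shape — verify against what your model objects actually expose:

```python
def adapter_present(model, adapter_name: str) -> bool:
    """True if adapter_name appears in the base model's adapters list.

    `model` is the object returned by a call like
    sdk.models.retrieve(name=MODEL_NAME, workspace="default"); the
    `adapters` attribute and its `name` field are assumed shapes.
    """
    adapters = getattr(model, "adapters", None) or []
    return any(getattr(a, "name", None) == adapter_name for a in adapters)
```

If the adapter is missing here, the customization job did not publish it — revisit the job status before debugging the deployment itself.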