Troubleshooting NeMo Customizer

Job fails during model download:

  • Verify the HuggingFace token secret is configured correctly

  • Accept the model’s license on the HuggingFace model page

  • Check job status: sdk.customization.jobs.retrieve(name=job.name, workspace="default")
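The status check above can be wrapped in a small polling loop so you see the failure as soon as it happens. A minimal sketch: the retrieval call is the `sdk.customization.jobs.retrieve` shown above, but the `status` attribute and the set of terminal status values are assumptions — adjust them to match what your job objects actually return:

```python
import time

def wait_for_job(retrieve_fn, name, workspace="default", poll_seconds=30,
                 terminal=frozenset({"completed", "failed", "cancelled"})):
    """Poll a customization job until it reaches a terminal status.

    retrieve_fn is expected to behave like sdk.customization.jobs.retrieve
    and return an object with a `status` attribute (assumed shape).
    """
    while True:
        job = retrieve_fn(name=name, workspace=workspace)
        if job.status in terminal:
            return job
        time.sleep(poll_seconds)
```

Usage would look like `final = wait_for_job(sdk.customization.jobs.retrieve, job.name)`; inspect `final` for the download error details.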

Job fails with a disk-full error, or retrieving its logs returns a 500 error:

  • The platform’s shared persistent volume is likely full. Customization jobs require significant disk space: ~3× model size for full SFT, ~1.5× for LoRA. If you are also deploying the model from a base checkpoint fileset, plan for ~2.5× model size overall.

  • Clean up completed job artifacts or increase the PVC size (default: 200Gi at /var/run/scratch/job).

  • DPO/GRPO jobs also consume ephemeral node storage under /tmp via Ray workers — check node disk in addition to the PVC.

  • See Understanding NeMo Customizer: Models, Training, and Resources for full storage requirement details.
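The sizing rules above can be captured in a quick back-of-the-envelope estimator. This is a rough sketch using only the multipliers stated here (~3× model size for full SFT, ~1.5× for LoRA, ~2.5× overall when also deploying from a base checkpoint fileset); the function name and the bf16 sizing note are illustrative:

```python
def estimate_scratch_gib(model_size_gib: float, technique: str,
                         deploy_from_checkpoint: bool = False) -> float:
    """Rough scratch-disk estimate for a customization job.

    Multipliers follow the guidance above: ~3x model size for full SFT,
    ~1.5x for LoRA, and ~2.5x overall when the model is also deployed
    from a base checkpoint fileset.
    """
    if deploy_from_checkpoint:
        return 2.5 * model_size_gib
    multiplier = {"sft": 3.0, "lora": 1.5}[technique]
    return multiplier * model_size_gib

# Example: an 8B-parameter checkpoint is roughly 16 GiB in bf16, so a
# full SFT job needs on the order of 48 GiB of scratch space.
```

Compare the result against the PVC size (default 200Gi) before launching, and leave headroom for other jobs sharing the volume.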

Job fails with OOM (Out of Memory):

  1. Reduce micro_batch_size to 1

  2. Reduce batch_size

  3. Reduce max_seq_length from 2048 to 1024 or 512
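Applied to a job's hyperparameters, those three reductions look like the following. A sketch assuming the hyperparameter names used in this guide (`micro_batch_size`, `batch_size`, `max_seq_length`); where exactly they sit in your job spec may differ:

```python
# Memory-conservative hyperparameters for retrying after an OOM failure.
hyperparameters = {
    "micro_batch_size": 1,    # step 1: smallest per-GPU micro-batch
    "batch_size": 8,          # step 2: reduce the global batch size
    "max_seq_length": 1024,   # step 3: down from 2048; try 512 if OOM persists
}
```

Apply the steps in order and retry after each change; reducing `max_seq_length` also truncates long training examples, so prefer the batch-size reductions first.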

Training loss not decreasing:

  • Increase learning_rate (try 2e-4 or 5e-4)

  • Increase epochs

  • Verify data quality – inspect a few training examples manually
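For the last point, a few lines are enough to eyeball training examples. A sketch assuming the training data is JSONL with one example per line (adjust the path and record keys to your dataset):

```python
import json

def show_examples(path: str, n: int = 3) -> list:
    """Load and print the first n JSONL records for manual inspection."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if len(examples) >= n:
                break
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    for ex in examples:
        print(json.dumps(ex, indent=2)[:500])  # truncate very long records
    return examples
```

Look for malformed prompts, empty completions, or mislabeled examples — a handful of bad records is often easier to spot by eye than by metrics.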

Tool calling accuracy is low after fine-tuning:

  • Increase training data size (sample more from the filtered dataset)

  • Increase epochs from 2 to 3-4

  • Check that the evaluation dataset format matches what the model expects

  • Verify the base model supports tool calling (Llama 3.2 Instruct does)
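A lightweight format check can catch the dataset mismatch described in the third point before you re-run evaluation. A sketch assuming an OpenAI-style chat schema with a `messages` list and optional assistant `tool_calls` — if your evaluation dataset uses a different schema, adapt the keys accordingly:

```python
def check_record(record: dict) -> list:
    """Return a list of format problems found in one chat record.

    Assumes an OpenAI-style shape: {"messages": [{"role": ..., ...}]},
    where assistant messages may carry a "tool_calls" list.
    """
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in {"system", "user", "assistant", "tool"}:
            problems.append(f"message {i}: unexpected role {role!r}")
        if role == "assistant":
            for call in msg.get("tool_calls", []):
                if not call.get("function", {}).get("name"):
                    problems.append(f"message {i}: tool call without a function name")
    return problems
```

Run it over every evaluation record and investigate any non-empty result; a systematic schema mismatch usually shows up on the very first record.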

Deployment fails:

  • Verify the base model and adapter exist: sdk.models.retrieve(name=MODEL_NAME, workspace="default"). Note that the LoRA adapter appears in the base model’s adapters list, not as a separate model entity

  • Check deployment logs: sdk.inference.deployments.get_logs(name=deployment.name, workspace="default")

  • Ensure sufficient GPU resources for the model size
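The first check can be scripted against the retrieved model record. A sketch: the retrieval call is the `sdk.models.retrieve` shown above, but the `adapters` collection with per-adapter `name` fields is an assumed shape — verify against what your model objects actually expose:

```python
def adapter_present(model, adapter_name: str) -> bool:
    """True if adapter_name appears in the base model's adapters list.

    `model` is the object returned by a call like
    sdk.models.retrieve(name=MODEL_NAME, workspace="default"); the
    `adapters` attribute and its `name` field are assumed shapes.
    """
    adapters = getattr(model, "adapters", None) or []
    return any(getattr(a, "name", None) == adapter_name for a in adapters)
```

If the adapter is missing here, the customization job did not publish it — revisit the job status before debugging the deployment itself.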