# Troubleshooting NeMo Customizer
**Job fails during model download:**

- Verify the HuggingFace token secret is configured correctly.
- Accept the model's license on the HuggingFace model page.
- Check job status:

  ```python
  sdk.customization.jobs.retrieve(name=job.name, workspace="default")
  ```
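When a download failure is intermittent, it can help to poll the status call above until the job reaches a terminal state. A minimal sketch; the terminal-state names and the idea of reading a status string off the response are assumptions, so check them against your SDK's response model:

```python
import time

def wait_for_job(retrieve, poll_seconds=5, max_polls=120):
    """Poll a job-status callable until it reports a terminal state.

    `retrieve` is any zero-argument callable returning a status string,
    e.g. lambda: sdk.customization.jobs.retrieve(name=job.name,
    workspace="default").status  (attribute name is an assumption).
    """
    for _ in range(max_polls):
        status = retrieve()
        if status in ("completed", "failed", "cancelled"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state")
```

Passing a callable keeps the helper independent of the SDK's exact method signature, so it works unchanged if the retrieval call or field names differ in your version.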
**Job fails with disk full or a 500 error when retrieving logs:**

The platform's shared persistent volume is likely full. Customization jobs require significant disk space: roughly 3× the model size for full SFT and roughly 1.5× for LoRA. If you are also deploying the model from a base checkpoint fileset, plan for about 2.5× the model size overall.

- Clean up completed job artifacts, or increase the PVC size (default: 200Gi at `/var/run/scratch/job`).
- DPO/GRPO jobs also consume ephemeral node storage under `/tmp` via Ray workers, so check node disk in addition to the PVC.
- See Understanding NeMo Customizer: Models, Training, and Resources for full storage requirement details.
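As a quick planning aid, the multipliers above can be turned into a back-of-the-envelope disk estimate. A small sketch; the multipliers come from this section and are rough guidance, not exact requirements:

```python
def estimate_disk_gib(model_size_gib: float, training: str,
                      deploy_from_fileset: bool = False) -> float:
    """Rough scratch-disk estimate in GiB for a customization job.

    Multipliers per the guidance above: ~3x model size for full SFT,
    ~1.5x for LoRA, and ~2.5x overall when also deploying the model
    from a base checkpoint fileset.
    """
    multiplier = {"full_sft": 3.0, "lora": 1.5}[training]
    if deploy_from_fileset:
        multiplier = max(multiplier, 2.5)
    return model_size_gib * multiplier

# e.g. a model with ~16 GiB of weights needs on the order of 48 GiB for full SFT
```

Compare the result against the PVC size (200Gi by default) before launching a large job.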
**Job fails with OOM (Out of Memory):**

- Reduce `micro_batch_size` to 1.
- Reduce `batch_size`.
- Reduce `max_seq_length` from 2048 to 1024 or 512.
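The three reductions combine naturally into one hyperparameter override. A sketch of the shape such a config might take; the key names mirror the parameters listed above, but the exact nesting and accepted values depend on your SDK version, so verify against its schema:

```python
# Memory-conservative settings per the checklist above. Key names mirror
# the parameters discussed; confirm the exact schema with your SDK.
oom_safe_hyperparameters = {
    "batch_size": 8,         # reduced global batch size
    "micro_batch_size": 1,   # smallest possible per-GPU batch
    "max_seq_length": 1024,  # halved from 2048; try 512 if OOM persists
}
```

Start by lowering `micro_batch_size` alone, since it directly controls per-GPU activation memory, and only shrink `max_seq_length` if your training examples tolerate truncation.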
**Training loss not decreasing:**

- Increase `learning_rate` (try 2e-4 or 5e-4).
- Increase `epochs`.
- Verify data quality: inspect a few training examples manually.
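For the manual data-quality check, a few lines of Python are enough to eyeball the first records of a JSONL training file. A minimal sketch, assuming one JSON object per line; adapt the field names you inspect to your dataset's schema:

```python
import json

def show_examples(path: str, n: int = 3) -> list:
    """Return the first n records of a JSONL training file for inspection."""
    examples = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= n:
                break
            examples.append(json.loads(line))
    return examples

# for ex in show_examples("train.jsonl"):
#     print(json.dumps(ex, indent=2))
```

Look for malformed turns, empty responses, or mismatched prompt/completion fields, since any of these can stall the loss.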
**Tool calling accuracy is low after fine-tuning:**

- Increase training data size (sample more from the filtered dataset).
- Increase `epochs` from 2 to 3-4.
- Check that the evaluation dataset format matches what the model expects.
- Verify the base model supports tool calling (Llama 3.2 Instruct does).
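To grow the training set, one option is to re-sample a larger slice of the filtered dataset. A minimal sketch, assuming the filtered data is already loaded as a list of records; the seeded RNG keeps runs reproducible:

```python
import random

def sample_training_data(filtered_dataset: list, n: int, seed: int = 42) -> list:
    """Draw n examples without replacement from the filtered dataset,
    falling back to the whole set when it has fewer than n records."""
    if n >= len(filtered_dataset):
        return list(filtered_dataset)
    rng = random.Random(seed)
    return rng.sample(filtered_dataset, n)
```

Sampling without replacement avoids duplicating examples, which would otherwise overweight some tool-call patterns during training.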
**Deployment fails:**

- Verify the base model and adapter exist:

  ```python
  sdk.models.retrieve(name=MODEL_NAME, workspace="default")
  ```

  The LoRA adapter appears in the base model's `adapters` list, not as a separate model entity.
- Check deployment logs:

  ```python
  sdk.inference.deployments.get_logs(name=deployment.name, workspace="default")
  ```

- Ensure sufficient GPU resources for the model size.
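Because the adapter is nested under the base model rather than registered as its own entity, a defensive check before deploying can catch a missing adapter early. A sketch assuming the retrieved model can be read as a dict whose `adapters` entries are either plain names or dicts with a `name` field; that response shape is an assumption, so adjust to what your SDK actually returns:

```python
def adapter_exists(model: dict, adapter_name: str) -> bool:
    """Return True if adapter_name appears in the base model's adapters list.

    Tolerates two assumed response shapes: adapters as plain name strings,
    or as dicts carrying a "name" field.
    """
    names = []
    for entry in model.get("adapters", []):
        names.append(entry["name"] if isinstance(entry, dict) else entry)
    return adapter_name in names
```

Run this against the output of the retrieve call above before starting a deployment, and fall back to the deployment logs if the adapter is present but the deployment still fails.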