Troubleshooting NeMo Customizer
Job fails during model download:
- Verify the HuggingFace token secret is configured correctly
- Accept the model’s license on the HuggingFace model page
- Check job status:
client.customization.jobs.retrieve(name=job.name, workspace="default")
Job fails with disk full or 500 error when retrieving logs:
- The platform’s shared persistent volume is likely full. Customization jobs require significant disk space: ~3× model size for full SFT, ~1.5× for LoRA. If you are also deploying the model from a base checkpoint fileset, plan for ~2.5× model size overall.
- Clean up completed job artifacts or increase the PVC size (default: 200Gi at /var/run/scratch/job).
- DPO/GRPO jobs also consume ephemeral node storage under /tmp via Ray workers; check node disk in addition to the PVC.
- See ft-tut-understand-models for full storage requirement details.
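The storage multipliers above can be turned into a quick capacity check. This is a minimal sketch and not part of the NeMo Customizer API: the names `MULTIPLIERS`, `required_gib`, and `has_room` are hypothetical, and the multipliers are the approximate figures quoted in this section, so treat the result as a rough estimate.

```python
import shutil

# Approximate multipliers quoted in this section (rough planning figures, not exact).
MULTIPLIERS = {
    "full_sft": 3.0,         # ~3x model size for full SFT
    "lora": 1.5,             # ~1.5x model size for LoRA
    "with_deployment": 2.5,  # ~2.5x overall when also deploying from a base checkpoint fileset
}


def required_gib(model_size_gib: float, mode: str) -> float:
    """Estimate the disk space (GiB) a job needs for a model of the given size."""
    return model_size_gib * MULTIPLIERS[mode]


def has_room(path: str, model_size_gib: float, mode: str) -> bool:
    """Return True if the filesystem containing `path` has enough free space."""
    free_gib = shutil.disk_usage(path).free / 2**30
    return free_gib >= required_gib(model_size_gib, mode)
```

Run from a pod that mounts the volume, a call such as `has_room("/var/run/scratch/job", 16, "lora")` would check the PVC for a 16 GiB model (an illustrative size); for DPO/GRPO jobs the same check can be pointed at `/tmp` on the node.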
Job fails with OOM (Out of Memory):
- Reduce micro_batch_size to 1
- Reduce batch_size
- Reduce max_seq_length from 2048 to 1024 or 512
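One possible way to apply these reductions is one at a time, rerunning the job after each change. The sketch below is an assumption about ordering, not a documented procedure: `next_oom_step` is a hypothetical helper, the config is a plain dictionary rather than a NeMo Customizer object, and halving `batch_size` is just one illustrative way to reduce it.

```python
def next_oom_step(cfg: dict) -> dict:
    """Return a copy of cfg with the next memory-saving change applied.

    Order follows the list above: micro_batch_size first, then batch_size,
    then max_seq_length (halved, never below 512).
    """
    new = dict(cfg)  # leave the caller's config untouched
    if new["micro_batch_size"] > 1:
        new["micro_batch_size"] = 1
    elif new["batch_size"] > 1:
        new["batch_size"] = max(1, new["batch_size"] // 2)
    elif new["max_seq_length"] > 512:
        new["max_seq_length"] = max(512, new["max_seq_length"] // 2)
    return new  # unchanged copy once every setting is at its floor
```

If the function returns a config identical to its input, every setting is already at its floor and the remaining options are a smaller model or more GPU memory.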
Training loss not decreasing:
- Increase learning_rate (try 2e-4 or 5e-4)
- Increase epochs
- Verify data quality: inspect a few training examples manually
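For the manual inspection step, a small helper that pulls a few random records can be enough. This is a minimal sketch that assumes the training data is stored as a JSONL file (one JSON object per line); the function name `sample_examples` is hypothetical, and the file path is whatever your dataset uses.

```python
import json
import random


def sample_examples(path: str, k: int = 3, seed: int = 0) -> list:
    """Return up to k randomly chosen records from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines; each remaining line must be a valid JSON document.
        records = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)  # fixed seed so the same sample can be re-inspected
    return rng.sample(records, min(k, len(records)))
```

Printing each returned record with `json.dumps(record, indent=2)` makes truncated fields, empty responses, or malformed tool calls easier to spot. A `json.JSONDecodeError` raised while loading is itself a data-quality finding.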
Tool calling accuracy is low after fine-tuning:
- Increase training data size (sample more from the filtered dataset)
- Increase epochs from 2 to 3-4
- Check that the evaluation dataset format matches what the model expects
- Verify the base model supports tool calling (Llama 3.2 Instruct does)
Deployment fails:
- Verify the base model and adapter exist (the LoRA adapter appears in the base model’s adapters list, not as a separate model entity):
client.models.retrieve(name=MODEL_NAME, workspace="default")
- Check deployment logs:
client.inference.deployments.get_logs(name=deployment.name, workspace="default")
- Ensure sufficient GPU resources for the model size