# Troubleshooting NeMo Customizer

**Job fails during model download:**

* Verify the Hugging Face token secret is configured correctly
* Accept the model's license on the [Hugging Face model page](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
* Check job status: `client.customization.jobs.retrieve(name=job.name, workspace="default")`

**Job fails with a disk-full error, or log retrieval returns a 500 error:**

* The platform's shared persistent volume is likely full. Customization jobs require significant disk space: \~3× model size for full SFT, \~1.5× for LoRA. If you are also deploying the model from a base checkpoint fileset, plan for \~2.5× model size overall.
* Clean up completed job artifacts or increase the PVC size (default: 200Gi at `/var/run/scratch/job`).
* DPO/GRPO jobs also consume ephemeral node storage under `/tmp` through Ray workers, so check node disk usage in addition to the PVC.
* See [Understanding Models and Training](/documentation/customizer-reference/tutorials/understanding-models-and-training) for full storage requirement details.
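
The multipliers above can be turned into a rough pre-flight check. The following is a minimal sketch, not part of the NeMo SDK: the function names and the `DISK_MULTIPLIER` table are hypothetical, the multipliers are the approximate figures quoted above, and the free-space check only reflects the filesystem visible to the process that runs it (for example, a pod that mounts the PVC).

```python
import shutil

# Approximate rule-of-thumb multipliers quoted in the guidance above.
# The dictionary keys are hypothetical labels, not platform settings.
DISK_MULTIPLIER = {
    "full_sft": 3.0,         # ~3x model size for full SFT
    "lora": 1.5,             # ~1.5x model size for LoRA
    "with_deployment": 2.5,  # ~2.5x overall when also deploying from a base checkpoint fileset
}


def estimate_required_gib(model_size_gib: float, mode: str) -> float:
    """Approximate disk space a job needs, in GiB."""
    return model_size_gib * DISK_MULTIPLIER[mode]


def free_gib(path: str) -> float:
    """Free space on the filesystem that holds `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str, model_size_gib: float, mode: str) -> bool:
    """True if the filesystem at `path` has at least the estimated space free."""
    return free_gib(path) >= estimate_required_gib(model_size_gib, mode)
```

For example, `has_room("/var/run/scratch/job", 16, "full_sft")` checks the default PVC mount for a model of roughly 16 GiB, and `free_gib("/tmp")` gives a quick view of node-local space for DPO/GRPO jobs. Treat the result as an estimate only; actual usage depends on checkpoints and other jobs sharing the volume.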

**Job fails with OOM (Out of Memory):**

1. Reduce `micro_batch_size` to 1
2. Reduce `batch_size`
3. Reduce `max_seq_length` from 2048 to 1024 or 512
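
The order of these reductions can be sketched as a small helper that yields progressively smaller configurations to retry with. This is an illustrative sketch only: `oom_fallbacks` is a hypothetical name, the plain dictionary stands in for however you pass hyperparameters to the job, and halving `batch_size` once is an assumption, since the steps above do not say how far to reduce it.

```python
def oom_fallbacks(params: dict):
    """Yield successively more memory-friendly copies of `params`.

    Follows the order of the steps above: micro_batch_size first,
    then batch_size, then max_seq_length (1024, then 512).
    """
    p = dict(params)  # work on a copy so the caller's dict is untouched

    # Step 1: drop micro_batch_size to 1.
    if p["micro_batch_size"] > 1:
        p["micro_batch_size"] = 1
        yield dict(p)

    # Step 2: reduce batch_size (halved once here; the factor is an assumption).
    if p["batch_size"] > 1:
        p["batch_size"] = max(1, p["batch_size"] // 2)
        yield dict(p)

    # Step 3: shorten max_seq_length to 1024, then 512.
    for seq_len in (1024, 512):
        if p["max_seq_length"] > seq_len:
            p["max_seq_length"] = seq_len
            yield dict(p)
```

Each yielded dictionary is a candidate for the next retry; stop at the first configuration that trains without running out of memory, since shorter sequences truncate training examples.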

**Training loss not decreasing:**

* Increase `learning_rate` (try 2e-4 or 5e-4)
* Increase `epochs`
* Verify data quality -- inspect a few training examples manually
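
For the manual data check, a few random records are usually enough to spot formatting problems. The sketch below assumes the training data is a local JSONL file (one JSON object per line); that format, the file name in the usage note, and the helper name `sample_examples` are assumptions for illustration.

```python
import json
import random


def sample_examples(path: str, k: int = 3, seed: int = 0) -> list:
    """Return up to `k` randomly chosen records from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines; every other line must be a valid JSON object.
        rows = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)  # fixed seed so the same sample can be re-inspected
    return rng.sample(rows, min(k, len(rows)))
```

A typical use is `for row in sample_examples("training.jsonl"): print(json.dumps(row, indent=2))`, then reading each record for empty fields, truncated text, or mismatched roles. A `json.JSONDecodeError` raised while loading is itself a sign of a malformed line.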

**Tool calling accuracy is low after fine-tuning:**

* Increase training data size (sample more from the filtered dataset)
* Increase `epochs` from 2 to 3-4
* Check that the evaluation dataset format matches what the model expects
* Verify the base model supports tool calling (Llama 3.2 Instruct does)

**Deployment fails:**

* Verify the base model and adapter exist: `client.models.retrieve(name=MODEL_NAME, workspace="default")` -- the LoRA adapter appears in the base model's `adapters` list, not as a separate model entity
* Check deployment logs: `client.inference.deployments.get_logs(name=deployment.name, workspace="default")`
* Ensure sufficient GPU resources for the model size