Import and Fine-Tune Private HuggingFace Models
Import and Fine-Tune Private HuggingFace Models
Use this tutorial to learn how to import a private HuggingFace model into NeMo Customizer, fine-tune it with LoRA, and deploy it for inference.
Prerequisites
New to using NeMo Platform?
All platform resources—models, datasets, and more—must belong to a workspace. Workspaces provide organizational and authorization boundaries for your work. Within a workspace, you can optionally use projects to group related resources.
If you’re new to the platform, start with the Setup guide to learn how to deploy and evaluate models, and optimize agents using the platform end-to-end.
If you’re already familiar with workspaces and how to upload datasets to the platform, you can proceed directly with this tutorial.
For more information, see Workspaces and Projects.
Tutorial-Specific Prerequisites
- Access to Data Store and Deployment Manager service
hfcli installed on a machine with internet access (installation instructions).- A HuggingFace model with a compatible architecture. Not all HuggingFace models are compatible with NeMo Customizer. This tutorial uses
gemma-2-2b-itas an example, but success depends on architectural compatibility. - A HuggingFace API token and proper authentication setup.
- Sufficient storage space for the model files (typically 5-50GB depending on model size)
- At least 8GB GPU memory for smaller models, more for larger models
Verify that all required services are running and accessible before proceeding. You can check service health using the health endpoints documented in each service’s API specification.
Known Issues
Conv1D Model Architecture Limitation: Models that use Conv1D layers are not compatible with NeMo Customizer AutoModel LoRA.
Error signature: AttributeError: 'Conv1D' object has no attribute 'config'
Affected models include:
microsoft/DialoGPT-*seriesopenai-gptmodels- Some older
gpt2variants - Other models with Conv1D-based architectures
Root cause: These models use Conv1D layers that lack the linear layers expected by NeMo’s LoRA transformation utilities.
Solution: Use modern transformer architectures instead:
- ✅ Recommended: Llama models (3.1, 3.2, 3.3 series)
- ✅ Recommended: Nemotron models
- ✅ Recommended: Phi models
- ✅ Alternative: Gemma models (used in this tutorial)
For a complete list of tested models, see the Model Catalog.
Download Model From HuggingFace Hub
- Authenticate to HuggingFace using
hf auth login. - Download the model.
Create Model in Data Store
Next, create a model repository in the NeMo Data Store and upload the downloaded model files.
Create Namespace and Model Repository
Upload Model Files to Data Store
Upload the downloaded model files to the Data Store repository:
Create Model Entity in Entity Store
After uploading the model files to the Data Store, create a model entity in the Entity Store to register the model with its metadata and specifications for use in customization jobs.
Deploy the Base Model
Deploy the base model for inference with LoRA adapter support enabled, allowing it to load fine-tuned adapters from customization jobs.
Create Customization Target
Create a customization target that references the uploaded model in the Data Store.
Wait for the model to be downloaded and ready:
Create Customization Configuration
Create a configuration for LoRA fine-tuning:
Prepare Training and Validation Datasets
Before starting the customization job, prepare both training and validation datasets. The validation dataset helps track training progress and reduce overfitting.
Create datasets in JSONL format:
Start Customization Job
Start the LoRA fine-tuning job. The job will create an output artifact with the name specified in output, which you’ll use later to access your fine-tuned model for inference.
Copy the following values from the response:
id(Job ID)spec.output.name
We’ll need them later to monitor the job’s status and access the fine-tuned model.
Check job progress:
Test the Deployed Model
After the customization job has been completed, you can use the output.name to access the fine-tuned model and evaluate its performance. The base model NIM deployment you created earlier will automatically load the LoRA adapter when you specify the LoRA model ID in your inference requests.
The inference endpoints use the inference_base_url configured during client initialization (typically the NIM proxy URL). The base model deployment must be running before you can test inference with LoRA adapters.
If you included a WandB API key, you can view your training results at wandb.ai under the nvidia-nemo-customizer project.
Python SDK
cURL
Next Steps
Learn how to check customization job metrics to monitor the training progress and performance of your fine-tuned model.