Import and Fine-Tune Private Hugging Face Models

Use this tutorial to learn how to import a private Hugging Face model into NeMo Customizer, fine-tune it with LoRA, and deploy it for inference.

Prerequisites

New to using NeMo Platform?

All platform resources—models, datasets, and more—must belong to a workspace. Workspaces provide organizational and authorization boundaries for your work. Within a workspace, you can optionally use projects to group related resources.

If you’re new to the platform, start with the Setup guide to learn how to deploy and evaluate models, and optimize agents using the platform end-to-end.

If you’re already familiar with workspaces and how to upload datasets to the platform, you can proceed directly with this tutorial.

For more information, see Workspaces and Projects.

Tutorial-Specific Prerequisites

Completed the Quickstart to install and deploy NeMo Platform locally.
Installed the Python SDK and any tutorial packages you need in your environment.
A Hugging Face token with access to the private or gated model repository.
A Hugging Face model with a compatible architecture. This tutorial uses google/gemma-2-2b-it as an example; other models work only if NeMo Customizer AutoModel LoRA supports their architecture (see Known Issues).
Sufficient GPU memory for the model and LoRA training job.

This tutorial uses the current NeMo Platform SDK resources: client.files.filesets, client.models, client.customization.automodel.jobs, and client.inference.gateway.

Known Issues

Conv1D Model Architecture Limitation: Models that use Conv1D layers are not compatible with NeMo Customizer AutoModel LoRA.

Error signature: AttributeError: 'Conv1D' object has no attribute 'config'

Affected models include:

microsoft/DialoGPT-* series
openai-gpt models
Some older gpt2 variants
Other models with Conv1D-based architectures

Root cause: These models implement their projections as Conv1D layers instead of the linear layers that NeMo’s LoRA transformation utilities expect.

Solution: Use modern transformer architectures instead:

Llama models (3.1, 3.2, 3.3 series)
Nemotron models
Phi models
Gemma models (used in this tutorial)

For a complete list of tested models, see the Model Catalog.
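
One way to catch this limitation before submitting a job is to inspect the model's config.json. The sketch below is a hypothetical pre-flight helper, not part of the NeMo Platform SDK. It assumes that the model_type values gpt2 and openai-gpt identify the Conv1D-based families listed above (DialoGPT checkpoints report gpt2); the set is illustrative, not exhaustive, and the function names are made up for this example.

```python
import json

# Assumption: model_type values whose Hugging Face implementations rely on
# Conv1D projections. Illustrative only; extend it for other affected families.
CONV1D_MODEL_TYPES = {"gpt2", "openai-gpt"}


def uses_conv1d(config: dict) -> bool:
    """Return True if the config's model_type is a known Conv1D-based family."""
    return str(config.get("model_type", "")).lower() in CONV1D_MODEL_TYPES


def check_config_file(path: str) -> bool:
    """Load a local config.json and apply the same check."""
    with open(path, encoding="utf-8") as f:
        return uses_conv1d(json.load(f))


print(uses_conv1d({"model_type": "gpt2"}))    # Conv1D-based family
print(uses_conv1d({"model_type": "gemma2"}))  # not in the Conv1D set
```

A False result only means the model type is not in this small set; it does not guarantee that the architecture is supported.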

Quick Start

1. Initialize the SDK

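The SDK needs your NeMo Platform server URL. Set NMP_BASE_URL to override the default used in this tutorial, http://localhost:8080, and set HF_TOKEN to a Hugging Face token that can read the model repository.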

export NMP_BASE_URL=<YOUR_NMP_BASE_URL>
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>

import json
import os
import time
import uuid
from pathlib import Path

from IPython.display import clear_output
from nemo_automodel_plugin.schema import (
    AutomodelJobInput,
    BatchSpec,
    DatasetSpec,
    OptimizerSpec,
    OutputRequest,
    ParallelismSpec,
    ScheduleSpec,
    TrainingSpec,
)
from nemo_platform import ConflictError, NeMoPlatform
from nemo_platform.types.files import HuggingfaceStorageConfigParam
from nemo_platform.types.secrets import PlatformSecretResponse


def max_wait_time_checker(seconds: int, job_name: str = ""):
    start_time = time.time()

    def check():
        if time.time() - start_time > seconds:
            raise TimeoutError(f"{job_name} took longer than {seconds} seconds")

    return check


NMP_BASE_URL = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
client = NeMoPlatform(base_url=NMP_BASE_URL, workspace="default")

2. Store the Hugging Face Token

Private and gated Hugging Face repositories require a token. Store it as a NeMo Platform secret and reference that secret from the Hugging Face fileset.

def create_or_get_secret(name: str, value: str | None, label: str) -> PlatformSecretResponse:
    if not value:
        raise ValueError(f"{label} is not set")
    try:
        secret = client.secrets.create(
            name=name,
            workspace="default",
            value=value,
        )
        print(f"Created secret: {name}")
        return secret
    except ConflictError:
        print(f"Secret '{name}' already exists, continuing...")
        return client.secrets.retrieve(name=name, workspace="default")


hf_secret = create_or_get_secret("hf-token", os.getenv("HF_TOKEN"), "HF_TOKEN")

3. Create a Model FileSet and Model Entity

Create a Hugging Face-backed fileset for the private model, then register a Model Entity that points to that fileset. The platform downloads the model files when a training job or deployment needs them.

HF_REPO_ID = "google/gemma-2-2b-it"
MODEL_NAME = "gemma-2-2b-it"
model_storage = HuggingfaceStorageConfigParam(
    type="huggingface",
    repo_id=HF_REPO_ID,
    repo_type="model",
    token_secret=hf_secret.name,
)

try:
    base_model_fs = client.files.filesets.create(
        workspace="default",
        name=MODEL_NAME,
        description=f"Private Hugging Face model {HF_REPO_ID}",
        storage=model_storage,
        cache=True,
    )
    print(f"Created model fileset: {base_model_fs.name}")
except ConflictError:
    print(f"Model fileset '{MODEL_NAME}' already exists, refreshing Hugging Face settings...")
    client.files.filesets.delete(workspace="default", name=MODEL_NAME)
    base_model_fs = client.files.filesets.create(
        workspace="default",
        name=MODEL_NAME,
        description=f"Private Hugging Face model {HF_REPO_ID}",
        storage=model_storage,
        cache=True,
    )

try:
    base_model = client.models.create(
        workspace="default",
        name=MODEL_NAME,
        fileset=f"default/{MODEL_NAME}",
        trust_remote_code=False,
    )
    print(f"Created Model Entity: {base_model.name}")
except ConflictError:
    print(f"Model Entity '{MODEL_NAME}' already exists, updating fileset if needed...")
    base_model = client.models.update(
        workspace="default",
        name=MODEL_NAME,
        fileset=f"default/{MODEL_NAME}",
        trust_remote_code=False,
    )

time_check = max_wait_time_checker(600, "ModelSpec")
while not base_model.spec:
    time_check()
    time.sleep(10)
    base_model = client.models.retrieve(workspace="default", name=MODEL_NAME)

if base_model.spec:
    base_model.spec.linear_layers = None
    base_model = client.models.update(
        workspace="default",
        name=MODEL_NAME,
        spec=base_model.spec,
        verbose=True,
    )
print(base_model.model_dump_json(indent=2))

4. Prepare and Upload Training Data

Create chat-format JSONL files and upload them to a fileset. The file service, not a dataset-specific SDK resource, is the current entry point for training data.

DATASET_NAME = f"{MODEL_NAME}-training-data"
DATASET_PATH = Path(DATASET_NAME).absolute()
DATASET_PATH.mkdir(parents=True, exist_ok=True)

training_data = [
    {
        "messages": [
            {"role": "user", "content": "Hello, how are you?"},
            {"role": "assistant", "content": "I'm doing well. How can I help you today?"},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "What is machine learning?"},
            {
                "role": "assistant",
                "content": "Machine learning is a way for computers to learn patterns from data.",
            },
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Can you help me with Python?"},
            {"role": "assistant", "content": "Yes. Share the code or problem you want to work on."},
        ]
    },
]

validation_data = [
    {
        "messages": [
            {"role": "user", "content": "What is artificial intelligence?"},
            {
                "role": "assistant",
                "content": "Artificial intelligence is software that performs tasks associated with human intelligence.",
            },
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Explain renewable energy."},
            {
                "role": "assistant",
                "content": "Renewable energy comes from naturally replenished sources such as sunlight, wind, and water.",
            },
        ]
    },
]

with open(DATASET_PATH / "training.jsonl", "w", encoding="utf-8") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")

with open(DATASET_PATH / "validation.jsonl", "w", encoding="utf-8") as f:
    for item in validation_data:
        f.write(json.dumps(item) + "\n")

try:
    client.files.filesets.create(
        workspace="default",
        name=DATASET_NAME,
        description="Private Hugging Face model LoRA training data",
        cache=True,
    )
    print(f"Created dataset fileset: {DATASET_NAME}")
except ConflictError:
    print(f"Dataset fileset '{DATASET_NAME}' already exists, continuing...")

client.files.fsspec.put(
    lpath=DATASET_PATH,
    rpath=f"default/{DATASET_NAME}/",
    recursive=True,
)
print(client.files.list(fileset=DATASET_NAME, workspace="default"))
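
Because a malformed record usually surfaces only after a training job starts, it can help to validate the JSONL files locally before uploading them. The following is a minimal sketch that uses only the standard library. The helper names and the set of allowed roles are assumptions made for illustration; they are not defined by the platform.

```python
import json

# Assumption: the roles the trainer accepts in chat-format data.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_record(record: dict) -> list[str]:
    """Return the problems found in one chat-format record (empty if valid)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    return problems


def validate_jsonl_text(text: str) -> dict[int, list[str]]:
    """Validate JSONL text and map 1-based line numbers to their problems."""
    errors: dict[int, list[str]] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors[line_no] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(record, dict):
            errors[line_no] = ["record is not a JSON object"]
            continue
        problems = validate_chat_record(record)
        if problems:
            errors[line_no] = problems
    return errors
```

In this tutorial, the check could be run as validate_jsonl_text((DATASET_PATH / "training.jsonl").read_text(encoding="utf-8")); an empty dictionary means no problems were found.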

5. Start a LoRA Customization Job

Submit a LoRA job to the Automodel backend. Reference the Model Entity and the dataset fileset as workspace/name.

job_suffix = uuid.uuid4().hex[:4]
JOB_NAME = f"{MODEL_NAME}-lora-job-{job_suffix}"
OUTPUT_NAME = f"{MODEL_NAME}-lora-{job_suffix}"

spec = AutomodelJobInput(
    model=f"default/{base_model.name}",
    dataset=DatasetSpec(
        training=f"default/{DATASET_NAME}",
        validation=f"default/{DATASET_NAME}",
    ),
    training=TrainingSpec(
        training_type="sft",
        finetuning_type="lora",
        max_seq_length=2048,
    ),
    schedule=ScheduleSpec(epochs=1),
    batch=BatchSpec(global_batch_size=64, micro_batch_size=1),
    optimizer=OptimizerSpec(learning_rate=5e-5),
    parallelism=ParallelismSpec(
        num_gpus_per_node=1,
        num_nodes=1,
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        context_parallel_size=1,
        expert_parallel_size=1,
    ),
    output=OutputRequest(name=OUTPUT_NAME),
)

job = client.customization.automodel.jobs.create(
    spec=spec,
    workspace="default",
    name=JOB_NAME,
)

print(f"Submitted job: {job.job.name}")
print(f"Output adapter: {OUTPUT_NAME}")
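
To see how the batch and parallelism settings relate, the sketch below works through the arithmetic under a common convention: the data-parallel size is the total GPU count divided by the product of the tensor, pipeline, and context parallel sizes, and the global batch size equals micro batch size × data-parallel size × gradient accumulation steps. Whether the Automodel backend derives these values in exactly this way is an assumption, and both function names are hypothetical.

```python
def data_parallel_size(num_nodes: int, num_gpus_per_node: int,
                       tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """GPUs left for data parallelism after model parallelism is applied."""
    total_gpus = num_nodes * num_gpus_per_node
    model_parallel = tp * pp * cp
    if total_gpus % model_parallel:
        raise ValueError("total GPUs must be divisible by tp * pp * cp")
    return total_gpus // model_parallel


def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp_size: int) -> int:
    """Micro-batches each data-parallel rank accumulates per optimizer step."""
    samples_per_pass = micro_batch_size * dp_size
    if global_batch_size % samples_per_pass:
        raise ValueError("global batch size must be divisible by micro batch size * dp size")
    return global_batch_size // samples_per_pass


# Values from the job spec in this tutorial: one node, one GPU, no model parallelism.
dp = data_parallel_size(num_nodes=1, num_gpus_per_node=1)
print(dp)                            # 1
print(grad_accum_steps(64, 1, dp))   # 64
```

Under this convention, the tutorial's settings mean each optimizer step accumulates 64 micro-batches of one sample on the single GPU.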

6. Track Training Progress

Poll job status until training reaches a terminal state.

time_check = max_wait_time_checker(3600, "Customization Job")

while True:
    time_check()
    status = client.jobs.get_status(name=job.job.name, workspace="default")
    clear_output(wait=True)
    print(f"Job status: {status.status}")

    step = max_steps = None
    training_phase = None
    for job_step in status.steps or []:
        if job_step.name == "training":
            for task in job_step.tasks or []:
                details = task.status_details if isinstance(task.status_details, dict) else {}
                step = details.get("step")
                max_steps = details.get("max_steps")
                training_phase = details.get("phase")
                break
            break

    if isinstance(step, (int, float)) and isinstance(max_steps, (int, float)) and max_steps:
        print(f"Training: step {step}/{max_steps} ({100 * step / max_steps:.1f}%)")
    if isinstance(training_phase, str):
        print(f"Phase: {training_phase}")

    if status.status in ("completed", "failed", "cancelled", "error"):
        print(f"\nJob finished: {status.status}")
        break

    time.sleep(10)

if status.status != "completed":
    raise RuntimeError(f"Training job finished with status: {status.status}")
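
The polling pattern above can be factored into a small reusable helper. The sketch below is one possible shape and is not part of the SDK. It takes any zero-argument callable that returns a status string, so it can be exercised without a running platform. The terminal states come from the loop above; the non-terminal names "pending" and "running" in the example are placeholders.

```python
import time
from typing import Callable

TERMINAL_STATES = ("completed", "failed", "cancelled", "error")


def poll_until_terminal(
    get_status: Callable[[], str],
    timeout_s: float,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no terminal state after {timeout_s} seconds (last: {status})")
        sleep(interval_s)


# Example with a fake status source; "pending" and "running" are placeholder names.
fake_states = iter(["pending", "running", "running", "completed"])
final = poll_until_terminal(lambda: next(fake_states), timeout_s=5, interval_s=0)
print(final)  # completed
```

In this tutorial, the callable could wrap the status call shown above, for example lambda: client.jobs.get_status(name=job.job.name, workspace="default").status. Injecting sleep as a parameter keeps the helper easy to test.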

7. Deploy the Base Model With LoRA Enabled

Create a deployment for the base model with LoRA support enabled. The LoRA adapter from training is served by the same deployment.

deploy_suffix = uuid.uuid4().hex[:4]
DEPLOYMENT_CONFIG_NAME = f"private-hf-lora-cfg-{deploy_suffix}"
DEPLOYMENT_NAME = f"private-hf-lora-{deploy_suffix}"

deployment_config = client.inference.deployment_configs.create(
    workspace="default",
    name=DEPLOYMENT_CONFIG_NAME,
    engine="vllm",
    model_spec={
        "model_namespace": "default",
        "model_name": MODEL_NAME,
        "lora_enabled": True,
    },
    executor_config={
        "gpu": 1,
        "image_name": "vllm/vllm-openai",
        "image_tag": "v0.22.1",
        "additional_args": ["--max-lora-rank", "32"],
    },
)

deployment = client.inference.deployments.create(
    workspace="default",
    name=DEPLOYMENT_NAME,
    config=deployment_config.name,
)

print(f"Deployment name: {deployment.name}")
print(f"Deployment status: {deployment.status}")

time_check = max_wait_time_checker(1800, "Deployment")
while True:
    time_check()
    time.sleep(15)
    deployment_status = client.inference.deployments.retrieve(
        name=deployment.name,
        workspace="default",
    )
    clear_output(wait=True)
    print(f"Deployment: {deployment.name}")
    print(f"Status: {deployment_status.status}")

    if deployment_status.status in ("RUNNING", "READY"):
        if not client.models.wait_for_gateway(deployment.name, workspace="default", timeout=60):
            raise RuntimeError("Inference gateway did not become ready")
        break
    if deployment_status.status in ("FAILED", "ERROR", "TERMINATED", "LOST"):
        raise RuntimeError(f"Deployment failed with status: {deployment_status.status}")

8. Test the Deployed Model

Call the inference gateway through the SDK. Use the base model name to test the original model and the output adapter name to test the LoRA-adapted model.

Python SDK

messages = [
    {"role": "user", "content": "Can you summarize what LoRA fine-tuning does?"}
]


def chat(model_id: str):
    return client.inference.gateway.provider.post(
        "v1/chat/completions",
        name=deployment.name,
        workspace="default",
        body={
            "model": model_id,
            "messages": messages,
            "temperature": 0.2,
            "max_tokens": 128,
        },
    )


BASE_INFERENCE_MODEL_NAME = f"default/{MODEL_NAME}"
INFERENCE_MODEL_NAME = f"default--{OUTPUT_NAME}"
base_response = chat(BASE_INFERENCE_MODEL_NAME)
lora_response = chat(INFERENCE_MODEL_NAME)

print("Base model response:")
print(base_response["choices"][0]["message"]["content"])
print("\nLoRA-adapted model response:")
print(lora_response["choices"][0]["message"]["content"])
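
The direct indexing above raises a KeyError or IndexError if the gateway returns an error payload instead of a completion. A more defensive way to read the reply is sketched below. It assumes the response is a dictionary in the chat-completions shape that the code above indexes; the helper name and the sample payload are hypothetical.

```python
def extract_reply(response: dict) -> str:
    """Return the first choice's message content, or raise a descriptive error."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError(f"response contains no choices: {response!r}")
    message = choices[0].get("message") or {}
    content = message.get("content")
    if content is None:
        raise ValueError("first choice has no message content")
    return content


# Hypothetical payload in the shape the tutorial code indexes.
sample_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "LoRA trains small adapter matrices."}}
    ]
}
print(extract_reply(sample_response))  # LoRA trains small adapter matrices.
```

With this helper, the final prints could call extract_reply(base_response) and extract_reply(lora_response), so that an unexpected payload fails with a readable message.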

Next Steps

Learn how to check customization job metrics to monitor training progress and performance for your fine-tuned model.