Import and Fine-Tune Private HuggingFace Models#

Use this tutorial to learn how to import a private HuggingFace model into NeMo Customizer, fine-tune it with LoRA, and deploy it for inference.

Prerequisites#

Before you begin, ensure that you have:

  • Access to the NeMo Customizer service

  • Access to the NeMo Data Store

  • Access to the Deployment Management service

  • huggingface-cli installed on a machine with internet access

  • A HuggingFace model with a compatible architecture. Not all HuggingFace models are compatible with NeMo Customizer. This tutorial uses gemma-2-2b-it as an example, but success depends on architectural compatibility.

  • Sufficient storage space for the model files (typically 5-50GB depending on model size)

  • At least 8GB GPU memory for smaller models, more for larger models

Note

Verify that all required services are running and accessible before proceeding. You can check service health using the health endpoints documented in each service’s API specification.
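As a concrete illustration, a preflight check might look like the following sketch. It assumes each service exposes a `/health` endpoint at its base URL; the actual health paths and the base URLs below are placeholders, so consult each service's API specification for the real values.

```python
import requests

# Placeholder base URLs -- substitute your deployment's endpoints.
SERVICES = {
    "customizer": "http://nemo.test",
    "datastore": "http://data-store.test",
}

def health_url(base_url: str, path: str = "/health") -> str:
    """Build a health-check URL; the '/health' path is an assumption."""
    return base_url.rstrip("/") + path

def check_services(services: dict) -> dict:
    """Return {service_name: True/False} based on an HTTP 200 response."""
    results = {}
    for name, base in services.items():
        try:
            resp = requests.get(health_url(base), timeout=5)
            results[name] = resp.status_code == 200
        except requests.RequestException:
            results[name] = False
    return results
```

Running `check_services(SERVICES)` before the tutorial steps surfaces unreachable services early instead of midway through an upload.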

Versioning Best Practices#

This tutorial uses a MODEL_VERSION environment variable to prevent conflicts when re-running. Set a unique version for each run:

# Set a unique version for this run
export MODEL_VERSION="$(date +%Y%m%d-%H%M%S)"  # e.g., 20241201-143022
# OR increment manually
export MODEL_VERSION="1"  # 2, 3, etc.

This approach ensures that each tutorial run creates uniquely named resources without conflicts. The resource names in this tutorial append the version with a ‘v’ prefix, producing versions like @v1, @v2, etc.

Known Issues#

Warning

Conv1D Model Architecture Limitation: Models that use Conv1D layers are not compatible with NeMo Customizer’s AutoModel LoRA implementation.

Error signature: AttributeError: 'Conv1D' object has no attribute 'config'

Affected models include:

  • microsoft/DialoGPT-* series

  • openai-gpt models

  • Some older gpt2 variants

  • Other models with Conv1D-based architectures

Root cause: These models use Conv1D layers that lack the linear layers expected by NeMo’s LoRA transformation utilities.

Solution: Use modern transformer architectures instead:

  • Recommended: Llama models (3.1, 3.2, 3.3 series)

  • Recommended: Nemotron models

  • Recommended: Phi models

  • Alternative: Gemma models (used in this tutorial)

For a complete list of tested models, see the Model Catalog.
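Before importing, you can screen a downloaded model's config.json for known Conv1D-based architectures. The architecture set below is a heuristic covering the families called out above, not an exhaustive compatibility check.

```python
import json
from pathlib import Path

# Architecture class names known to use Conv1D layers (illustrative, not exhaustive).
CONV1D_ARCHITECTURES = {
    "GPT2LMHeadModel",        # gpt2 variants and DialoGPT checkpoints
    "OpenAIGPTLMHeadModel",   # openai-gpt
}

def uses_conv1d(architectures) -> bool:
    """Return True if any architecture name is a known Conv1D-based family."""
    return any(arch in CONV1D_ARCHITECTURES for arch in architectures)

def check_model_config(model_dir) -> bool:
    """Read config.json from a downloaded model and flag Conv1D architectures."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return uses_conv1d(config.get("architectures", []))
```

A `True` result means the model will likely fail LoRA transformation with the `AttributeError` shown above; a `False` result is not a compatibility guarantee.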


Download Model From HuggingFace Hub#

First, download the model from HuggingFace Hub to your local machine that has internet access.

# Set environment variables
export MODEL_NAME="google/gemma-2-2b-it"
export LOCAL_MODEL_PATH="./downloaded_models/gemma-2-2b-it"

# Download the model using huggingface-cli
huggingface-cli download ${MODEL_NAME} \
    --local-dir ${LOCAL_MODEL_PATH} \
    --local-dir-use-symlinks False

Note

Replace google/gemma-2-2b-it with your desired model. For private models, you’ll need to authenticate with huggingface-cli login first.

Important: Ensure your chosen model is compatible with NeMo Customizer’s architecture requirements before proceeding.


Create Model in DataStore#

Next, create a model repository in the NeMo Data Store and upload the downloaded model files.

Create Namespace and Model Repository#

import os
import time
import requests
from huggingface_hub import HfApi

# Set environment variables - Update these to match your deployment
NEMO_BASE_URL = os.getenv("NEMO_BASE_URL", "http://nemo.test")
DATASTORE_URL = os.getenv("DATASTORE_URL", "http://data-store.test")
MODEL_NAME = "gemma-2-2b-it"
NAMESPACE = "my-org"
MODEL_VERSION = os.getenv("MODEL_VERSION", f"{int(time.time())}")  # Unique version for this run

# Initialize HF API for datastore
# Note: Empty token is correct for internal datastore
hf_api = HfApi(endpoint=f"{DATASTORE_URL}/v1/hf", token="")

# Create namespace
namespace_url = f"{DATASTORE_URL}/v1/datastore/namespaces"
resp = requests.post(namespace_url, json={"namespace": NAMESPACE})
print(f"Namespace creation status: {resp.status_code}")

# Create model repository in datastore
repo_id = f"{NAMESPACE}/{MODEL_NAME}"
hf_api.create_repo(repo_id, repo_type="model", exist_ok=True)
model_info = hf_api.model_info(repo_id)
print(f"Created model repository: {repo_id}")
# Set environment variables - Update these to match your deployment
export NEMO_BASE_URL="http://nemo.test"
export DATASTORE_URL="http://data-store.test"
export MODEL_NAME="gemma-2-2b-it"
export NAMESPACE="my-org"
export MODEL_VERSION="$(date +%Y%m%d-%H%M%S)"  # Unique version for this run

# Create namespace
curl -X POST "${DATASTORE_URL}/v1/datastore/namespaces" \
  -H 'Content-Type: application/json' \
  -d '{"namespace": "'${NAMESPACE}'"}'

# Create model repository in datastore
curl -X POST "${DATASTORE_URL}/v1/hf/api/repos/create" \
  -H 'Content-Type: application/json' \
  -d '{
      "organization": "'${NAMESPACE}'", 
      "name": "'${MODEL_NAME}'",
      "type": "model"
  }'

Upload Model Files to DataStore#

Upload the downloaded model files to the datastore repository:

# Upload the entire model directory to datastore
LOCAL_MODEL_PATH = "./downloaded_models/gemma-2-2b-it"
repo_id = f"{NAMESPACE}/{MODEL_NAME}"

# Upload all files in the model directory
hf_api.upload_folder(
    repo_id=repo_id,
    folder_path=LOCAL_MODEL_PATH,
    repo_type="model",
    revision="main",
    commit_message=f"Upload {MODEL_NAME} model files"
)

print(f"Model files uploaded to {repo_id}")

Create Model Entity in Entity Store#

# Create model entity in Entity Store
model_payload = {
    "name": MODEL_NAME,
    "namespace": NAMESPACE,
    "description": f"Private {MODEL_NAME} model imported for customization",
    "artifact": {
        "files_url": f"hf://models/{repo_id}",
        "backend_engine": "hugging_face",
        "status": "upload_completed"
    },
    "spec": {
        "num_parameters": 2610000000,  # Gemma-2-2b has 2.61B parameters
        "context_size": 8192,  # Gemma 2 supports up to 8k context
        "is_chat": True,  # Gemma 2 is an instruct/chat model
        "num_virtual_tokens": -1
    }
}

entity_store_url = f"{NEMO_BASE_URL}/v1/models"
resp = requests.post(entity_store_url, json=model_payload)
print(f"Model entity creation status: {resp.status_code}")
if resp.status_code in [200, 201]:
    model_entity = resp.json()
    print(f"Created model entity: {model_entity['namespace']}/{model_entity['name']}")

Create Customization Target#

Create a customization target that references the uploaded model in the datastore.

from nemo_microservices import NeMoMicroservices

# Initialize the client with proper base URLs
client = NeMoMicroservices(
    base_url=os.environ.get('NEMO_BASE_URL', 'http://nemo.test'),
    inference_base_url=os.environ.get('NIM_PROXY_URL', 'http://nemo.test')
)

# Create customization target
target = client.customization.targets.create(
    name=f"{MODEL_NAME}@v{MODEL_VERSION}",
    namespace=NAMESPACE,
    description=f"Customization target for {MODEL_NAME}",
    enabled=True,
    model_uri=f"hf://{NAMESPACE}/{MODEL_NAME}@main",
    num_parameters=2610000000,  # Gemma-2-2b has 2.61B parameters
    precision="bf16-mixed"
)

print(f"Created target: {target.name}")
print(f"Target ID: {target.id}")
print(f"Status: {target.status}")
# Create customization target
curl -X POST \
  "${NEMO_BASE_URL}/v1/customization/targets" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
      "name": "'${MODEL_NAME}'@v'${MODEL_VERSION}'",
      "namespace": "'${NAMESPACE}'",
      "description": "Customization target for '${MODEL_NAME}'",
      "enabled": true,
      "model_uri": "hf://'${NAMESPACE}'/'${MODEL_NAME}'@main",
      "num_parameters": 46700000000,
      "precision": "bf16-mixed"
      }' | jq

Wait for the model to be downloaded and ready:

import time

# Check target status, with a bounded wait instead of polling forever
max_wait_seconds = 1800
start_time = time.time()
while True:
    target_details = client.customization.targets.retrieve(
        namespace=NAMESPACE,
        name=f"{MODEL_NAME}@v{MODEL_VERSION}"
    )

    print(f"Target status: {target_details.status}")

    if target_details.status == "ready":
        print("Model is ready for customization!")
        break
    elif target_details.status == "failed":
        print("Model download failed!")
        break
    elif time.time() - start_time > max_wait_seconds:
        print("Timed out waiting for the target to become ready")
        break

    time.sleep(30)  # Wait 30 seconds before checking again

Create Customization Configuration#

Create a configuration for LoRA fine-tuning:

# Create customization configuration
config = client.customization.configs.create(
    name=f"{MODEL_NAME}-lora-config@v{MODEL_VERSION}",
    namespace=NAMESPACE,
    target=f"{NAMESPACE}/{MODEL_NAME}@v{MODEL_VERSION}",
    description=f"LoRA configuration for {MODEL_NAME}",
    training_options=[
        {
            "training_type": "sft",
            "finetuning_type": "lora",
            "num_gpus": 1,  # Gemma-2-2b can run on single GPU
            "num_nodes": 1,
            "tensor_parallel_size": 1,
            "pipeline_parallel_size": 1,
            "micro_batch_size": 1,
            "global_batch_size": 8
        }
    ],
    training_precision="bf16-mixed",
    max_seq_length=4096,  # Use longer sequences for Gemma 2
    prompt_template="{prompt} {completion}"  # Gemma 2 uses standard instruction format
)

print(f"Created config: {config.name}")
print(f"Config ID: {config.id}")
# Create customization configuration
curl -X POST \
  "${NEMO_BASE_URL}/v1/customization/configs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
      "name": "'${MODEL_NAME}'-lora-config@v'${MODEL_VERSION}'",
      "namespace": "'${NAMESPACE}'",
      "target": "'${NAMESPACE}'/'${MODEL_NAME}'@v'${MODEL_VERSION}'",
      "description": "LoRA configuration for '${MODEL_NAME}'",
      "training_options": [
          {
              "training_type": "sft",
              "finetuning_type": "lora",
              "num_gpus": 2,
              "num_nodes": 1,
              "tensor_parallel_size": 2,
              "pipeline_parallel_size": 1,
              "micro_batch_size": 1,
              "global_batch_size": 8
          }
      ],
      "training_precision": "bf16-mixed",
      "max_seq_length": 4096,
      "prompt_template": "{prompt} {completion}"
      }' | jq

Prepare Training and Validation Datasets#

Before starting the customization job, you need to prepare both training and validation datasets. The validation dataset is required for monitoring training progress and preventing overfitting.

Create datasets in JSONL format:

# Example training data for Gemma 2 (instruction format)
# Gemma 2 uses standard chat message format
training_data = [
    {"messages": [{"role": "user", "content": "Hello, how are you?"}, {"role": "assistant", "content": "I'm doing well, thank you for asking! How can I assist you today?"}]},
    {"messages": [{"role": "user", "content": "What's the weather like?"}, {"role": "assistant", "content": "I don't have access to current weather data, but I can help you find weather information or discuss weather patterns."}]},
    {"messages": [{"role": "user", "content": "Tell me a joke"}, {"role": "assistant", "content": "Why don't scientists trust atoms? Because they make up everything!"}]},
    {"messages": [{"role": "user", "content": "Can you help me with coding?"}, {"role": "assistant", "content": "Absolutely! I'd be happy to help you with coding. What programming language or specific problem are you working on?"}]},
    {"messages": [{"role": "user", "content": "What is machine learning?"}, {"role": "assistant", "content": "Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed."}]},
    {"messages": [{"role": "user", "content": "How do I write a function in Python?"}, {"role": "assistant", "content": "To write a function in Python, use the 'def' keyword followed by the function name and parameters: def my_function(param): return result"}]},
    {"messages": [{"role": "user", "content": "What are the benefits of exercise?"}, {"role": "assistant", "content": "Exercise offers numerous benefits including improved cardiovascular health, stronger muscles, better mental health, and increased energy levels."}]},
    {"messages": [{"role": "user", "content": "Explain the solar system"}, {"role": "assistant", "content": "The solar system consists of the Sun and all celestial objects that orbit it, including eight planets, moons, asteroids, and comets."}]}
]

# Expand training data by repeating examples to create a larger dataset
training_data = training_data * 10  # 80 total examples (8 unique * 10 repetitions)

# Save training data to JSONL file
import json
with open("training_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")

# Create dataset in datastore (similar to model creation)
dataset_name = f"{MODEL_NAME}-training-data"
dataset_repo_id = f"{NAMESPACE}/{dataset_name}"

# Create dataset repository
hf_api.create_repo(dataset_repo_id, repo_type="dataset", exist_ok=True)

# Upload training data
hf_api.upload_file(
    path_or_fileobj="training_data.jsonl",
    path_in_repo="training.jsonl",
    repo_id=dataset_repo_id,
    repo_type="dataset",
    commit_message="Upload training data"
)

validation_data = [
    {"messages": [{"role": "user", "content": "How do you stay motivated?"}, {"role": "assistant", "content": "Staying motivated involves setting clear goals, celebrating small wins, and maintaining a positive mindset even when facing challenges."}]},
    {"messages": [{"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris, a beautiful city known for its art, culture, and iconic landmarks like the Eiffel Tower."}]},
    {"messages": [{"role": "user", "content": "Explain quantum physics"}, {"role": "assistant", "content": "Quantum physics is the branch of physics that studies matter and energy at the smallest scales, where particles behave in ways that seem strange compared to our everyday experience."}]},
    {"messages": [{"role": "user", "content": "What is photosynthesis?"}, {"role": "assistant", "content": "Photosynthesis is the process by which plants convert sunlight, carbon dioxide, and water into glucose and oxygen, providing energy for the plant."}]},
    {"messages": [{"role": "user", "content": "How does the internet work?"}, {"role": "assistant", "content": "The internet works through a global network of interconnected computers that communicate using standardized protocols to share information and resources."}]},
    {"messages": [{"role": "user", "content": "What is artificial intelligence?"}, {"role": "assistant", "content": "Artificial intelligence is the simulation of human intelligence in machines that are programmed to think, learn, and solve problems like humans."}]},
    {"messages": [{"role": "user", "content": "Describe the water cycle"}, {"role": "assistant", "content": "The water cycle is the continuous movement of water through evaporation, condensation, precipitation, and collection that sustains life on Earth."}]},
    {"messages": [{"role": "user", "content": "What are renewable energy sources?"}, {"role": "assistant", "content": "Renewable energy sources include solar, wind, hydroelectric, geothermal, and biomass energy that can be naturally replenished and don't deplete over time."}]}
]

# Expand validation data by repeating examples to create a larger dataset
validation_data = validation_data * 10  # 80 total examples (8 unique * 10 repetitions)

# Save validation data to JSONL file
with open("validation_data.jsonl", "w") as f:
    for item in validation_data:
        f.write(json.dumps(item) + "\n")

# Upload validation data
hf_api.upload_file(
    path_or_fileobj="validation_data.jsonl",
    path_in_repo="validation.jsonl",
    repo_id=dataset_repo_id,
    repo_type="dataset",
    commit_message="Upload validation data"
)

# Create dataset entity in Entity Store
dataset_payload = {
    "name": dataset_name,
    "namespace": NAMESPACE,
    "files_url": f"hf://datasets/{dataset_repo_id}"
}

resp = requests.post(f"{NEMO_BASE_URL}/v1/datasets", json=dataset_payload)
print(f"Dataset creation status: {resp.status_code}")

Start Customization Job#

Start the LoRA fine-tuning job:

# Start customization job
job = client.customization.jobs.create(
    name=f"{MODEL_NAME}-lora-job",
    config=f"{NAMESPACE}/{MODEL_NAME}-lora-config@v{MODEL_VERSION}",
    dataset=f"{NAMESPACE}/{dataset_name}",
    output_model=f"{NAMESPACE}/{MODEL_NAME}-lora@v{MODEL_VERSION}",
    description=f"LoRA fine-tuning job for {MODEL_NAME}",
    hyperparameters={
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 3,
        "batch_size": 8,  # Minimum required batch size
        "learning_rate": 5e-5,
        "lora": {
            "adapter_dim": 16,
            "alpha": 32,
            "adapter_dropout": 0.1
        }
    }
)

print(f"Started job: {job.name}")
print(f"Job ID: {job.id}")
print(f"Status: {job.status}")
# Start customization job
export DATASET_NAME="${MODEL_NAME}-training-data"
curl -X POST \
  "${NEMO_BASE_URL}/v1/customization/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
      "name": "'${MODEL_NAME}'-lora-job",
      "config": "'${NAMESPACE}'/'${MODEL_NAME}'-lora-config@v'${MODEL_VERSION}'",
      "dataset": "'${NAMESPACE}'/'${DATASET_NAME}'",
      "output_model": "'${NAMESPACE}'/'${MODEL_NAME}'-lora@v'${MODEL_VERSION}'",
      "description": "LoRA fine-tuning job for '${MODEL_NAME}'",
      "hyperparameters": {
          "training_type": "sft",
          "finetuning_type": "lora",
          "epochs": 3,
          "batch_size": 8,
          "learning_rate": 5e-5,
          "lora": {
              "adapter_dim": 16,
              "alpha": 32,
              "adapter_dropout": 0.1
          }
      }
      }' | jq

Monitor the job progress:

# Monitor job status (customization jobs are looked up by their job ID)
while True:
    job_details = client.customization.jobs.retrieve(job.id)
    
    print(f"Job status: {job_details.status}")
    
    if job_details.status == "completed":
        print("Training completed successfully!")
        break
    elif job_details.status == "failed":
        print("Training failed!")
        break
    
    time.sleep(60)  # Wait 1 minute before checking again

Deploy the Fine-Tuned Model#

Once training is complete, deploy the fine-tuned model using the Deployment Management service.

LoRA Adapter Deployment#

LoRA adapters are deployed alongside the base model. The NIM service automatically loads the adapter when the base model is deployed with the appropriate environment variables.

Note

LoRA adapters are automatically discovered and loaded by NIM when the following environment variables are configured:

  • NIM_PEFT_SOURCE: Points to the entity store URL (default: http://nemo-entity-store:8000)

  • NIM_PEFT_REFRESH_INTERVAL: Refresh interval in seconds (default: 30)

These are set by default in the deployment service when LoRA support is enabled.
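If you ever need to set these explicitly (for example, in a custom deployment manifest), the values described in the note can be expressed as a plain environment map. The defaults below come from the note above; treat this as a sketch to adapt to your cluster.

```python
def peft_env(entity_store_url="http://nemo-entity-store:8000", refresh_interval=30):
    """Environment variables enabling NIM's automatic LoRA adapter discovery."""
    return {
        "NIM_PEFT_SOURCE": entity_store_url,                  # where adapters are looked up
        "NIM_PEFT_REFRESH_INTERVAL": str(refresh_interval),   # poll period in seconds
    }
```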

# Create deployment configuration that references the base model
# LoRA adapters are automatically loaded via NIM_PEFT_SOURCE
deployment_config = {
    "name": f"{MODEL_NAME}-deployment-config",
    "namespace": NAMESPACE,
    "model": f"{NAMESPACE}/{MODEL_NAME}@v{MODEL_VERSION}",  # Reference the versioned base model
    "nim_deployment": {
        "image_name": "nvcr.io/nvidia/nemo-microservices/nim-service",
        "image_tag": "latest",
        "gpu": 1,  # Gemma-2-2b can run on single GPU
        # LoRA support is enabled by default
        # The following environment variables are automatically configured:
        # - NIM_PEFT_SOURCE: http://nemo-entity-store:8000
        # - NIM_PEFT_REFRESH_INTERVAL: 30
    }
}

# Create deployment configuration
deployment_config_url = f"{NEMO_BASE_URL}/v1/deployment/configs"
resp = requests.post(deployment_config_url, json=deployment_config)
print(f"Deployment config creation status: {resp.status_code}")

# Create model deployment
deployment_data = {
    "name": f"{MODEL_NAME}-deployment",
    "namespace": NAMESPACE,
    "config": f"{NAMESPACE}/{MODEL_NAME}-deployment-config"
}

model_deployment_url = f"{NEMO_BASE_URL}/v1/deployment/model-deployments"
resp = requests.post(model_deployment_url, json=deployment_data)
print(f"Model deployment status: {resp.status_code}")

if resp.status_code in (200, 201):
    deployment = resp.json()
    print(f"Created deployment: {deployment['name']}")
# Create deployment configuration that references the base model
curl -X POST \
  "${NEMO_BASE_URL}/v1/deployment/configs" \
  -H 'Content-Type: application/json' \
  -d '{
      "name": "'${MODEL_NAME}'-deployment-config",
      "namespace": "'${NAMESPACE}'",
      "model": "'${NAMESPACE}'/'${MODEL_NAME}'@v'${MODEL_VERSION}'",
      "nim_deployment": {
          "image_name": "nvcr.io/nim/nvidia/llm-nim",
          "image_tag": "1.12.0",
          "gpu": 2
      }
  }'

# Create model deployment
curl -X POST \
  "${NEMO_BASE_URL}/v1/deployment/model-deployments" \
  -H 'Content-Type: application/json' \
  -d '{
      "name": "'${MODEL_NAME}'-deployment",
      "namespace": "'${NAMESPACE}'",
      "config": "'${NAMESPACE}'/'${MODEL_NAME}'-deployment-config"
  }'

Monitor Deployment Status#

Wait for the deployment to be ready:

# Monitor deployment status
deployment_name = f"{MODEL_NAME}-deployment"
max_retries = 20
retry_interval = 30

for attempt in range(max_retries):
    resp = requests.get(f"{model_deployment_url}/{NAMESPACE}/{deployment_name}")
    
    if resp.status_code == 200:
        deployment_status = resp.json()
        status = deployment_status.get('status_details', {}).get('status', 'unknown')
        print(f"Deployment status: {status}")
        
        if deployment_status.get("deployed") is True and status == "ready":
            print("Deployment is ready!")
            break
    
    time.sleep(retry_interval)
else:
    print("Deployment did not become ready within the expected time")

Test the Deployed Model#

Test the deployed fine-tuned model to ensure it’s working correctly:

# Test the deployed model
nim_proxy_url = os.getenv("NIM_PROXY_URL", "http://nemo.test")

# First, check if the model is available (OpenAI-compatible routes live under /v1)
models_response = requests.get(f"{nim_proxy_url}/v1/models")
available_models = models_response.json()

model_id = f"{NAMESPACE}/{MODEL_NAME}@v{MODEL_VERSION}"
model_found = False
for model in available_models.get("data", []):
    if model["id"] == model_id:
        model_found = True
        print(f"Model {model_id} is available for inference")
        break

if not model_found:
    print(f"Model {model_id} is not yet available")
    print("Available models:")
    for model in available_models.get("data", []):
        print(f"  - {model['id']}")

# Test chat completion
test_payload = {
    "model": model_id,
    "messages": [
        {
            "role": "user",
            "content": "Hello, can you help me?"
        }
    ],
    "max_tokens": 100,
    "temperature": 0.7
}

response = requests.post(
    f"{nim_proxy_url}/chat/completions",
    headers={"Content-Type": "application/json"},
    json=test_payload
)

if response.status_code == 200:
    result = response.json()
    print("Model response:")
    print(result["choices"][0]["message"]["content"])
else:
    print(f"Error: {response.status_code} - {response.text}")

Test with cURL#

# Set the NIM proxy URL
export NIM_PROXY_URL="http://nemo.test"

# Test model availability
curl -X GET "${NIM_PROXY_URL}/v1/models" | jq

# Test inference
curl -X POST "${NIM_PROXY_URL}/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{
      "model": "'${NAMESPACE}'/'${MODEL_NAME}'@v'${MODEL_VERSION}'",
      "messages": [
          {
              "role": "user",
              "content": "Hello, can you help me?"
          }
      ],
      "max_tokens": 100,
      "temperature": 0.7
  }'

Next Steps#

Learn how to check customization job metrics to monitor the training progress and performance of your fine-tuned model.