
LoRA Model Customization Job#

Learn how to use the NeMo Platform to create a LoRA (Low-Rank Adaptation) customization job using a custom dataset. In this tutorial, we use LoRA to fine-tune a model for question answering on the SQuAD dataset.

LoRA is a parameter-efficient fine-tuning method that requires fewer computational resources than full fine-tuning. If you need full model fine-tuning instead, see the Full SFT Customization Job tutorial.

Time to complete: approximately 45 minutes. Job duration increases with model size and dataset size.

Prerequisites#

Before starting this tutorial, ensure you have:

  1. Completed the Quickstart to install and deploy NeMo Platform locally

  2. Installed the Python SDK (included with pip install nemo-platform)

  3. Installed the datasets package for loading SQuAD: pip install datasets

  4. At least one GPU with CUDA 12.8+

Quick Start#

1. Initialize SDK#

The SDK needs to know your NMP server URL. By default, http://localhost:8080 is used. If NMP is running elsewhere, set the NMP_BASE_URL environment variable:

export NMP_BASE_URL=<YOUR_NMP_BASE_URL>
import os
import json
import re
import time
import uuid
from pathlib import Path
from nemo_platform import NeMoPlatform, ConflictError
from nemo_platform.types.secrets import PlatformSecretResponse
from nemo_platform.types.files import HuggingfaceStorageConfigParam
from nemo_platform.types.customization import (
    CustomizationJobInputParam,
    DeploymentParamsParam,
    LoRaParamsParam,
    ParallelismParamsParam,
    SftTrainingParam,
)


def sanitize_name(prefix: str, name: str):
    """Sanitize model_name for deployment/config naming. Compatible with platform naming rules."""
    name = name.split("/")[-1]
    sanitized = re.sub(r"[^a-z0-9@.+_-]", "-", name.lower())
    sanitized = re.sub(r"-+", "-", sanitized).strip("-")
    return f"{prefix}-{sanitized}"[:59].rstrip("-")


def max_wait_time_checker(seconds: int, job_name: str = ""):
    """Return a check() that raises TimeoutError if called after `seconds` have elapsed."""
    start_time = time.time()

    def check():
        if time.time() - start_time > seconds:
            raise TimeoutError(f"{job_name} took longer than {seconds} seconds")

    return check


NMP_BASE_URL = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
sdk = NeMoPlatform(
    base_url=NMP_BASE_URL,
    workspace="default"
)
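The two helper functions above can be exercised on their own. A quick standalone sketch (the helpers are reproduced verbatim so it runs without the SDK or a live server):

```python
import re
import time


def sanitize_name(prefix: str, name: str):
    # Same logic as the tutorial helper: drop the org prefix, keep only
    # allowed characters, collapse dashes, and cap the length at 59 chars.
    name = name.split("/")[-1]
    sanitized = re.sub(r"[^a-z0-9@.+_-]", "-", name.lower())
    sanitized = re.sub(r"-+", "-", sanitized).strip("-")
    return f"{prefix}-{sanitized}"[:59].rstrip("-")


def max_wait_time_checker(seconds, job_name: str = ""):
    start_time = time.time()

    def check():
        if time.time() - start_time > seconds:
            raise TimeoutError(f"{job_name} took longer than {seconds} seconds")

    return check


print(sanitize_name("sft-deploy", "Qwen/Qwen3-0.6B"))  # sft-deploy-qwen3-0.6b

check = max_wait_time_checker(0.05, "demo")
check()          # within the time budget: no error
time.sleep(0.1)
try:
    check()
except TimeoutError as err:
    print(err)   # demo took longer than 0.05 seconds
```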

2. Prepare Dataset#

Create your data in JSONL format (one JSON object per line). For SFT with LoRA, the platform expects prompt/completion pairs.

Dataset structure:

  • Training files under training/ (or root with training.jsonl)

  • Validation files under validation/ (or validation.jsonl)

SFT format: Each line is a JSON object with:

  • prompt: The input (e.g. context + question)

  • completion: The desired model output

Example record (single line in .jsonl):

{"prompt": "Context: ... Question: What is X? Answer:", "completion": "X is ..."}
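Before uploading a dataset, it can help to sanity-check that every line matches this shape. A minimal validator in plain Python (independent of the SDK; the function name is our own, not a platform API):

```python
import json


def validate_sft_record(line: str) -> dict:
    """Parse one JSONL line and verify it has the prompt/completion shape."""
    record = json.loads(line)
    for key in ("prompt", "completion"):
        if key not in record or not isinstance(record[key], str):
            raise ValueError(f"missing or non-string field: {key}")
    return record


sample = '{"prompt": "Context: ... Question: What is X? Answer:", "completion": "X is ..."}'
print(validate_sft_record(sample)["completion"])  # X is ...
```

Running this over each line of `training.jsonl` and `validation.jsonl` catches malformed records before the customization job does.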

Download SQuAD and Convert to SFT Format#

We use the SQuAD dataset and convert it to prompt/completion JSONL.

from datasets import load_dataset, DatasetDict

print("Loading dataset rajpurkar/squad")
raw_dataset = load_dataset("rajpurkar/squad")
if not isinstance(raw_dataset, DatasetDict):
    raise ValueError("Dataset does not contain expected splits")
print("Loaded dataset")

VALIDATION_PROPORTION = 0.05
SEED = 1234
training_size = 3000
validation_size = 300
DATASET_NAME = "sft-dataset"
DATASET_PATH = Path("sft-dataset").absolute()

os.makedirs(DATASET_PATH, exist_ok=True)
train_set = raw_dataset["train"]  # DatasetDict check above guarantees this split exists
split_dataset = train_set.train_test_split(test_size=VALIDATION_PROPORTION, seed=SEED)
train_ds = split_dataset["train"].select(range(min(training_size, len(split_dataset["train"]))))
validation_ds = split_dataset["test"].select(range(min(validation_size, len(split_dataset["test"]))))

def convert_squad_to_sft_format(example):
    prompt = f"Context: {example['context']} Question: {example['question']} Answer:"
    completion = example["answers"]["text"][0]
    return {"prompt": prompt, "completion": completion}

with open(f"{DATASET_PATH}/training.jsonl", "w", encoding="utf-8") as f:
    for example in train_ds:
        f.write(json.dumps(convert_squad_to_sft_format(example)) + "\n")
with open(f"{DATASET_PATH}/validation.jsonl", "w", encoding="utf-8") as f:
    for example in validation_ds:
        f.write(json.dumps(convert_squad_to_sft_format(example)) + "\n")

print(f"Saved training.jsonl with {len(train_ds)} rows")
print(f"Saved validation.jsonl with {len(validation_ds)} rows")
with open(f"{DATASET_PATH}/training.jsonl", "r") as f:
    sample = json.loads(f.readline())
print("Sample prompt (first 200 chars):", sample["prompt"][:200] + "...")
print("Sample completion:", sample["completion"])

3. Create FileSet and Upload Training Data#

Upload the training and validation JSONL files to a FileSet so the customization job can use them.

try:
    sdk.files.filesets.create(
        workspace="default",
        name=DATASET_NAME,
        description="SFT training data",
        cache=True,
    )
    print(f"Created fileset: {DATASET_NAME}")
except ConflictError:
    print(f"Fileset '{DATASET_NAME}' already exists, continuing...")

sdk.files.fsspec.put(
    lpath=DATASET_PATH,
    rpath=f"default/{DATASET_NAME}/",
    recursive=True
)
print("Training data:")
print(sdk.files.list(fileset=DATASET_NAME, workspace="default"))

4. Secrets Setup#

For Huggingface models that require authentication, create a secret containing your HF token. Generate a token in your Huggingface account settings and accept the model's license terms on its model card.

This is generally required for Llama-based models (e.g. Llama-3.2-1B-Instruct).

export HF_TOKEN=<your-huggingface-token>
HF_TOKEN = os.getenv("HF_TOKEN")

def create_or_get_secret(name: str, value: str | None, label: str) -> PlatformSecretResponse | None:
    if not value:
        print(f"{label} is not set - skipping setting secret")
        return None
    try:
        secret = sdk.secrets.create(name=name, workspace="default", data=value)
        print(f"Created secret: {name}")
        print(secret.model_dump_json(indent=2))
        return secret
    except ConflictError:
        print(f"Secret '{name}' already exists, continuing...")
        secret = sdk.secrets.retrieve(name=name, workspace="default")
        print(secret.model_dump_json(indent=2))
        return secret


hf_secret = create_or_get_secret("hf-token", HF_TOKEN, "HF_TOKEN")

5. Create Base Model FileSet and Model Entity#

Create a fileset pointing to Qwen/Qwen3-0.6B and a Model Entity that references it. Model download happens when the customization job runs.

HF_REPO_ID = "Qwen/Qwen3-0.6B"
MODEL_NAME = "qwen3-0.6b"

try:
    storage = HuggingfaceStorageConfigParam(
        type="huggingface",
        repo_id=HF_REPO_ID,
        repo_type="model",
    )
    if hf_secret:
        storage["token_secret"] = hf_secret.name
    base_model_fs = sdk.files.filesets.create(
        workspace="default",
        name=MODEL_NAME,
        description="Qwen3 0.6b base model from Huggingface",
        storage=storage,
        cache=True,
    )
except ConflictError:
    base_model_fs = sdk.files.filesets.retrieve(workspace="default", name=MODEL_NAME)

try:
    base_model = sdk.models.create(
        workspace="default",
        name=MODEL_NAME,
        fileset=f"default/{MODEL_NAME}",
        trust_remote_code=False,
    )
except ConflictError:
    sdk.models.update(
        workspace="default",
        name=MODEL_NAME,
        fileset=f"default/{MODEL_NAME}",
        trust_remote_code=False,
    )
    base_model = sdk.models.retrieve(workspace="default", name=MODEL_NAME)

print(f"Base model fileset: fileset://default/{base_model.name}")
print(sdk.files.list(fileset=MODEL_NAME, workspace="default"))

time_check = max_wait_time_checker(600, "Model Spec")
while not base_model.spec:
    time_check()
    time.sleep(10)
    base_model = sdk.models.retrieve(workspace="default", name=MODEL_NAME)

# Clear verbose linear_layers list for cleaner output
base_model.spec.linear_layers = None
print(f"ModelSpec: {base_model.spec}")

6. Create LoRA Customization Job#

Submit a customization job with training=SftTrainingParam(type="sft", peft=LoRaParamsParam(type="lora"), ...). Set lora_enabled=True in the deployment_config so the platform can deploy the base model with LoRA support automatically.

When lora_enabled is set to true for models deployed via the /apis/models endpoint, or via the deployment_config option of a customization job, all LoRA adapters (which are enabled by default) are automatically deployed in the NIM.

job_suffix = uuid.uuid4().hex[:4]
JOB_NAME = f"my-sft-job-{job_suffix}"

job = sdk.customization.jobs.create(
    name=JOB_NAME,
    workspace="default",
    spec=CustomizationJobInputParam(
        model=f"default/{base_model.name}",
        dataset=f"fileset://default/{DATASET_NAME}",
        training=SftTrainingParam(
            type="sft",
            epochs=2,
            batch_size=64,
            learning_rate=0.00005,
            max_seq_length=2048,
            parallelism=ParallelismParamsParam(
                num_gpus_per_node=1,
                num_nodes=1,
                tensor_parallel_size=1,
                pipeline_parallel_size=1,
                context_parallel_size=1,
                expert_parallel_size=1,
            ),
            micro_batch_size=1,
            peft=LoRaParamsParam(type="lora"),
        ),
        deployment_config=DeploymentParamsParam(
            lora_enabled=True,
            gpu=1,
            additional_envs={"NIM_MODEL_PROFILE": "vllm-lora"},
        ),
    ),
)
print(f"Job ID: {job.name}")
print(f"Output model: {job.spec.output.name}")

7. Track Training Progress#

Poll job status until it completes. Progress (step/max_steps) is shown when available.

from IPython.display import clear_output

time_check = max_wait_time_checker(3600, "Customization Job")
while True:
    time_check()
    status = sdk.customization.jobs.get_status(name=job.name, workspace="default")
    clear_output(wait=True)
    print(f"Job Status: {status.status}")
    step = max_steps = training_phase = None
    for job_step in status.steps or []:
        if job_step.name == "customization-training-job":
            for task in job_step.tasks or []:
                d = task.status_details or {}
                step, max_steps = d.get("step"), d.get("max_steps")
                training_phase = d.get("phase")
                break
            break
    if step is not None and max_steps is not None:
        print(f"Training: Step {step}/{max_steps} ({100 * step / max_steps:.1f}%)")
        if training_phase:
            print(f"Phase: {training_phase}")
    if status.status in ("completed", "failed", "cancelled", "error"):
        print(f"\nJob finished: {status.status}")
        break
    time.sleep(10)

assert status.status == "completed"

8. Validate Output Model and Deployment#

Because a deployment_config was provided, the platform creates a NIM deployment for the base model after training completes. The fine-tuned LoRA adapter is enabled by default and served automatically through that deployment. Check the model entity and the deployment status.

model_entity = sdk.models.retrieve(workspace="default", name=MODEL_NAME)
# Clear verbose linear_layers list for cleaner output
model_entity.spec.linear_layers = None
print(model_entity.model_dump_json(indent=2))

deployment_name = sanitize_name("sft-deploy", job.spec.model)
deployment_status = sdk.inference.deployments.retrieve(name=deployment_name, workspace="default")
print(f"Deployment status: {deployment_status.status}")

9. Monitor Deployment Until Ready#

Wait for the deployment to reach RUNNING/READY before sending inference requests.

TIMEOUT_MINUTES = 30
start_time = time.time()
time_check = max_wait_time_checker(TIMEOUT_MINUTES * 60, "Deployment")
print(f"Monitoring deployment '{deployment_name}'... (timeout {TIMEOUT_MINUTES} min)\n")

while True:
    time.sleep(15)
    time_check()
    deployment_status = sdk.inference.deployments.retrieve(name=deployment_name, workspace="default")
    elapsed = time.time() - start_time
    clear_output(wait=True)
    print(f"Deployment: {deployment_name}")
    print(f"Status: {deployment_status.status}")
    print(f"Elapsed: {int(elapsed // 60)}m {int(elapsed % 60)}s")
    if deployment_status.status in ("RUNNING", "READY"):
        print("\nDeployment is ready!")
        if not sdk.models.wait_for_gateway(deployment_name, workspace="default", timeout=60):
            raise RuntimeError("Inference gateway did not become ready")
        break
    if deployment_status.status in ("FAILED", "ERROR", "TERMINATED"):
        raise RuntimeError(f"Deployment failed with status: {deployment_status.status}")

assert deployment_status.status in ("RUNNING", "READY")

10. Check Model Output#

Send a chat completion request to the deployed LoRA model and compare the output to the expected answer.

context = "The Apollo 11 mission was the first manned mission to land on the Moon. It was launched on July 16, 1969, and Neil Armstrong became the first person to walk on the lunar surface on July 20, 1969. Buzz Aldrin joined him shortly after, while Michael Collins remained in lunar orbit."
question = "Who was the first person to walk on the Moon?"
messages = [
    {"role": "user", "content": f"Based on the following context, answer the question.\n\nContext: {context}\n\nQuestion: {question}"}
]
response = sdk.inference.gateway.provider.post(
    "v1/chat/completions",
    name=deployment_name,
    workspace="default",
    body={
        "model": job.spec.output.name,
        "messages": messages,
        "temperature": 0,
        "max_tokens": 256,
    }
)
print("=" * 60)
print("MODEL INFERENCE")
print("=" * 60)
print(f"Question: {question}")
print("Expected: Neil Armstrong")
print(f"Model output: {response['choices'][0]['message']['content']}")

Conclusion#

You have started a LoRA customization job, monitored it to completion, and evaluated the fine-tuned model. Use the output.name to access the model for further inference or evaluation.

Next Steps#