Start a LoRA Model Customization Job#
Learn how to use the NeMo Microservices Platform to create a LoRA (Low-Rank Adaptation) customization job using a custom dataset. In this tutorial, we’ll use LoRA to fine-tune an email writing assistant model.
Note
The time to complete this tutorial is approximately 45 minutes. In this tutorial, you run a customization job. Job duration increases with the number of model parameters and the dataset size.
Prerequisites#
Before you can start a LoRA customization job, make sure that you have:
Access to a NeMo Customizer Microservice.
Completed the Manage Entities tutorial series, or set up a dedicated project.
The huggingface_hub Python package installed.
Select Model#
Find Available Configs#
First, we need to identify which model customization configurations are available. Each configuration describes a model and the corresponding training techniques you can choose from.
Get all customization configurations.
curl -X GET "https://${CUSTOMIZER_HOSTNAME}/v1/customization/configs" \
  -H 'Accept: application/json' | jq
Review the response to find a model that meets your requirements.
Example Response
{ "object": "list", "data": [ { "name": "meta/llama-3.1-8b-instruct", "namespace": "default", "dataset_schema": { "title": "NDJSONFile", "type": "array", "items": { "description": "Schema for Supervised Fine-Tuning (SFT) training data items.\n\nDefines the structure for training data used in SFT.", "properties": { "prompt": { "description": "The prompt for the entry", "title": "Prompt", "type": "string" }, "completion": { "description": "The completion to train on", "title": "Completion", "type": "string" } }, "required": [ "prompt", "completion" ], "title": "SFTDatasetItemSchema", "type": "object" }, "description": "Newline-delimited JSON (NDJSON) file containing MyModel objects" }, "training_options": [ { "training_type": "sft", "finetuning_type": "lora", "num_gpus": 1, "num_nodes": 1, "tensor_parallel_size": 1, "use_sequence_parallel": false }, { "training_type": "sft", "finetuning_type": "all_weights", "num_gpus": 8, "num_nodes": 1, "tensor_parallel_size": 8, "use_sequence_parallel": false } ] }
Note
For more information on the response fields, review the Customization Configs schema reference.
The response shows a single model option available for our fine-tuning job: meta/llama-3.1-8b-instruct. The model supports two training options: SFT with LoRA and full-parameter SFT (all_weights). Training with LoRA requires only one GPU, while full SFT requires eight. The cluster where this is deployed has eight GPUs per node, so both options are configured to use a single node. The supported dataset schema requires prompt and completion fields.
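If you prefer to inspect the configurations programmatically, the following sketch lists each config and its training options using requests. It assumes CUSTOMIZER_HOST holds the public URL of your NeMo Customizer service (the same value as CUSTOMIZER_HOSTNAME in the curl example above).

import requests

CUSTOMIZER_HOST = "https://your-customizer-hostname"  # assumption: replace with your NeMo Customizer URL

resp = requests.get(
    f"{CUSTOMIZER_HOST}/v1/customization/configs",
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Print each available config and the GPU count for every training option
for config in resp.json()["data"]:
    for option in config["training_options"]:
        print(
            f'{config["namespace"]}/{config["name"]}: '
            f'{option["training_type"]}/{option["finetuning_type"]} '
            f'({option["num_gpus"]} GPU(s))'
        )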
Determine Resource Requirements#
For this tutorial, we’ll proceed with the lora fine-tuning type since it has lower resource requirements.
Review Dataset Schema#
You can examine the dataset_schema field in the response to understand what data format your model requires. The schema outlines the specific fields and data types your dataset needs to include, formatted as Newline-Delimited JSON (NDJSON). In the next section, we’ll walk through an example to help you understand the schema structure.
Create Datasets#
Now that we know the dataset format that the model configuration expects, we can prepare our training and validation files and upload them to a dataset in NeMo Data Store.
Prepare Files#
Create two files, train.jsonl and validation.jsonl.
Populate the files with data that is compatible with the meta/llama-3.1-8b-instruct model.
Note
Each record should be on a single line in your .jsonl file, with no line breaks within the JSON objects.
{"prompt": "You are an email writting assistant, please help people write cogent emails.\n\nTell John I finally got the PS5 with 1TB from Target and will be home at 8 pm so we can play Call of Duty. State Jason told me it was fun.", "completion": "Dear John, I got it! I managed to find the new PS5 at Target. I will be home by 8pm so we can play Call of Duty on it. It actually is a lot of fun (or so has Jason told me). It is the 1tb one so I hope it serves us well. See you later. Jackson Lee"}
{"prompt": "You are an email writting assistant, please help people write cogent emails.\n\nCheck in with the horseback riding facility if my daughter can go for a trial ride to see if she likes it. Ask if the horses are gentle and what the capacity is for each lesson.","completion": "Hello! My 9-year-old is interested in horseback riding lessons. She's only ridden ponies at the beach, but had taken up a kean interest in horses lately. How many children are in each lesson group? Are the horses gentle? Would she be able to try a lesson before committing to a full six-month contract? Thank you for replying at your earliest convenience. Heather"}
Upload Training Data#
Initialize Client#
You need to upload the training files to the training path in NeMo Data Store, and the validation files to the validation path. You can have multiple files in each path, and all of them will be used.
Note
NeMo Customizer expects training files to be in the training folder and validation files to be in the validation folder. Make sure that you upload the files to the right directory.
To set up the Hugging Face API client, you’ll need these configuration values:
Host URL for the entity store service
Host URL for the data storage service
A namespace to organize your resources
Name of your dataset
from huggingface_hub import HfApi
import requests

# Configuration
ENTITY_HOST = "https://your-entity-store-url"  # Replace with the public URL of your Entity Store
DS_HOST = "https://your-datastore-url"         # Replace with the public URL of your Data Store
NAMESPACE = "default"
DATASET_NAME = "test-dataset"                  # Dataset name needs to be unique for the namespace

# Initialize Hugging Face API client
HF_API = HfApi(endpoint=f"{DS_HOST}/v1/hf", token="")
Create Namespaces#
Set the namespace we defined in our configuration values in both the NeMo Entity Store and the NeMo Data Store so that they match.
def create_namespaces(entity_host, ds_host, namespace):
    # Create namespace in entity store
    entity_store_url = f"{entity_host}/v1/namespaces"
    resp = requests.post(entity_store_url, json={"id": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store during Namespace creation: {resp.status_code}"

    # Create namespace in datastore
    nds_url = f"{ds_host}/v1/datastore/namespaces"
    resp = requests.post(nds_url, data={"namespace": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from datastore during Namespace creation: {resp.status_code}"

create_namespaces(ENTITY_HOST, DS_HOST, NAMESPACE)
Set Up Dataset Repository#
Create a dataset repository in NeMo Data Store.
def setup_dataset_repo(hf_api, namespace, dataset_name, entity_host):
    repo_id = f"{namespace}/{dataset_name}"

    # Create the repo in datastore
    hf_api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

    # Create dataset in entity store
    entity_store_url = f"{entity_host}/v1/datasets"
    payload = {
        "name": dataset_name,
        "namespace": namespace,
        "files_url": f"hf://datasets/{repo_id}"
    }
    resp = requests.post(entity_store_url, json=payload)
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store creating dataset: {resp.status_code}"
    return repo_id

repo_id = setup_dataset_repo(HF_API, NAMESPACE, DATASET_NAME, ENTITY_HOST)
Upload Files#
Upload the training and validation files to the dataset.
def upload_dataset_files(hf_api, repo_id):
    # Upload training file
    hf_api.upload_file(
        path_or_fileobj="train.jsonl",
        path_in_repo="training/training_file.jsonl",
        repo_id=repo_id,
        repo_type="dataset",
        revision="main",
        commit_message=f"Training file for {repo_id}"
    )

    # Upload validation file
    hf_api.upload_file(
        path_or_fileobj="validation.jsonl",
        path_in_repo="validation/validation_file.jsonl",
        repo_id=repo_id,
        repo_type="dataset",
        revision="main",
        commit_message=f"Validation file for {repo_id}"
    )

upload_dataset_files(HF_API, repo_id)
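As an optional sanity check, you can list the files in the dataset repository to confirm that both uploads landed in the expected training and validation paths. This sketch reuses the HF_API client and repo_id from above and assumes the Data Store's Hugging Face-compatible API supports file listing.

# List the files that now exist in the dataset repository
files = HF_API.list_repo_files(repo_id=repo_id, repo_type="dataset", revision="main")
print(files)

# Expect both uploaded files under their respective folders
assert "training/training_file.jsonl" in files
assert "validation/validation_file.jsonl" in files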
Checkpoint
At this point, we’ve uploaded our training and validation files to the dataset and are ready to define the details of our customization job.
Start Model Customization Job#
Set Hyperparameters#
While model customization configurations come with default settings, you can customize your training by specifying additional hyperparameters in the hyperparameters field of your customization job.
To train a LoRA model, we must:
Set the training_type to sft (Supervised Fine-Tuning).
Set the finetuning_type to lora.
Configure LoRA-specific hyperparameters such as adapter_dim and adapter_dropout.
Example configuration:
{
"hyperparameters": {
"training_type": "sft",
"finetuning_type": "lora",
"epochs": 10,
"batch_size": 16,
"learning_rate": 0.0001,
"lora": {
"adapter_dim": 8,
"adapter_dropout": 0.1
}
}
}
Note
For more information on hyperparameter options, review the Hyperparameter Options reference.
Create and Submit Training Job#
Use the following command to start a LoRA training job. Replace meta/llama-3.1-8b-instruct with your chosen model configuration and test-dataset with your dataset name.
Create a job using the model configuration (config), dataset, and hyperparameters we defined in the previous sections.

curl -X "POST" \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {"name": "test-dataset"},
    "hyperparameters": {
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 16,
      "learning_rate": 0.0001,
      "lora": {
        "adapter_dim": 8,
        "adapter_dropout": 0.1
      }
    }
  }' | jq
Review the response.
Example Response
{ "id": "cust-Pi95UoDbNcqwgkruAB8LY6", "created_at": "2025-02-19T20:10:06.278132", "updated_at": "2025-02-19T20:10:06.278133", "namespace": "default", "config": { "schema_version": "1.0", "id": "58bee815-0473-45d7-a5e6-fc088f6142eb", "namespace": "default", "created_at": "2025-02-19T20:10:06.454149", "updated_at": "2025-02-19T20:10:06.454160", "custom_fields": {}, "name": "meta/llama-3.1-8b-instruct", "base_model": "meta/llama-3.1-8b-instruct", "model_path": "llama-3_1-8b-instruct", "training_types": [ "sft" ], "finetuning_types": [ "lora" ], "precision": "bf16", "num_gpus": 4, "num_nodes": 1, "micro_batch_size": 1, "tensor_parallel_size": 1, "max_seq_length": 4096 }, "dataset": "default/test-dataset", "hyperparameters": { "finetuning_type": "lora", "training_type": "sft", "batch_size": 8, "epochs": 50, "learning_rate": 0.0001, "lora": { "adapter_dim": 8, "adapter_dropout": 0.1 } }, "output_model": "default/meta-llama-3.1-8b-instruct-test-dataset-lora@cust-Pi95UoDbNcqwgkruAB8LY6", "status": "created", "custom_fields": {} }
Copy the following values from the response:
id
output_model
We’ll need them later to monitor the job’s status and access the fine-tuned model.
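If you’re scripting the workflow instead of using curl, the sketch below submits the same job with requests and captures both values. It assumes CUSTOMIZER_HOST holds the public URL of your NeMo Customizer service.

import requests

CUSTOMIZER_HOST = "https://your-customizer-hostname"  # assumption: replace with your NeMo Customizer URL

job_payload = {
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {"name": "test-dataset"},
    "hyperparameters": {
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 10,
        "batch_size": 16,
        "learning_rate": 0.0001,
        "lora": {"adapter_dim": 8, "adapter_dropout": 0.1}
    }
}

resp = requests.post(f"{CUSTOMIZER_HOST}/v1/customization/jobs", json=job_payload)
resp.raise_for_status()
job = resp.json()

# Keep these for monitoring the job and for accessing the fine-tuned model later
JOB_ID = job["id"]
OUTPUT_MODEL = job["output_model"]
print(JOB_ID, OUTPUT_MODEL)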
Monitor Job Status#
After starting the job, you can monitor its status. This will provide details on the job’s progress, including metrics like train_loss and val_loss.
JOB_ID="cust-Pi95UoDbNcqwgkruAB8LY6"

curl -X "GET" \
  "https://${CUSTOMIZER_HOSTNAME}/v1/customization/jobs/${JOB_ID}/status" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' | jq
Review the response to check the job’s status.
Example Response
{ "id": "cust-Pi95UoDbNcqwgkruAB8LY6", "created_at": "2025-02-19T20:10:08.659192", "updated_at": "2025-02-19T20:10:08.659205", "config": { "schema_version": "1.0", "id": "58bee815-0473-45d7-a5e6-fc088f6142eb", "namespace": "default", "created_at": "2025-02-19T20:10:06.454149", "updated_at": "2025-02-19T20:10:06.454160", "custom_fields": {}, "name": "meta/llama-3.1-8b-instruct", "base_model": "meta/llama-3.1-8b-instruct", "model_path": "llama-3_1-8b-instruct", "training_types": [ "sft" ], "finetuning_types": [ "lora" ], "precision": "bf16", "num_gpus": 4, "num_nodes": 1, "micro_batch_size": 1, "tensor_parallel_size": 1, "max_seq_length": 4096 }, "dataset": "default/elby-test-dataset", "hyperparameters": { "finetuning_type": "lora", "training_type": "sft", "batch_size": 8, "epochs": 50, "learning_rate": 0.0001, "lora": { "adapter_dim": 8, "adapter_dropout": 0.1 } }, "output_model": "default/meta-llama-3.1-8b-instruct-elby-test-dataset-lora@cust-Pi95UoDbNcqwgkruAB8LY6", "status_details": { "created_at": "2025-02-19T20:10:08.658412", "updated_at": "2025-02-19T20:10:08.658412", "steps_completed": 0, "epochs_completed": 0, "percentage_done": 0.0, "best_epoch": null, "train_loss": null, "val_loss": null, "metrics": { "keys": [ "train_loss", "val_loss" ], "metrics": { "train_loss": [], "val_loss": [] } }, "status_logs": [ { "updated_at": "2025-02-19T20:10:08.658412", "message": null, "detail": null }, { "updated_at": "2025-02-19T20:10:53", "message": "EntityHandlerPending", "detail": "ENTITY_HANDLER_PENDING_0 The job is pending" }, { "updated_at": "2025-02-19T20:10:53", "message": "EntityHandlerCompleted", "detail": "ENTITY_HANDLER_COMPLETED_0 The job has completed" }, { "updated_at": "2025-02-19T20:10:53", "message": "TrainingJobCreated", "detail": "TRAINING_JOB_CREATED The job has been created" }, { "updated_at": "2025-02-19T20:10:53", "message": "TrainingJobPending", "detail": "TRAINING_JOB_PENDING The job is pending" }, { "updated_at": "2025-02-19T20:11:02", "message": "TrainingJobRunning", "detail": "TRAINING_JOB_PENDING The job is running" }, { "updated_at": "2025-02-19T20:10:08", "message": "PVCCreated", "detail": "PVC_CREATED The PVC has been created" }, { "updated_at": "2025-02-19T20:10:08", "message": "EntityHandlerCreated", "detail": "ENTITY_HANDLER_CREATED_0 The job has been created" }, { "updated_at": "2025-02-19T20:10:10", "message": "EntityHandlerRunning", "detail": "ENTITY_HANDLER_PENDING_0 The job is running" } ] }, "status": "failed", "custom_fields": {}, "ownership": { "created_by": "", "access_policies": {} } }
Tip
The epochs_completed field indicates how many epochs the job has run. Training stops early automatically if the best_epoch, as measured by val_loss, has not improved for 10 epochs. Consequently, a job can finish without epochs_completed reaching the epochs value in the job’s hyperparameters. In this case, percentage_done will be below 100.0 even though the job has completed.
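To wait for the job to finish from a script, you can poll the status endpoint until it reaches a terminal state. The sketch below is a minimal example, assuming the CUSTOMIZER_HOST and JOB_ID values from earlier steps and treating completed, failed, and cancelled as terminal statuses (adjust these if your deployment reports different status values).

import time
import requests

CUSTOMIZER_HOST = "https://your-customizer-hostname"  # assumption: replace with your NeMo Customizer URL
JOB_ID = "cust-Pi95UoDbNcqwgkruAB8LY6"                # replace with the id returned when you created the job

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}  # assumption: adjust to your deployment's status values

while True:
    resp = requests.get(f"{CUSTOMIZER_HOST}/v1/customization/jobs/{JOB_ID}/status")
    resp.raise_for_status()
    status = resp.json()

    details = status.get("status_details", {})
    print(
        f'status={status["status"]} '
        f'epochs_completed={details.get("epochs_completed")} '
        f'percentage_done={details.get("percentage_done")} '
        f'val_loss={details.get("val_loss")}'
    )

    if status["status"] in TERMINAL_STATUSES:
        break
    time.sleep(60)  # poll once a minute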
Conclusion#
You have successfully started a LoRA customization job. After the job is complete, you can use the output_model name to access the fine-tuned model and evaluate its performance.
Next Steps#
Learn how to check customization job metrics using the id.