Optimize for Tokens/GPU Throughput#
Learn how to use the NeMo Microservices Customizer to create a LoRA (Low-Rank Adaptation) customization job optimized for higher tokens/GPU throughput and lower runtime. In this tutorial, we'll use LoRA to fine-tune a model and leverage the sequence packing feature to improve GPU utilization and decrease fine-tuning runtime.
Note
The time to complete this tutorial is approximately 30 minutes. In this tutorial, you run a customization job. Job duration increases with the number of model parameters and the dataset size.
Prerequisites#
SQuAD dataset uploaded to NeMo MS Datastore and registered in NeMo MS Entity Store. Refer to the LoRA model customization tutorial for details on how to upload and register the dataset; a quick registration check is sketched after this list.
(Optional) Weights & Biases account and API key for enhanced visualization.
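If you want to confirm that the dataset is registered before creating the jobs, you can query the Entity Store. This is a minimal sketch, not a verified command: the ENTITY_HOSTNAME variable, the default namespace, and the /v1/datasets/{namespace}/{name} path are assumptions; substitute the values you used when registering the dataset.

```bash
# Hypothetical check: fetch the dataset entry from NeMo Entity Store.
# ENTITY_HOSTNAME, the "default" namespace, and the endpoint path are assumptions.
curl "https://${ENTITY_HOSTNAME}/v1/datasets/default/test-dataset" \
  --header 'Accept: application/json' | jq
```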
Create LoRA Customization Jobs#
For this tutorial, you must create two LoRA customization jobs: one with sequence_packing_enabled set to true and another with the same field set to false.
Tip
For enhanced visualization, it's recommended to provide a WandB API key in the wandb-api-key HTTP header. Remove the wandb-api-key header from the request if WANDB_API_KEY is not set, or build the header conditionally as sketched below.
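One convenient way to follow this tip is to build the header argument conditionally, so the same curl command works whether or not a key is set. This is a small shell sketch; the WANDB_HEADER variable name is arbitrary and not part of the API.

```bash
# Convenience sketch: only send the wandb-api-key header when WANDB_API_KEY is set.
WANDB_HEADER=()
if [ -n "${WANDB_API_KEY}" ]; then
  WANDB_HEADER=(--header "wandb-api-key: ${WANDB_API_KEY}")
fi
# Pass "${WANDB_HEADER[@]}" in place of the explicit --header flag in the commands below.
```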
Create the customization job with sequence packing enabled:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>

curl --location \
  "https://${CUST_HOSTNAME}/v1/customization/jobs" \
  --header "wandb-api-key: ${WANDB_API_KEY}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {
      "name": "test-dataset"
    },
    "hyperparameters": {
      "sequence_packing_enabled": true,
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 32,
      "learning_rate": 0.00001,
      "lora": {
        "adapter_dim": 16
      }
    }
  }' | jq
```
Note the customization_id. You will need it later.

Create another customization job with sequence packing disabled:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>

curl --location \
  "https://${CUST_HOSTNAME}/v1/customization/jobs" \
  --header "wandb-api-key: ${WANDB_API_KEY}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {
      "name": "test-dataset"
    },
    "hyperparameters": {
      "sequence_packing_enabled": false,
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 32,
      "learning_rate": 0.00001,
      "lora": {
        "adapter_dim": 16
      }
    }
  }' | jq
```
Note the customization_id. You will need it later; the sketch below shows one way to capture both IDs in a script.
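Rather than copying the IDs by hand, you can capture each one when you create the job. The sketch below assumes the creation response returns the job ID in a top-level id field and that the request body has been saved to a local JSON file (packed_job.json here is a hypothetical name); verify both against your deployment.

```bash
# Sketch: capture the customization job ID from the creation response.
# The top-level "id" field and the packed_job.json file are assumptions.
PACKED_JOB_ID=$(curl -s --location "https://${CUST_HOSTNAME}/v1/customization/jobs" \
  --header "wandb-api-key: ${WANDB_API_KEY}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data @packed_job.json | jq -r '.id')
echo "${PACKED_JOB_ID}"
```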
Monitor LoRA Customization Jobs#
Use the customization_id from each job to make a GET request for status details.
```bash
curl "https://${CUST_HOSTNAME}/v1/customization/jobs/${customizationID}/status" | jq
```
The response includes timestamped training and validation loss values. The validation loss should be similar for both jobs.
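If you prefer to wait for completion from a script, a simple polling loop works. This is a minimal sketch, assuming the status response exposes a top-level status field that reaches a terminal value such as completed or failed; adjust the field name and values to match the response you actually receive.

```bash
# Minimal polling sketch; the "status" field and its terminal values are assumptions.
while true; do
  STATUS=$(curl -s "https://${CUST_HOSTNAME}/v1/customization/jobs/${customizationID}/status" | jq -r '.status')
  echo "$(date): ${STATUS}"
  if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
    break
  fi
  sleep 60
done
```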
View Jobs in Weights & Biases#
To enable W&B integration, include your WandB API key in the request header when creating a customization job.
Then view your results at wandb.ai under the nvidia-nemo-customizer project.
Validation Loss Curves#
The validation loss curves should match closely for both jobs. The sequence-packed job should complete significantly faster.
GPU Utilization#
The sequence-packed job should show higher GPU utilization.
GPU Memory Allocation#
The sequence-packed job should show higher GPU memory allocation.
Sequence Packing Statistics#
Sequence packing statistics can be found under the run config in W&B.
Note
The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.
Next Steps#
Now that you have created an optimized customization job, you can evaluate the output using NeMo Microservices Evaluator.