Optimize for Tokens/GPU Throughput#
Learn how to use the NeMo Microservices Customizer to create a LoRA (Low-Rank Adaptation) customization job optimized for higher tokens/GPU throughput and lower runtime. In this tutorial, we'll use LoRA to fine-tune a model and leverage the sequence packing feature to improve GPU utilization and decrease fine-tuning runtime.
Note
The time to complete this tutorial is approximately 30 minutes. In this tutorial, you run a customization job. Job duration increases with the number of model parameters and the dataset size.
Prerequisites#
SQuAD dataset uploaded to NeMo MS Datastore and registered in NeMo MS Entity Store. Refer to the LoRA model customization tutorial for details on how to upload and register the dataset; a quick registration check is sketched after this list.
(Optional) Weights & Biases account and API key for enhanced visualization.
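If you want to confirm that the dataset is registered before creating the jobs, you can query the Entity Store. This is a minimal sketch, not a verified command: the ENTITY_HOSTNAME variable, the default namespace, and the /v1/datasets/{namespace}/{name} path are assumptions; substitute the values you used when registering the dataset.

```bash
# Hypothetical check: fetch the dataset entry from NeMo Entity Store.
# ENTITY_HOSTNAME, the "default" namespace, and the endpoint path are assumptions.
curl "https://${ENTITY_HOSTNAME}/v1/datasets/default/test-dataset" \
  --header 'Accept: application/json' | jq
```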
Create LoRA Customization Jobs#
For this tutorial, you must create two LoRA customization jobs: one with sequence_packing_enabled set to true and another with the same field set to false.
Tip
For enhanced visualization, it's recommended to provide a WandB API key in the wandb-api-key HTTP header. Remove the wandb-api-key header from the request if WANDB_API_KEY is not set, or build the header conditionally as sketched below.
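One convenient way to follow this tip is to build the header argument conditionally, so the same curl command works whether or not a key is set. This is a small shell sketch; the WANDB_HEADER variable name is arbitrary and not part of the API.

```bash
# Convenience sketch: only send the wandb-api-key header when WANDB_API_KEY is set.
WANDB_HEADER=()
if [ -n "${WANDB_API_KEY}" ]; then
  WANDB_HEADER=(--header "wandb-api-key: ${WANDB_API_KEY}")
fi
# Pass "${WANDB_HEADER[@]}" in place of the explicit --header flag in the commands below.
```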
Create the customization job with sequence packing enabled:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>

curl --location \
  "https://${CUST_HOSTNAME}/v1/customization/jobs" \
  --header "wandb-api-key: ${WANDB_API_KEY}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {
      "name": "test-dataset"
    },
    "hyperparameters": {
      "sequence_packing_enabled": true,
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 32,
      "learning_rate": 0.00001,
      "lora": {
        "adapter_dim": 16
      }
    }
  }' | jq
```
Note the customization_id. You will need it later.

Create another customization job with sequence packing disabled:
```bash
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>

curl --location \
  "https://${CUST_HOSTNAME}/v1/customization/jobs" \
  --header "wandb-api-key: ${WANDB_API_KEY}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
    "config": "meta/llama-3.1-8b-instruct",
    "dataset": {
      "name": "test-dataset"
    },
    "hyperparameters": {
      "sequence_packing_enabled": false,
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 32,
      "learning_rate": 0.00001,
      "lora": {
        "adapter_dim": 16
      }
    }
  }' | jq
```
Note the customization_id. You will need it later; the sketch below shows one way to capture both IDs in a script.
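Rather than copying the IDs by hand, you can capture each one when you create the job. The sketch below assumes the creation response returns the job ID in a top-level id field and that the request body has been saved to a local JSON file (packed_job.json here is a hypothetical name); verify both against your deployment.

```bash
# Sketch: capture the customization job ID from the creation response.
# The top-level "id" field and the packed_job.json file are assumptions.
PACKED_JOB_ID=$(curl -s --location "https://${CUST_HOSTNAME}/v1/customization/jobs" \
  --header "wandb-api-key: ${WANDB_API_KEY}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data @packed_job.json | jq -r '.id')
echo "${PACKED_JOB_ID}"
```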
Monitor LoRA Customization Jobs#
Use the customization_id from each job to make a GET request for status details.
```bash
curl "https://${CUST_HOSTNAME}/v1/customization/jobs/${customizationID}/status" | jq
```
The response includes timestamped training and validation loss values. The validation loss should be similar for both jobs.
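If you prefer to wait for completion from a script, a simple polling loop works. This is a minimal sketch, assuming the status response exposes a top-level status field that reaches a terminal value such as completed or failed; adjust the field name and values to match the response you actually receive.

```bash
# Minimal polling sketch; the "status" field and its terminal values are assumptions.
while true; do
  STATUS=$(curl -s "https://${CUST_HOSTNAME}/v1/customization/jobs/${customizationID}/status" | jq -r '.status')
  echo "$(date): ${STATUS}"
  if [ "${STATUS}" = "completed" ] || [ "${STATUS}" = "failed" ]; then
    break
  fi
  sleep 60
done
```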
View Jobs in Weights & Biases#
To enable W&B integration, include your WandB API key in the request header when creating a customization job.
Then view your results at wandb.ai under the nvidia-nemo-customizer project.
Validation Loss Curves#
The validation loss curves should match closely for both jobs. The sequence-packed job should complete significantly faster.
GPU Utilization#
The sequence-packed job should show higher GPU utilization.
GPU Memory Allocation#
The sequence-packed job should show higher GPU memory allocation.
Sequence Packing Statistics#
Sequence packing statistics can be found under the run config in W&B.
Note
The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.
Next Steps#
Now that you have created an optimized customization job, you can evaluate the output using NeMo Microservices Evaluator.