Optimize for Tokens/GPU Throughput#
Learn how to use the NeMo Microservices Customizer to create a LoRA (Low-Rank Adaptation) customization job optimized for higher tokens/GPU throughput and lower runtime. In this tutorial, we'll use LoRA to fine-tune a model and leverage the sequence packing feature to improve GPU utilization and reduce fine-tuning runtime.
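Sequence packing concatenates several short training examples into a single fixed-length sequence so that fewer tokens are wasted on padding. The toy sketch below illustrates the general idea with a simple greedy packer; it is not the Customizer's internal algorithm, and the example token lists and max_seq_length value are made up for illustration.
# Toy illustration of the idea behind sequence packing (greedy first-fit).
# NOT the Customizer's internal algorithm; example data is made up.
def pack_sequences(sequences, max_seq_length=8):
    """Greedily pack token sequences into bins of at most max_seq_length tokens."""
    bins = []
    for seq in sorted(sequences, key=len, reverse=True):
        for packed in bins:
            if sum(len(s) for s in packed) + len(seq) <= max_seq_length:
                packed.append(seq)
                break
        else:
            bins.append([seq])
    return bins

examples = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10], [11, 12]]
print(pack_sequences(examples))
# Without packing, 5 sequences are each padded to the max length; with packing,
# the same tokens fit into fewer, fuller sequences, so less compute is spent on padding.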
Note
The time to complete this tutorial is approximately 30 minutes. In this tutorial, you run a customization job. Job duration increases with the number of model parameters and the dataset size.
Prerequisites#
Platform Prerequisites#
New to using NeMo microservices?
NeMo microservices use an entity management system to organize all resources—including datasets, models, and job artifacts—into namespaces and projects. Without setting up these organizational entities first, you cannot use the microservices.
If you’re new to the platform, complete these foundational tutorials first:
Get Started Tutorials: Learn how to deploy, customize, and evaluate models using the platform end-to-end
Set Up Organizational Entities: Learn how to create namespaces and projects to organize your work
If you’re already familiar with namespaces, projects, and how to upload datasets to the platform, you can proceed directly with this tutorial.
Learn more: Entity Concepts
NeMo Customizer Prerequisites#
Microservice Setup Requirements and Environment Variables
Before starting, make sure you have:
Access to NeMo Customizer
The huggingface_hub Python package installed
(Optional) Weights & Biases account and API key for enhanced visualization
Set up environment variables:
# Set up environment variables
export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"
export ENTITY_HOST="<your-entity-store-url>"
export DS_HOST="<your-datastore-url>"
export NAMESPACE="default"
export DATASET_NAME="test-dataset"
# Hugging Face environment variables (for dataset/model file management)
export HF_ENDPOINT="${DS_HOST}/v1/hf"
export HF_TOKEN="dummy-unused-value" # Or your actual HF token
# Optional monitoring
export WANDB_API_KEY="<your-wandb-api-key>"
Replace the placeholder values with your actual service URLs and credentials.
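As a quick sanity check, you can confirm the required variables are set and that the Python client initializes before continuing. This is a minimal sketch that only uses the variables and client constructor shown above; it does not call any endpoints.
import os
from nemo_microservices import NeMoMicroservices

# Confirm the environment variables used in this tutorial are set.
required = ["CUSTOMIZER_BASE_URL", "ENTITY_HOST", "DS_HOST", "NAMESPACE", "DATASET_NAME"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")

# Initialize the client against the Customizer service.
client = NeMoMicroservices(base_url=os.environ["CUSTOMIZER_BASE_URL"])
print("Client initialized for", os.environ["CUSTOMIZER_BASE_URL"])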
Tutorial-Specific Prerequisites#
SQuAD dataset uploaded to the NeMo Microservices Data Store and registered in the NeMo Microservices Entity Store. Refer to the LoRA model customization tutorial for details on how to upload and register the dataset.
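If you still need to upload the dataset, the sketch below shows one way to push files to the Data Store using the huggingface_hub client together with the HF_ENDPOINT and HF_TOKEN variables set earlier. The repository layout and local file names here are assumptions for illustration; follow the LoRA model customization tutorial for the authoritative upload and registration steps, including registering the dataset in the Entity Store.
import os
from huggingface_hub import HfApi

# Hypothetical local files; substitute your prepared SQuAD-formatted splits.
LOCAL_FILES = {
    "training/training.jsonl": "training.jsonl",
    "validation/validation.jsonl": "validation.jsonl",
}

# The Data Store exposes a Hugging Face-compatible API at HF_ENDPOINT.
api = HfApi(endpoint=os.environ["HF_ENDPOINT"], token=os.environ["HF_TOKEN"])

repo_id = f"{os.environ['NAMESPACE']}/{os.environ['DATASET_NAME']}"
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

for path_in_repo, local_path in LOCAL_FILES.items():
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
print(f"Uploaded files to dataset repo {repo_id}")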
Create LoRA Customization Jobs#
Important
The config field must include a version, for example: meta/llama-3.2-1b-instruct@v1.0.0+A100. Omitting the version will result in an error like:
{ "detail": "Version is not specified in the config URN: meta/llama-3.2-1b-instruct" }
You can find the correct config URN (with version) by inspecting the output of the /customization/configs endpoint. Use the name and version fields to construct the URN as name@version.
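For example, the following sketch lists the available configs and prints their URNs. The endpoint path and response shape (a data list with name and version fields) are assumptions here, as is the requirement that CUSTOMIZER_BASE_URL includes the scheme; adjust both to match your deployment.
import os
import requests

# List available customization configs and print their URNs as name@version.
# Assumptions: CUSTOMIZER_BASE_URL includes the scheme (e.g. https://...),
# and the response carries a "data" list with "name" and "version" fields.
response = requests.get(f"{os.environ['CUSTOMIZER_BASE_URL']}/v1/customization/configs")
response.raise_for_status()

for config in response.json().get("data", []):
    print(f"{config['name']}@{config['version']}")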
For this tutorial, you must create two LoRA customization jobs: one with sequence_packing_enabled set to true and another with the same field set to false.
Tip
For enhanced visualization, it's recommended to provide a WandB API key in the wandb-api-key HTTP header. Remove the wandb-api-key header from the request if WANDB_API_KEY is not set.
Create the customization job with sequence packing enabled:
from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['CUSTOMIZER_BASE_URL']
)

# Set up WandB API key for enhanced visualization
extra_headers = {}
if os.getenv('WANDB_API_KEY'):
    extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')

# Create a customization job with sequence packing enabled
job_packed = client.customization.jobs.create(
    config="meta/llama-3.2-1b-instruct@v1.0.0+A100",
    dataset={
        "namespace": "<namespace>",
        "name": "test-dataset"
    },
    hyperparameters={
        "sequence_packing_enabled": True,
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 10,
        "batch_size": 16,
        "learning_rate": 0.00001,
        "lora": {
            "adapter_dim": 16
        }
    },
    extra_headers=extra_headers
)

print(f"Created job with sequence packing enabled:")
print(f"  Job ID: {job_packed.id}")
print(f"  Status: {job_packed.status}")
print(f"  Output model: {job_packed.output_model}")

# Save the job ID for monitoring
packed_job_id = job_packed.id
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>

curl --location \
  "https://${CUSTOMIZER_BASE_URL}/v1/customization/jobs" \
  --header "wandb-api-key: ${WANDB_API_KEY}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
    "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
    "dataset": {
      "namespace": "<namespace>",
      "name": "test-dataset"
    },
    "hyperparameters": {
      "sequence_packing_enabled": true,
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 16,
      "learning_rate": 0.00001,
      "lora": {
        "adapter_dim": 16
      }
    }
  }' | jq
Note the customization_id. It will be needed later.
Create another customization job with sequence packing disabled:
from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['CUSTOMIZER_BASE_URL']
)

# Set up WandB API key for enhanced visualization
extra_headers = {}
if os.getenv('WANDB_API_KEY'):
    extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')

# Create a customization job with sequence packing disabled
job_unpacked = client.customization.jobs.create(
    config="meta/llama-3.2-1b-instruct@v1.0.0+A100",
    dataset={
        "namespace": "<namespace>",
        "name": "test-dataset"
    },
    hyperparameters={
        "sequence_packing_enabled": False,
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 10,
        "batch_size": 16,
        "learning_rate": 0.00001,
        "lora": {
            "adapter_dim": 16
        }
    },
    extra_headers=extra_headers
)

print(f"Created job with sequence packing disabled:")
print(f"  Job ID: {job_unpacked.id}")
print(f"  Status: {job_unpacked.status}")
print(f"  Output model: {job_unpacked.output_model}")

# Save the job ID for monitoring
unpacked_job_id = job_unpacked.id
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>

curl --location \
  "https://${CUSTOMIZER_BASE_URL}/v1/customization/jobs" \
  --header "wandb-api-key: ${WANDB_API_KEY}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
    "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
    "dataset": {
      "namespace": "<namespace>",
      "name": "test-dataset"
    },
    "hyperparameters": {
      "sequence_packing_enabled": false,
      "training_type": "sft",
      "finetuning_type": "lora",
      "epochs": 10,
      "batch_size": 16,
      "learning_rate": 0.00001,
      "lora": {
        "adapter_dim": 16
      }
    }
  }' | jq
Note the customization_id. It will be needed later.
Monitor LoRA Customization Jobs#
Use the customization_id from each job to make a GET request for status details.
# Monitor both jobs
def monitor_jobs(packed_job_id, unpacked_job_id):
    # Get status for sequence packing enabled job
    packed_status = client.customization.jobs.status(packed_job_id)
    print(f"Sequence Packing Enabled Job ({packed_job_id}):")
    print(f"  Status: {packed_status.status}")
    print(f"  Progress: {packed_status.status_details.percentage_done}%")
    print(f"  Epochs completed: {packed_status.status_details.epochs_completed}")

    # Get status for sequence packing disabled job
    unpacked_status = client.customization.jobs.status(unpacked_job_id)
    print(f"\nSequence Packing Disabled Job ({unpacked_job_id}):")
    print(f"  Status: {unpacked_status.status}")
    print(f"  Progress: {unpacked_status.status_details.percentage_done}%")
    print(f"  Epochs completed: {unpacked_status.status_details.epochs_completed}")

    # Compare training metrics if available
    if (packed_status.status_details.metrics and
            unpacked_status.status_details.metrics):
        packed_metrics = packed_status.status_details.metrics.metrics
        unpacked_metrics = unpacked_status.status_details.metrics.metrics

        print(f"\nTraining Metrics Comparison:")
        if packed_metrics.get("val_loss") and unpacked_metrics.get("val_loss"):
            packed_val_loss = packed_metrics["val_loss"][-1] if packed_metrics["val_loss"] else None
            unpacked_val_loss = unpacked_metrics["val_loss"][-1] if unpacked_metrics["val_loss"] else None
            print(f"  Packed Val Loss: {packed_val_loss}")
            print(f"  Unpacked Val Loss: {unpacked_val_loss}")

# Usage example (replace with your actual job IDs)
monitor_jobs("your-packed-job-id", "your-unpacked-job-id")
curl ${CUSTOMIZER_BASE_URL}/v1/customization/jobs/${customizationID}/status | jq
The response includes timestamped training and validation loss values. The expected validation loss for both jobs should be similar.
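If you want to wait for both jobs to finish and compare their wall-clock runtimes yourself, a simple polling sketch like the one below works. It reuses the client and the packed_job_id/unpacked_job_id variables from the earlier examples; the terminal status strings ("completed", "failed", "cancelled") are assumptions, so adjust them to the values your deployment reports.
import time

# Poll both jobs until each reaches a terminal state, recording roughly how
# long each one keeps running after polling starts.
TERMINAL = {"completed", "failed", "cancelled"}  # assumed status values

def wait_for_jobs(job_ids, poll_seconds=60):
    start = time.time()
    elapsed = {}
    pending = set(job_ids)
    while pending:
        for job_id in list(pending):
            status = client.customization.jobs.status(job_id)
            if str(status.status).lower() in TERMINAL:
                elapsed[job_id] = time.time() - start
                pending.discard(job_id)
        if pending:
            time.sleep(poll_seconds)
    return elapsed

# Start polling right after creating both jobs for a rough runtime comparison.
elapsed = wait_for_jobs([packed_job_id, unpacked_job_id])
for job_id, seconds in elapsed.items():
    print(f"{job_id}: finished after ~{seconds / 60:.1f} minutes of polling")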
View Jobs in Weights & Biases#
To enable the W&B integration, include your WandB API key in the wandb-api-key request header when creating a customization job.
Then view your results at wandb.ai under the nvidia-nemo-customizer project.
Validation Loss Curves#
The expected validation loss curves should match closely for both jobs.
The sequence-packed job should complete significantly faster than the unpacked job.
GPU Utilization#
The sequence-packed job should show higher GPU utilization.
GPU Memory Allocation#
The sequence-packed job should show higher GPU memory allocation.
Sequence Packing Statistics#
Sequence packing statistics can be found under the run's config in W&B.
Note
The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.
Next Steps#
Learn how to check customization job metrics to monitor training progress and compare performance between your sequence-packed and regular jobs. You can also evaluate the output using NeMo Microservices Evaluator.