Optimize for Tokens/GPU Throughput#

Learn how to use the NeMo Microservices Customizer to create a LoRA (Low-Rank Adaptation) customization job optimized for higher tokens/GPU throughput and lower runtime. In this tutorial, we'll use LoRA to fine-tune a model and leverage the sequence packing feature to improve GPU utilization and reduce fine-tuning runtime.
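
Sequence packing concatenates several short training examples into a single sequence up to the model's maximum length, so fewer positions are wasted on padding tokens. The toy sketch below illustrates the idea with a greedy first-fit pass; it is only a conceptual illustration with made-up token lists, not the Customizer's actual implementation:

# Conceptual illustration of sequence packing (not the Customizer's implementation).
# Greedily packs tokenized examples into bins of at most max_len tokens,
# so fewer positions are wasted on padding.
def pack_sequences(examples, max_len=16):
    bins = []  # each bin holds examples whose combined length fits in max_len
    for tokens in sorted(examples, key=len, reverse=True):
        for b in bins:
            if sum(len(t) for t in b) + len(tokens) <= max_len:
                b.append(tokens)
                break
        else:
            bins.append([tokens])
    return bins

# Hypothetical token sequences of varying lengths
examples = [[1] * n for n in (10, 6, 5, 4, 3, 2)]
packed = pack_sequences(examples)
print(f"{len(examples)} examples packed into {len(packed)} sequences")
# Without packing, each example would occupy its own padded max_len sequence.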

Note

The time to complete this tutorial is approximately 30 minutes. You run two customization jobs; job duration increases with the number of model parameters and the dataset size.

Prerequisites#

  • SQuAD dataset uploaded to the NeMo Data Store and registered in the NeMo Entity Store. Refer to the LoRA model customization tutorial for details on how to upload and register the dataset. A quick verification sketch follows this list.

  • (Optional) Weights & Biases account and API key for enhanced visualization.
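
To confirm the dataset is registered before creating jobs, you can query the Entity Store directly. A minimal sketch; the ENTITY_STORE_BASE_URL environment variable and the /v1/datasets/{namespace}/{name} endpoint path are assumptions, so check your deployment's API reference:

import os
import requests

# Assumed env var and endpoint path; adjust for your deployment
entity_store_url = os.environ["ENTITY_STORE_BASE_URL"]
namespace, dataset_name = "<namespace>", "test-dataset"

resp = requests.get(f"{entity_store_url}/v1/datasets/{namespace}/{dataset_name}")
resp.raise_for_status()  # a 404 here means the dataset is not registered yet
print(resp.json())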


Create LoRA Customization Jobs#

Important

The config field must include a version, for example: meta/llama-3.2-1b-instruct@v1.0.0+A100. Omitting the version will result in an error like:

{ "detail": "Version is not specified in the config URN: meta/llama-3.2-1b-instruct" }

You can find the correct config URN (with version) by inspecting the output of the /customization/configs endpoint. Use the name and version fields to construct the URN as name@version.
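
For example, with the Python SDK (a minimal sketch; it assumes the endpoint is exposed as client.customization.configs.list() and that entries carry the name and version fields described above):

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(base_url=os.environ["CUSTOMIZER_BASE_URL"])

# List available customization configs and build versioned URNs as name@version
configs = client.customization.configs.list()
for cfg in configs.data:  # assumes a paginated response with a `data` list
    print(f"{cfg.name}@{cfg.version}")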

For this tutorial, you must create two LoRA customization jobs: one with sequence_packing_enabled set to true and another with the same field set to false.

Tip

For enhanced visualization, it's recommended to provide a WandB API key in the wandb-api-key HTTP header. Remove the wandb-api-key header from the request if WANDB_API_KEY is not set.

  1. Create the customization job with sequence packing enabled:

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    client = NeMoMicroservices(
        base_url=f"${CUSTOMIZER_BASE_URL}"
    )
    
    # Optional: Set up WandB API key for enhanced visualization
    extra_headers = {}
    if os.getenv('WANDB_API_KEY'):
        extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')
    
    # Create a customization job with sequence packing enabled
    job_packed = client.customization.jobs.create(
        config="meta/llama-3.2-1b-instruct@v1.0.0+A100",
        dataset={
            "namespace": "<namespace>",
            "name": "test-dataset"
        },
        hyperparameters={
            "sequence_packing_enabled": True,
            "training_type": "sft",
            "finetuning_type": "lora",
            "epochs": 10,
            "batch_size": 16,
            "learning_rate": 0.00001,
            "lora": {
                "adapter_dim": 16
            }
        },
        extra_headers=extra_headers
    )
    
    print(f"Created job with sequence packing enabled:")
    print(f"  Job ID: {job_packed.id}")
    print(f"  Status: {job_packed.status}")
    print(f"  Output model: {job_packed.output_model}")
    
    # Save the job ID for monitoring
    packed_job_id = job_packed.id
    
    Alternatively, use curl:

    export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
    
    curl --location \
    "https://${CUSTOMIZER_BASE_URL}/customization/jobs" \
    --header "wandb-api-key: ${WANDB_API_KEY}" \
    --header 'Accept: application/json' \
    --header 'Content-Type: application/json' \
    --data '{
        "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
        "dataset": {
            "namespace": "<namespace>",
            "name": "test-dataset"
        },
        "hyperparameters": {
            "sequence_packing_enabled": true,
            "training_type": "sft",
            "finetuning_type": "lora",
            "epochs": 10,
            "batch_size": 16,
            "learning_rate": 0.00001,
            "lora": {
                "adapter_dim": 16
            }
        }
    }' | jq
    
  2. Note the customization_id. It will be needed later.

  3. Create another customization job with sequence packing disabled:

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    client = NeMoMicroservices(
        base_url=f"${CUSTOMIZER_BASE_URL}"
    )
    
    # Optional: Set up WandB API key for enhanced visualization
    extra_headers = {}
    if os.getenv('WANDB_API_KEY'):
        extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')
    
    # Create a customization job with sequence packing disabled
    job_unpacked = client.customization.jobs.create(
        config="meta/llama-3.2-1b-instruct@v1.0.0+A100",
        dataset={
            "namespace": "<namespace>",
            "name": "test-dataset"
        },
        hyperparameters={
            "sequence_packing_enabled": False,
            "training_type": "sft",
            "finetuning_type": "lora",
            "epochs": 10,
            "batch_size": 16,
            "learning_rate": 0.00001,
            "lora": {
                "adapter_dim": 16
            }
        },
        extra_headers=extra_headers
    )
    
    print(f"Created job with sequence packing disabled:")
    print(f"  Job ID: {job_unpacked.id}")
    print(f"  Status: {job_unpacked.status}")
    print(f"  Output model: {job_unpacked.output_model}")
    
    # Save the job ID for monitoring
    unpacked_job_id = job_unpacked.id
    
    Alternatively, use curl:

    export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
    
    curl --location \
    "https://${CUSTOMIZER_BASE_URL}/customization/jobs" \
    --header "wandb-api-key: ${WANDB_API_KEY}" \
    --header 'Accept: application/json' \
    --header 'Content-Type: application/json' \
    --data '{
        "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
        "dataset": {
            "namespace": "<namespace>",
            "name": "test-dataset"
        },
        "hyperparameters": {
            "sequence_packing_enabled": false,
            "training_type": "sft",
            "finetuning_type": "lora",
            "epochs": 10,
            "batch_size": 16,
            "learning_rate": 0.00001,
            "lora": {
                "adapter_dim": 16
            }
        }
    }' | jq
    
  4. Note the customization_id. It will be needed later.

Monitor LoRA Customization Jobs#

Use the customization_id from each job to make a GET request for status details.

from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ["CUSTOMIZER_BASE_URL"]
)

# Monitor both jobs
def monitor_jobs(packed_job_id, unpacked_job_id):
    # Get status for sequence packing enabled job
    packed_status = client.customization.jobs.status(packed_job_id)
    print(f"Sequence Packing Enabled Job ({packed_job_id}):")
    print(f"  Status: {packed_status.status}")
    print(f"  Progress: {packed_status.status_details.percentage_done}%")
    print(f"  Epochs completed: {packed_status.status_details.epochs_completed}")
    
    # Get status for sequence packing disabled job
    unpacked_status = client.customization.jobs.status(unpacked_job_id)
    print(f"\nSequence Packing Disabled Job ({unpacked_job_id}):")
    print(f"  Status: {unpacked_status.status}")
    print(f"  Progress: {unpacked_status.status_details.percentage_done}%")
    print(f"  Epochs completed: {unpacked_status.status_details.epochs_completed}")
    
    # Compare training metrics if available
    if (packed_status.status_details.metrics and 
        unpacked_status.status_details.metrics):
        
        packed_metrics = packed_status.status_details.metrics.metrics
        unpacked_metrics = unpacked_status.status_details.metrics.metrics
        
        print(f"\nTraining Metrics Comparison:")
        if packed_metrics.get("val_loss") and unpacked_metrics.get("val_loss"):
            packed_val_loss = packed_metrics["val_loss"][-1] if packed_metrics["val_loss"] else None
            unpacked_val_loss = unpacked_metrics["val_loss"][-1] if unpacked_metrics["val_loss"] else None
            print(f"  Packed Val Loss: {packed_val_loss}")
            print(f"  Unpacked Val Loss: {unpacked_val_loss}")

# Usage example (replace with your actual job IDs)
monitor_jobs("your-packed-job-id", "your-unpacked-job-id")

Alternatively, use curl:

curl "${CUSTOMIZER_BASE_URL}/customization/jobs/${customizationID}/status" | jq

The response includes timestamped training and validation loss values. The expected validation loss for both jobs should be similar.
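
Rather than checking by hand, you can poll each job until it finishes. A minimal sketch, reusing the client and the job IDs saved earlier; the terminal status names are assumptions and may differ in your deployment:

import time

# Assumed terminal status values; adjust to match what your deployment returns
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def wait_for_job(job_id, poll_interval=30):
    """Poll a customization job until it reaches a terminal state."""
    while True:
        status = client.customization.jobs.status(job_id)
        print(f"Job {job_id}: {status.status} "
              f"({status.status_details.percentage_done}% done)")
        if status.status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_interval)

packed_final = wait_for_job(packed_job_id)
unpacked_final = wait_for_job(unpacked_job_id)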

View Jobs in Weights & Biases#

To enable W&B integration, include your WandB API key when creating a customization job in the call header. Then view your results at wandb.ai under the nvidia-nemo-customizer project.

Validation Loss Curves#

The validation loss curves should match closely for both jobs.

(W&B chart: val_loss)

The sequence-packed job should complete significantly faster.

(W&B chart: runtime)
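
If you are not using W&B, you can approximate each job's runtime from its metadata. A rough sketch, assuming the job object exposes created_at and updated_at datetime fields and a jobs.retrieve method (both assumptions; verify against the API reference):

# Assumes created_at/updated_at datetime fields on the job object; once a job
# reaches a terminal state, updated_at approximates its finish time
def approximate_runtime(job_id):
    job = client.customization.jobs.retrieve(job_id)
    return job.updated_at - job.created_at

print(f"Packed runtime:   {approximate_runtime(packed_job_id)}")
print(f"Unpacked runtime: {approximate_runtime(unpacked_job_id)}")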

GPU Utilization#

The sequence-packed job should show higher GPU utilization.

(W&B chart: GPU utilization)

GPU Memory Allocation#

The sequence-packed job should show higher GPU memory allocation.

(W&B chart: GPU memory)

Sequence Packing Statistics#

Sequence packing statistics can be found under the run config.

(W&B chart: sequence packing stats)
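
You can also read the run config programmatically with the W&B public API. A short sketch, where <entity> and <run-id> are placeholders you must fill in; the nvidia-nemo-customizer project name comes from the section above:

import wandb

# Requires WANDB_API_KEY in the environment
api = wandb.Api()
run = api.run("<entity>/nvidia-nemo-customizer/<run-id>")

# Print config entries related to sequence packing
for key, value in run.config.items():
    if "pack" in key.lower():
        print(key, value)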

Note

The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.

Next Steps#

Now that you have created an optimized customization job, you can evaluate the output using NeMo Microservices Evaluator.