Start an Embedding Model Customization Job#

Learn how to use the NeMo Microservices Platform to create a lora_merged customization job for embedding models using a custom dataset. In this tutorial, we’ll fine-tune the NVIDIA Llama 3.2 NV EmbedQA 1B model to improve its performance on question-answering and retrieval tasks.

Embedding models are specialized for semantic search, document retrieval, question-answering systems, and RAG (Retrieval-Augmented Generation) pipelines. The lora_merged approach combines the efficiency of LoRA training with the deployment simplicity of full-weight models.

Note

The time to complete this tutorial is approximately 60 minutes. In this tutorial, you run a customization job and deploy the resulting model. Job duration increases with the dataset size and number of epochs.

Prerequisites#

Platform Prerequisites#

New to using NeMo microservices?

NeMo microservices use an entity management system to organize all resources—including datasets, models, and job artifacts—into namespaces and projects. Without setting up these organizational entities first, you cannot use the microservices.

If you’re new to the platform, complete these foundational tutorials first:

  1. Get Started Tutorials: Learn how to deploy, customize, and evaluate models using the platform end-to-end

  2. Set Up Organizational Entities: Learn how to create namespaces and projects to organize your work

If you’re already familiar with namespaces, projects, and how to upload datasets to the platform, you can proceed directly with this tutorial.

Learn more: Entity Concepts

NeMo Customizer Prerequisites#

Microservice Setup Requirements and Environment Variables

Before starting, make sure you have:

  • Access to NeMo Customizer

  • The huggingface_hub Python package installed

  • (Optional) Weights & Biases account and API key for enhanced visualization

Set up environment variables:

# Set up environment variables
export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"
export ENTITY_HOST="<your-entity-store-url>"
export DS_HOST="<your-datastore-url>"
export NIM_PROXY_BASE_URL="<your-nim-proxy-url>"  # Used for inference after deployment
export NAMESPACE="default"
export DATASET_NAME="test-dataset"

# Hugging Face environment variables (for dataset/model file management)
export HF_ENDPOINT="${DS_HOST}/v1/hf"
export HF_TOKEN="dummy-unused-value"  # Or your actual HF token

# Optional monitoring
export WANDB_API_KEY="<your-wandb-api-key>"

Replace the placeholder values with your actual service URLs and credentials.

Tutorial-Specific Prerequisites#

Enable Embedding Model Target#

The embedding model target is enabled by default. If it has been disabled in your deployment, contact your administrator to enable the nvidia/llama-3.2-nv-embedqa-1b@v2 target and add the lora_merged training option to the configuration, as shown below.

Tip

For guidance on requesting access to disabled configurations, including example request templates, refer to the Understanding NeMo Customizer Configurations and Models tutorial.

# Enable the embedding model target
nvidia/llama-3.2-nv-embedqa-1b@v2:
  enabled: true

# Add lora_merged training option to the configuration if not included in your deployment.
nvidia/llama-3.2-nv-embedqa-1b@v2+A100:
  training_options:
    - training_type: sft
      finetuning_type: lora_merged
      num_gpus: 1
      num_nodes: 1
      tensor_parallel_size: 1
      micro_batch_size: 1

Select Model#

Find Available Embedding Configs#

First, verify that the embedding model configuration is available and that it supports lora_merged training.

Note

The /customization/configs endpoint returns only enabled configurations by default. If you don’t see the embedding model configuration in the results, verify that your administrator has enabled the nvidia/llama-3.2-nv-embedqa-1b@v2 target as described in the Enable Embedding Model Target section above.

  1. Get embedding model configurations:

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    client = NeMoMicroservices(
        base_url=os.environ['CUSTOMIZER_BASE_URL']
    )
    
    # Find embedding model configurations that support lora_merged
    configs = client.customization.configs.list(
        filter={"finetuning_type": "lora_merged"}
    )
    
    print(f"Found {len(configs.data)} embedding model configuration(s) with lora_merged support")
    
    if len(configs.data) == 0:
        print("\n⚠️  No embedding configurations found.")
        print("This typically means the embedding model target is not enabled.")
        print("Contact your administrator to enable the 'nvidia/llama-3.2-nv-embedqa-1b@v2' target.")
    else:
        for config in configs.data:
            print(f"\nConfig: {config.name}")
            print(f"  Training options: {len(config.training_options)}")
            for option in config.training_options:
                print(f"    - {option.training_type}/{option.finetuning_type}: {option.num_gpus} GPUs")
                if option.finetuning_type == "lora_merged":
                    print(f"      ✓ Supports LoRA merged training")
    
    Alternatively, query the endpoint directly with curl:

    # Note: This endpoint returns only enabled configurations by default.
    # If you don't see results, the embedding model target may not be enabled.
    curl -X GET "${CUSTOMIZER_BASE_URL}/v1/customization/configs?filter%5Bfinetuning_type%5D=lora_merged" \
      -H 'Accept: application/json' | jq
    
  2. Review the response to confirm the embedding model is available:

    Example Response
    {
      "object": "list",
      "data": [
        {
          "name": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
          "namespace": "default",
          "dataset_schemas": [
            {
              "title": "Newline-Delimited JSON File",
              "type": "array",
              "items": {
                "description": "Schema for Supervised Fine-Tuning (SFT) training data items.",
                "properties": {
                  "prompt": {
                    "description": "The prompt for the entry",
                    "title": "Prompt",
                    "type": "string"
                  },
                  "completion": {
                    "description": "The completion to train on",
                    "title": "Completion",
                    "type": "string"
                  }
                },
                "required": ["prompt", "completion"],
                "title": "SFTDatasetItemSchema",
                "type": "object"
              }
            }
          ],
          "training_options": [
            {
              "training_type": "sft",
              "finetuning_type": "lora",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            },
            {
              "training_type": "sft",
              "finetuning_type": "all_weights",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            }
          ]
        },
        {
          "name": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
          "namespace": "nvidia",
          "dataset_schemas": [
            {
              "title": "Newline-Delimited JSON File",
              "type": "array",
              "items": {
                "description": "Schema for embedding training data items.",
                "properties": {
                  "query": {
                    "description": "The query to use as an anchor",
                    "title": "Query",
                    "type": "string"
                  },
                  "pos_doc": {
                    "description": "A document that should match positively with the anchor",
                    "title": "Positive Document",
                    "type": "string"
                  },
                  "neg_doc": {
                    "description": "Documents that should not match with the anchor",
                    "title": "Negative Documents",
                    "type": "array",
                    "items": {"type": "string"}
                  }
                },
                "required": ["query", "pos_doc", "neg_doc"],
                "title": "EmbeddingDatasetItemSchema",
                "type": "object"
              }
            }
          ],
          "training_options": [
            {
              "training_type": "sft",
              "finetuning_type": "lora_merged",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            }
          ]
        }
      ]
    }
    

The response should show the nvidia/llama-3.2-nv-embedqa-1b@v2+A100 configuration with lora_merged training support.

Review Dataset Schema#

Embedding models require a triplet dataset format for contrastive learning, which differs from the prompt/completion format used by text generation models.
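
Each line of the training file is a JSON object containing a query string, a pos_doc string, and a list of neg_doc strings, matching the EmbeddingDatasetItemSchema returned by the configs endpoint above. For example:

{"query": "3D trajectory recovery for tracking multiple objects", "pos_doc": "Recursive Estimation of Motion, Structure, and Focal Length", "neg_doc": ["Characterization of 1.2 kV SiC super-junction SBD implemented by trench technique"]}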


Create Datasets#

Prepare Embedding Dataset#

Embedding models require training data in a triplet format with query, positive document, and negative documents for contrastive learning.

  1. Create embedding training data files:

    # Example embedding dataset format (triplet structure)
    import json
    
    # Create example embedding training data
    embedding_training_data = [
        {
            "query": "3D trajectory recovery for tracking multiple objects",
            "pos_doc": "Recursive Estimation of Motion, Structure, and Focal Length",
            "neg_doc": ["Characterization of 1.2 kV SiC super-junction SBD implemented by trench technique"]
        },
        {
            "query": "machine learning algorithms for natural language processing",
            "pos_doc": "Deep learning approaches to text classification and sentiment analysis",
            "neg_doc": ["Quantum computing applications in cryptography", "Solar panel efficiency optimization methods"]
        },
        {
            "query": "computer vision techniques for object detection",
            "pos_doc": "YOLO and R-CNN architectures for real-time object recognition",
            "neg_doc": ["Database indexing strategies for large datasets", "Network security protocols and implementations"]
        }
    ]
    
    # Save to JSONL format
    with open("embedding_train.jsonl", "w") as f:
        for item in embedding_training_data:
            f.write(json.dumps(item) + "\n")
    
    print("Created embedding training dataset with triplet format")
    
  2. Create validation data using the same format with different examples:

    # Create validation dataset using the same format
    embedding_validation_data = [
        {
            "query": "neural network architectures for image classification",
            "pos_doc": "Convolutional neural networks for computer vision tasks",
            "neg_doc": ["Database management system optimization", "Compiler design principles"]
        },
        {
            "query": "distributed systems and scalability",
            "pos_doc": "Microservices architecture patterns for cloud deployment",
            "neg_doc": ["Mobile app UI design patterns", "Photography composition techniques"]
        }
    ]
    
    # Save validation data to JSONL format
    with open("embedding_validation.jsonl", "w") as f:
        for item in embedding_validation_data:
            f.write(json.dumps(item) + "\n")
    
    print("Created embedding validation dataset")
    

Upload Training Data#

The complete_dataset_upload function handles the complete workflow (a sketch of such a helper appears after the note below):

  • Creates namespaces in both Entity Store and Datastore

  • Registers the dataset in Entity Store

  • Creates the dataset repository in Datastore

  • Uploads training and validation files

# Upload embedding dataset with correct filenames
# This function automatically:
#   1. Creates namespaces in Entity Store and Datastore
#   2. Registers the dataset in Entity Store
#   3. Creates dataset repository in Datastore
#   4. Uploads training and validation files
repo_id = complete_dataset_upload(
    entity_host=os.environ['ENTITY_HOST'],
    ds_host=os.environ['DS_HOST'],
    namespace="default",
    dataset_name="embedding-dataset",
    training_file="embedding_train.jsonl",
    validation_file="embedding_validation.jsonl",
    description="Embedding model training dataset"
)
print(f"✅ Dataset uploaded successfully: {repo_id}")
print(f"✅ Dataset registered in Entity Store as: default/embedding-dataset")

Note

Entity Store Registration: This function automatically registers the dataset in Entity Store using entity_client.datasets.create(). You don’t need to register the dataset separately—both Entity Store and Datastore registration are handled in a single workflow.
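
If your environment doesn’t already define complete_dataset_upload, the following is a minimal sketch of what such a helper might look like. It assumes the huggingface_hub package from the prerequisites and uses the Datastore’s Hugging Face-compatible API; the Entity Store SDK calls (namespaces.create, datasets.create) and the training/validation file paths are assumptions that may differ in your deployment:

import os

from huggingface_hub import HfApi
from nemo_microservices import NeMoMicroservices

def complete_dataset_upload(entity_host, ds_host, namespace, dataset_name,
                            training_file, validation_file, description=""):
    """Register a dataset in Entity Store and upload its files to Datastore."""
    repo_id = f"{namespace}/{dataset_name}"

    # Create the namespace in Entity Store (assumed SDK call)
    entity_client = NeMoMicroservices(base_url=entity_host)
    try:
        entity_client.namespaces.create(id=namespace)
    except Exception:
        pass  # Namespace may already exist

    # Create the dataset repository in Datastore via its HF-compatible API;
    # this also creates the Datastore namespace if needed (assumption)
    hf_api = HfApi(endpoint=f"{ds_host}/v1/hf", token=os.environ.get("HF_TOKEN"))
    hf_api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

    # Upload training and validation files (assumed path layout)
    hf_api.upload_file(path_or_fileobj=training_file,
                       path_in_repo="training/training_file.jsonl",
                       repo_id=repo_id, repo_type="dataset")
    hf_api.upload_file(path_or_fileobj=validation_file,
                       path_in_repo="validation/validation_file.jsonl",
                       repo_id=repo_id, repo_type="dataset")

    # Register the dataset in Entity Store, pointing at the Datastore repository
    entity_client.datasets.create(
        name=dataset_name,
        namespace=namespace,
        description=description,
        files_url=f"hf://datasets/{repo_id}",
    )
    return repo_id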

Checkpoint

At this point, we’ve:

  • ✅ Created namespaces in Entity Store and Datastore

  • ✅ Registered the dataset in Entity Store

  • ✅ Uploaded training and validation files to Datastore

  • ✅ Prepared everything needed to create the customization job


Start Model Customization Job#

Important

The config field must include a version, for example: nvidia/llama-3.2-nv-embedqa-1b@v2+A100. Omitting the version will result in an error.

You can find the correct config URN (with version) by inspecting the output of the /customization/configs endpoint.
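
For example, if you kept the configs object from the listing step earlier, you can extract the versioned URN programmatically (the "embedqa" substring check is just an illustrative way to pick out the embedding config):

# Pull the versioned config URN from the earlier /customization/configs listing
embedding_configs = [c for c in configs.data if "embedqa" in c.name]
if embedding_configs:
    config_urn = embedding_configs[0].name  # e.g., "nvidia/llama-3.2-nv-embedqa-1b@v2+A100"
    print(f"Using config: {config_urn}")
else:
    print("Embedding config not found; check that the target is enabled.")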

Set Hyperparameters#

For embedding model training with lora_merged, we use specific hyperparameters optimized for contrastive learning:

# Hyperparameters for lora_merged embedding training
hyperparameters = {
    "training_type": "sft",
    "finetuning_type": "lora_merged",  # Key difference for embedding models
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 5e-5,
    "lora": {
        "adapter_dim": 16,
        "alpha": 32,
        "adapter_dropout": 0.01
    }
}

print("Hyperparameters for lora_merged training:")
print(json.dumps(hyperparameters, indent=2))

Create and Submit Training Job#

  1. Create the embedding model customization job:

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    client = NeMoMicroservices(
        base_url=os.environ['CUSTOMIZER_BASE_URL']
    )
    
    # Set up WandB API key for enhanced visualization
    extra_headers = {}
    if os.getenv('WANDB_API_KEY'):
        extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')
    
    # Create embedding model customization job
    job = client.customization.jobs.create(
        config="nvidia/llama-3.2-nv-embedqa-1b@v2+A100",  # Embedding model config
        dataset={
            "namespace": "default",
            "name": "embedding-dataset"
        },
        hyperparameters={
            "training_type": "sft",
            "finetuning_type": "lora_merged",
            "epochs": 3,
            "batch_size": 8,
            "learning_rate": 5e-5,
            "lora": {
                "adapter_dim": 16,
                "alpha": 32,
                "adapter_dropout": 0.01
            }
        },
        output_model="default/my-embedding-model@v1",
        extra_headers=extra_headers
    )
    
    print(f"Created embedding job:")
    print(f"  Job ID: {job.id}")
    print(f"  Status: {job.status}")
    print(f"  Output model: {job.output_model}")
    
    Alternatively, with curl:

    # Create embedding model customization job
    curl -X "POST" \
      "${CUSTOMIZER_BASE_URL}/v1/customization/jobs" \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -H "wandb-api-key: ${WANDB_API_KEY}" \
      -d '{
        "config": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
        "dataset": {"namespace": "default", "name": "embedding-dataset"},
        "hyperparameters": {
          "training_type": "sft",
          "finetuning_type": "lora_merged",
          "epochs": 3,
          "batch_size": 8,
          "learning_rate": 5e-5,
          "lora": {
            "adapter_dim": 16,
            "alpha": 32,
            "adapter_dropout": 0.01
          }
        },
        "output_model": "default/my-embedding-model@v1"
      }' | jq
    
  2. Review the response:

    Example Response
    {
      "id": "cust-Pi95UoDbNcqwgkruAB8LY6",
      "created_at": "2025-02-19T20:10:06.278132",
      "updated_at": "2025-02-19T20:10:06.278133",
      "namespace": "default",
      "config": {
        "schema_version": "1.0",
        "id": "58bee815-0473-45d7-a5e6-fc088f6142eb",
        "namespace": "nvidia",
        "created_at": "2025-02-19T20:10:06.454149",
        "updated_at": "2025-02-19T20:10:06.454160",
        "custom_fields": {},
        "name": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
        "base_model": "nvidia/llama-3.2-nv-embedqa-1b",
        "model_path": "llama-3_2-nv-embedqa-1b",
        "training_types": ["sft"],
        "finetuning_types": ["lora_merged"],
        "precision": "bf16",
        "num_gpus": 1,
        "num_nodes": 1,
        "micro_batch_size": 1,
        "tensor_parallel_size": 1,
        "max_seq_length": 4096
      },
      "dataset": { "namespace": "default", "name": "embedding-dataset" },
      "hyperparameters": {
        "finetuning_type": "lora_merged",
        "training_type": "sft",
        "batch_size": 8,
        "epochs": 3,
        "learning_rate": 5e-05,
        "lora": {
          "adapter_dim": 16,
          "alpha": 32,
          "adapter_dropout": 0.01
        }
      },
      "output_model": "default/my-embedding-model@v1",
      "status": "created",
      "custom_fields": {}
    }
    
  3. Copy the following values from the response:

    • id (Job ID)

    • output_model
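
If you’re using the Python SDK, you can capture these values directly from the job object; the monitoring and deployment snippets below assume job_id and output_model_urn are set:

# Keep the identifiers needed for monitoring and deployment
job_id = job.id
output_model_urn = job.output_model
print(f"job_id: {job_id}")
print(f"output_model_urn: {output_model_urn}")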

Monitor Job Status#

Monitor the training progress using the job ID:

import time

def monitor_job_progress(client, job_id, check_interval=60):
    """Monitor job progress with comprehensive status handling"""
    while True:
        job_details = client.customization.jobs.retrieve(job_id=job_id)
        
        print(f"Job status: {job_details.status}")
        
        if job_details.status == "completed":
            print("Training completed successfully!")
            break
        elif job_details.status in ["failed", "cancelled"]:
            print(f"Training finished with status: {job_details.status}")
            if job_details.status == "failed":
                print("Check the job logs for error details.")
            break
        elif job_details.status in ["created", "pending"]:
            print("Job is queued and waiting to start...")
        elif job_details.status == "running":
            print("Training is in progress...")
            # Show progress if available
            if hasattr(job_details, 'status_details') and job_details.status_details:
                if hasattr(job_details.status_details, 'percentage_done'):
                    print(f"Progress: {job_details.status_details.percentage_done:.1f}%")
        elif job_details.status == "cancelling":
            print("Job is being cancelled...")
        elif job_details.status in ["ready", "unknown"]:
            print(f"Job finished with status: {job_details.status}")
            break
        else:
            print(f"Unknown status: {job_details.status}")
        
        time.sleep(check_interval)
    
    return job_details

# Usage: final_job_status = monitor_job_progress(client, job_id)

Deploy the Model#

Once the job finishes, the lora_merged model behaves like a full-weight model and requires deployment through the Deployment Management Service.

Important

Unlike regular LoRA adapters, lora_merged models require a dedicated NIM deployment similar to full SFT models. The LoRA weights are merged into the base model during training, creating a complete model artifact.

Deploy Using Deployment Management Service#

import os
import time

from openai import OpenAI

# Assuming output_model_urn is available from the completed job
# output_model_urn = "default/my-embedding-model@v1" # Replace with actual output model URN

# Create deployment configuration
deployment_config = client.deployment.configs.create(
    name="embedding-deploy-config",
    namespace="default",
    description="Configuration for fine-tuned embedding model deployment",
    model=output_model_urn,
    nim_deployment={
        "image_name": "nvcr.io/nim/nvidia/nemo-embedder", # Specific NIM image for embedding models
        "image_tag": "1.0.0", # Use appropriate NIM version
        "gpu": 1 # Adjust based on model size and config
    }
)
print(f"Created deployment config: {deployment_config.name}")

# Deploy the fine-tuned model using the configuration
deployment = client.deployment.model_deployments.create(
    name="embedding-model-deployment",
    namespace="default",
    config="default/embedding-deploy-config"
)
print(f"Created deployment: {deployment.name}")
print(f"Status: {deployment.status_details.status}")

# Monitor deployment status with polling
def wait_for_deployment(client, deployment_name, namespace="default", timeout=1200):
    start_time = time.time()
    while True:
        if time.time() - start_time > timeout:
            raise RuntimeError(f"Deployment timeout after {timeout} seconds")
        deployment_status = client.deployment.model_deployments.retrieve(
            deployment_name, namespace=namespace
        )
        status = deployment_status.status_details.status
        elapsed = time.time() - start_time
        print(f"Deployment status: {status} after {elapsed:.1f}s")
        if status == "ready":
            print("✅ Deployment completed successfully!")
            break
        elif status in ["failed", "cancelled"]:
            raise RuntimeError(f"Deployment {status}")
        time.sleep(10)
    return deployment_status

final_deployment_status = wait_for_deployment(client, "embedding-model-deployment", namespace="default")
print(f"Model deployed as: {final_deployment_status.models}")

# Test using OpenAI-compatible client
openai_client = OpenAI(
    base_url=f"{NIM_PROXY_BASE_URL}/v1",
    api_key="not-used"
)

# Example inference for embedding model
response = openai_client.embeddings.create(
    model=output_model_urn,
    input=["This is a test query.", "Another sentence to embed."]
)
print("✅ Model inference successful!")
print(f"Response: {response.data[0].embedding[:5]}...") # Print first 5 dimensions

Verify Embeddings#

After deployment, verify that your fine-tuned embedding model is working correctly:

Basic Embedding Generation Test#

# Simple embedding verification using the SDK
import os

from nemo_microservices import NeMoMicroservices

# Initialize client for inference
inference_client = NeMoMicroservices(
    inference_base_url=os.environ['NIM_PROXY_BASE_URL']
)

# Test embedding generation
test_texts = [
    "machine learning algorithms",
    "deep learning neural networks",
    "quantum computing applications"
]

print("Testing embedding generation:")
for text in test_texts:
    try:
        response = inference_client.embeddings.create(
            input=text,
            model="default/my-embedding-model@v1"
        )
        embedding = response.data[0].embedding
        print(f"✅ Generated embedding for '{text}': {len(embedding)} dimensions")
    except Exception as e:
        print(f"❌ Error generating embedding for '{text}': {e}")

Comprehensive Embedding Verification#

For thorough verification, test the embedding quality using cosine similarity:

import json

import numpy as np

def comprehensive_embedding_verification(client, model_name, test_dataset_path):
    """Comprehensive embedding model verification"""

    print(f"🔍 Verifying embedding model: {model_name}")
    print("=" * 60)

    # Load test dataset
    test_data = []
    with open(test_dataset_path, "r") as f:
        for line in f:
            test_data.append(json.loads(line))

    print(f"📊 Loaded {len(test_data)} test examples")

    # Test 1: Basic embedding generation
    print("\n1️⃣ Testing basic embedding generation...")
    try:
        sample_text = test_data[0]["query"]
        response = client.embeddings.create(input=sample_text, model=model_name)
        embedding = response.data[0].embedding
        print(f"✅ Successfully generated {len(embedding)}-dimensional embedding")
    except Exception as e:
        print(f"❌ Failed to generate embeddings: {e}")
        return False

    # Test 2: Cosine similarity validation
    print("\n2️⃣ Testing cosine similarity patterns...")
    results = test_embedding_quality(client, model_name, test_data[:5])  # Helper sketched below; tests the first 5 examples

    # Test 3: Performance metrics
    print("\n3️⃣ Performance Summary:")
    differences = [r["difference"] for r in results]
    positive_count = sum(1 for d in differences if d > 0)

    print(f"  Positive differences: {positive_count}/{len(differences)}")
    print(f"  Average difference: {np.mean(differences):.4f}")
    print(f"  Min difference: {np.min(differences):.4f}")
    print(f"  Max difference: {np.max(differences):.4f}")

    # Success criteria
    success_rate = positive_count / len(differences)
    avg_diff = np.mean(differences)

    if success_rate >= 0.8 and avg_diff > 0.1:
        print(f"\n✅ Embedding model verification PASSED")
        print(f"   Success rate: {success_rate:.2%}")
        print(f"   Average improvement: {avg_diff:.4f}")
        return True
    else:
        print(f"\n⚠️ Embedding model verification needs attention")
        print(f"   Success rate: {success_rate:.2%} (target: ≥80%)")
        print(f"   Average improvement: {avg_diff:.4f} (target: >0.1)")
        return False

# Usage:
# success = comprehensive_embedding_verification(
#     client=inference_client,
#     model_name="default/my-embedding-model@v1",
#     test_dataset_path="embedding_test.jsonl"
# )
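
The test_embedding_quality helper called in step 2 isn’t shown above. A minimal sketch follows, assuming an OpenAI-compatible embeddings endpoint (as in the deployment test earlier) and the triplet format from this tutorial; it prints the per-query similarities shown in the example output below and returns the per-example differences:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def test_embedding_quality(client, model_name, test_data):
    """Check that each query embeds closer to its pos_doc than to its neg_docs."""
    results = []
    for item in test_data:
        # Embed the query, the positive document, and all negatives in one call
        texts = [item["query"], item["pos_doc"]] + item["neg_doc"]
        response = client.embeddings.create(input=texts, model=model_name)
        vectors = [d.embedding for d in response.data]

        query_vec, pos_vec, neg_vecs = vectors[0], vectors[1], vectors[2:]
        pos_sim = cosine_similarity(query_vec, pos_vec)
        avg_neg_sim = float(np.mean([cosine_similarity(query_vec, v) for v in neg_vecs]))

        print(f"Query: {item['query'][:60]}...")
        print(f"  Positive similarity: {pos_sim:.4f}")
        print(f"  Average negative similarity: {avg_neg_sim:.4f}")
        print(f"  Difference: {pos_sim - avg_neg_sim:.4f}\n")

        results.append({
            "query": item["query"],
            "positive_similarity": pos_sim,
            "average_negative_similarity": avg_neg_sim,
            "difference": pos_sim - avg_neg_sim,
        })
    return results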

Example Verification Output:

🔍 Verifying embedding model: default/my-embedding-model@v1
============================================================
📊 Loaded 5 test examples

1️⃣ Testing basic embedding generation...
✅ Successfully generated 1024-dimensional embedding

2️⃣ Testing cosine similarity patterns...
Query: machine learning algorithms for natural language processing...
  Positive similarity: 0.8234
  Average negative similarity: 0.3456
  Difference: 0.4778

Query: computer vision techniques for object detection...
  Positive similarity: 0.7891
  Average negative similarity: 0.2987
  Difference: 0.4904

Query: deep learning neural network architectures...
  Positive similarity: 0.8567
  Average negative similarity: 0.3123
  Difference: 0.5444

Query: natural language processing transformers...
  Positive similarity: 0.8012
  Average negative similarity: 0.2765
  Difference: 0.5247

Query: reinforcement learning algorithms...
  Positive similarity: 0.7734
  Average negative similarity: 0.3298
  Difference: 0.4436

3️⃣ Performance Summary:
  Positive differences: 5/5
  Average difference: 0.4962
  Min difference: 0.4436
  Max difference: 0.5444

✅ Embedding model verification PASSED
   Success rate: 100%
   Average improvement: 0.4962

Conclusion#

You have successfully fine-tuned an embedding model using the lora_merged approach and deployed it for inference. The fine-tuned model should show improved performance on your specific domain compared to the base model.

Key achievements:

  • ✅ Fine-tuned embedding model with domain-specific data

  • ✅ Deployed model using Deployment Management Service

  • ✅ Verified embedding quality with cosine similarity testing

  • ✅ Model ready for production use in RAG pipelines

If you included a WandB API key, you can view your training results at wandb.ai under the nvidia-nemo-customizer project.

Note

The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.

Next Steps#

  • Learn how to check customization job metrics using the job ID

  • Integrate your fine-tuned embedding model into RAG applications

  • Compare performance against the base model using your evaluation datasets

  • Consider fine-tuning additional embedding models for different domains