Start an Embedding Model Customization Job#

Learn how to use the NeMo Microservices Platform to create a lora_merged customization job for embedding models using a custom dataset. In this tutorial, we’ll fine-tune the NVIDIA Llama 3.2 NV EmbedQA 1B model to improve its performance on question-answering and retrieval tasks.

Embedding models are specialized for semantic search, document retrieval, question-answering systems, and RAG (Retrieval-Augmented Generation) pipelines. The lora_merged approach combines the efficiency of LoRA training with the deployment simplicity of full-weight models.

Note

The time to complete this tutorial is approximately 60 minutes. In this tutorial, you run a customization job and deploy the resulting model. Job duration increases with the dataset size and number of epochs.

Prerequisites#

Platform Prerequisites#

New to using NeMo microservices?

NeMo microservices use an entity management system to organize all resources—including datasets, models, and job artifacts—into namespaces and projects. Without setting up these organizational entities first, you cannot use the microservices.

If you’re new to the platform, complete these foundational tutorials first:

  1. Get Started Tutorials: Learn how to deploy, customize, and evaluate models using the platform end-to-end

  2. Set Up Organizational Entities: Learn how to create namespaces and projects to organize your work

If you’re already familiar with namespaces, projects, and how to upload datasets to the platform, you can proceed directly with this tutorial.

Learn more: Entity Concepts

NeMo Customizer Prerequisites#

Microservice Setup Requirements and Environment Variables

Before starting, make sure you have:

  • Access to NeMo Customizer

  • The huggingface_hub Python package installed

  • (Optional) Weights & Biases account and API key for enhanced visualization

Set up environment variables:

# Set up environment variables
export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"
export ENTITY_HOST="<your-entity-store-url>"
export DS_HOST="<your-datastore-url>"
export NIM_PROXY_BASE_URL="<your-nim-proxy-url>"  # Used for inference after deployment
export NAMESPACE="default"
export DATASET_NAME="test-dataset"

# Hugging Face environment variables (for dataset/model file management)
export HF_ENDPOINT="${DS_HOST}/v1/hf"
export HF_TOKEN="dummy-unused-value"  # Or your actual HF token

# Optional monitoring
export WANDB_API_KEY="<your-wandb-api-key>"

Replace the placeholder values with your actual service URLs and credentials.

Tutorial-Specific Prerequisites#

Enable Embedding Model Target#

The embedding model target is enabled by default. If it has been disabled in your deployment, contact your administrator to enable the nvidia/llama-3.2-nv-embedqa-1b@v2 target and add the lora_merged training option to the configuration, as shown below.

Tip

For guidance on requesting access to disabled configurations, including example request templates, refer to the Understanding NeMo Customizer Configurations and Models tutorial.

# Enable the embedding model target
nvidia/llama-3.2-nv-embedqa-1b@v2:
  enabled: true

# Add lora_merged training option to the configuration if not included in your deployment.
nvidia/llama-3.2-nv-embedqa-1b@v2+A100:
  training_options:
    - training_type: sft
      finetuning_type: lora_merged
      num_gpus: 1
      num_nodes: 1
      tensor_parallel_size: 1
      micro_batch_size: 1

Select Model#

Find Available Embedding Configs#

First, verify that the embedding model configuration is available and that it supports lora_merged training.

Note

The /customization/configs endpoint returns only enabled configurations by default. If you don’t see the embedding model configuration in the results, verify that your administrator has enabled the nvidia/llama-3.2-nv-embedqa-1b@v2 target as described in the Enable Embedding Model Target section above.

  1. Get embedding model configurations:

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    client = NeMoMicroservices(
        base_url=os.environ['CUSTOMIZER_BASE_URL']
    )
    
    # Find embedding model configurations that support lora_merged
    configs = client.customization.configs.list(
        filter={"finetuning_type": "lora_merged"}
    )
    
    print(f"Found {len(configs.data)} embedding model configuration(s) with lora_merged support")
    
    if len(configs.data) == 0:
        print("\n⚠️  No embedding configurations found.")
        print("This typically means the embedding model target is not enabled.")
        print("Contact your administrator to enable the 'nvidia/llama-3.2-nv-embedqa-1b@v2' target.")
    else:
        for config in configs.data:
            print(f"\nConfig: {config.name}")
            print(f"  Training options: {len(config.training_options)}")
            for option in config.training_options:
                print(f"    - {option.training_type}/{option.finetuning_type}: {option.num_gpus} GPUs")
                if option.finetuning_type == "lora_merged":
                    print(f"      ✓ Supports LoRA merged training")
    
    Alternatively, query the endpoint directly with curl:

    # Note: This endpoint returns only enabled configurations by default.
    # If you don't see results, the embedding model target may not be enabled.
    curl -X GET "${CUSTOMIZER_BASE_URL}/v1/customization/configs?filter%5Bfinetuning_type%5D=lora_merged" \
      -H 'Accept: application/json' | jq
    
  2. Review the response to confirm the embedding model is available:

    Example Response
    {
      "object": "list",
      "data": [
        {
          "name": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
          "namespace": "default",
          "dataset_schemas": [
            {
              "title": "Newline-Delimited JSON File",
              "type": "array",
              "items": {
                "description": "Schema for Supervised Fine-Tuning (SFT) training data items.",
                "properties": {
                  "prompt": {
                    "description": "The prompt for the entry",
                    "title": "Prompt",
                    "type": "string"
                  },
                  "completion": {
                    "description": "The completion to train on",
                    "title": "Completion",
                    "type": "string"
                  }
                },
                "required": ["prompt", "completion"],
                "title": "SFTDatasetItemSchema",
                "type": "object"
              }
            }
          ],
          "training_options": [
            {
              "training_type": "sft",
              "finetuning_type": "lora",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            },
            {
              "training_type": "sft",
              "finetuning_type": "all_weights",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            }
          ]
        },
        {
          "name": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
          "namespace": "nvidia",
          "dataset_schemas": [
            {
              "title": "Newline-Delimited JSON File",
              "type": "array",
              "items": {
                "description": "Schema for embedding training data items.",
                "properties": {
                  "query": {
                    "description": "The query to use as an anchor",
                    "title": "Query",
                    "type": "string"
                  },
                  "pos_doc": {
                    "description": "A document that should match positively with the anchor",
                    "title": "Positive Document",
                    "type": "string"
                  },
                  "neg_doc": {
                    "description": "Documents that should not match with the anchor",
                    "title": "Negative Documents",
                    "type": "array",
                    "items": {"type": "string"}
                  }
                },
                "required": ["query", "pos_doc", "neg_doc"],
                "title": "EmbeddingDatasetItemSchema",
                "type": "object"
              }
            }
          ],
          "training_options": [
            {
              "training_type": "sft",
              "finetuning_type": "lora_merged",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            }
          ]
        }
      ]
    }
    

The response should show the nvidia/llama-3.2-nv-embedqa-1b@v2+A100 configuration with lora_merged training support.

Review Dataset Schema#

Embedding models require a triplet dataset format for contrastive learning, which differs from the prompt/completion format used by text generation models.
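
Each line of the training file is a JSON object containing a query string, a pos_doc string, and a list of neg_doc strings, matching the EmbeddingDatasetItemSchema returned by the configs endpoint above. For example:

{"query": "3D trajectory recovery for tracking multiple objects", "pos_doc": "Recursive Estimation of Motion, Structure, and Focal Length", "neg_doc": ["Characterization of 1.2 kV SiC super-junction SBD implemented by trench technique"]}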


Create Datasets#

Prepare Embedding Dataset#

Embedding models require training data in a triplet format with query, positive document, and negative documents for contrastive learning.

  1. Create embedding training data files:

    # Example embedding dataset format (triplet structure)
    import json
    
    # Create example embedding training data
    embedding_training_data = [
        {
            "query": "3D trajectory recovery for tracking multiple objects",
            "pos_doc": "Recursive Estimation of Motion, Structure, and Focal Length",
            "neg_doc": ["Characterization of 1.2 kV SiC super-junction SBD implemented by trench technique"]
        },
        {
            "query": "machine learning algorithms for natural language processing",
            "pos_doc": "Deep learning approaches to text classification and sentiment analysis",
            "neg_doc": ["Quantum computing applications in cryptography", "Solar panel efficiency optimization methods"]
        },
        {
            "query": "computer vision techniques for object detection",
            "pos_doc": "YOLO and R-CNN architectures for real-time object recognition",
            "neg_doc": ["Database indexing strategies for large datasets", "Network security protocols and implementations"]
        }
    ]
    
    # Save to JSONL format
    with open("embedding_train.jsonl", "w") as f:
        for item in embedding_training_data:
            f.write(json.dumps(item) + "\n")
    
    print("Created embedding training dataset with triplet format")
    
  2. Create validation data using the same format with different examples:

    # Create validation dataset using the same format
    embedding_validation_data = [
        {
            "query": "neural network architectures for image classification",
            "pos_doc": "Convolutional neural networks for computer vision tasks",
            "neg_doc": ["Database management system optimization", "Compiler design principles"]
        },
        {
            "query": "distributed systems and scalability",
            "pos_doc": "Microservices architecture patterns for cloud deployment",
            "neg_doc": ["Mobile app UI design patterns", "Photography composition techniques"]
        }
    ]
    
    # Save validation data to JSONL format
    with open("embedding_validation.jsonl", "w") as f:
        for item in embedding_validation_data:
            f.write(json.dumps(item) + "\n")
    
    print("Created embedding validation dataset")
    

Upload Training Data#

The complete_dataset_upload function handles the complete workflow (a sketch of such a helper appears after the note below):

  • Creates namespaces in both Entity Store and Datastore

  • Registers the dataset in Entity Store

  • Creates the dataset repository in Datastore

  • Uploads training and validation files

# Upload embedding dataset with correct filenames
# This function automatically:
#   1. Creates namespaces in Entity Store and Datastore
#   2. Registers the dataset in Entity Store
#   3. Creates dataset repository in Datastore
#   4. Uploads training and validation files
repo_id = complete_dataset_upload(
    entity_host=os.environ['ENTITY_HOST'],
    ds_host=os.environ['DS_HOST'],
    namespace="default",
    dataset_name="embedding-dataset",
    training_file="embedding_train.jsonl",
    validation_file="embedding_validation.jsonl",
    description="Embedding model training dataset"
)
print(f"✅ Dataset uploaded successfully: {repo_id}")
print(f"✅ Dataset registered in Entity Store as: default/embedding-dataset")

Note

Entity Store Registration: This function automatically registers the dataset in Entity Store using entity_client.datasets.create(). You don’t need to register the dataset separately—both Entity Store and Datastore registration are handled in a single workflow.
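
If your environment doesn’t already define complete_dataset_upload, the following is a minimal sketch of what such a helper might look like. It assumes the huggingface_hub package from the prerequisites and uses the Datastore’s Hugging Face-compatible API; the Entity Store SDK calls (namespaces.create, datasets.create) and the training/validation file paths are assumptions that may differ in your deployment:

import os

from huggingface_hub import HfApi
from nemo_microservices import NeMoMicroservices

def complete_dataset_upload(entity_host, ds_host, namespace, dataset_name,
                            training_file, validation_file, description=""):
    """Register a dataset in Entity Store and upload its files to Datastore."""
    repo_id = f"{namespace}/{dataset_name}"

    # Create the namespace in Entity Store (assumed SDK call)
    entity_client = NeMoMicroservices(base_url=entity_host)
    try:
        entity_client.namespaces.create(id=namespace)
    except Exception:
        pass  # Namespace may already exist

    # Create the dataset repository in Datastore via its HF-compatible API;
    # this also creates the Datastore namespace if needed (assumption)
    hf_api = HfApi(endpoint=f"{ds_host}/v1/hf", token=os.environ.get("HF_TOKEN"))
    hf_api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

    # Upload training and validation files (assumed path layout)
    hf_api.upload_file(path_or_fileobj=training_file,
                       path_in_repo="training/training_file.jsonl",
                       repo_id=repo_id, repo_type="dataset")
    hf_api.upload_file(path_or_fileobj=validation_file,
                       path_in_repo="validation/validation_file.jsonl",
                       repo_id=repo_id, repo_type="dataset")

    # Register the dataset in Entity Store, pointing at the Datastore repository
    entity_client.datasets.create(
        name=dataset_name,
        namespace=namespace,
        description=description,
        files_url=f"hf://datasets/{repo_id}",
    )
    return repo_id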

Checkpoint

At this point, we’ve:

  • ✅ Created namespaces in Entity Store and Datastore

  • ✅ Registered the dataset in Entity Store

  • ✅ Uploaded training and validation files to Datastore

  • ✅ Prepared everything needed to create the customization job


Start Model Customization Job#

Important

The config field must include a version, for example: nvidia/llama-3.2-nv-embedqa-1b@v2+A100. Omitting the version will result in an error.

You can find the correct config URN (with version) by inspecting the output of the /customization/configs endpoint.
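
For example, if you kept the configs object from the listing step earlier, you can extract the versioned URN programmatically (the "embedqa" substring check is just an illustrative way to pick out the embedding config):

# Pull the versioned config URN from the earlier /customization/configs listing
embedding_configs = [c for c in configs.data if "embedqa" in c.name]
if embedding_configs:
    config_urn = embedding_configs[0].name  # e.g., "nvidia/llama-3.2-nv-embedqa-1b@v2+A100"
    print(f"Using config: {config_urn}")
else:
    print("Embedding config not found; check that the target is enabled.")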

Set Hyperparameters#

For embedding model training with lora_merged, we use specific hyperparameters optimized for contrastive learning:

# Hyperparameters for lora_merged embedding training
hyperparameters = {
    "training_type": "sft",
    "finetuning_type": "lora_merged",  # Key difference for embedding models
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 5e-5,
    "lora": {
        "adapter_dim": 16,
        "alpha": 32,
        "adapter_dropout": 0.01
    }
}

print("Hyperparameters for lora_merged training:")
print(json.dumps(hyperparameters, indent=2))

Create and Submit Training Job#

  1. Create the embedding model customization job:

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    client = NeMoMicroservices(
        base_url=os.environ['CUSTOMIZER_BASE_URL']
    )
    
    # Set up WandB API key for enhanced visualization
    extra_headers = {}
    if os.getenv('WANDB_API_KEY'):
        extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')
    
    # Create embedding model customization job
    job = client.customization.jobs.create(
        config="nvidia/llama-3.2-nv-embedqa-1b@v2+A100",  # Embedding model config
        dataset={
            "namespace": "default",
            "name": "embedding-dataset"
        },
        hyperparameters={
            "training_type": "sft",
            "finetuning_type": "lora_merged",
            "epochs": 3,
            "batch_size": 8,
            "learning_rate": 5e-5,
            "lora": {
                "adapter_dim": 16,
                "alpha": 32,
                "adapter_dropout": 0.01
            }
        },
        output_model="default/my-embedding-model@v1",
        extra_headers=extra_headers
    )
    
    print(f"Created embedding job:")
    print(f"  Job ID: {job.id}")
    print(f"  Status: {job.status}")
    print(f"  Output model: {job.output_model}")
    
    Alternatively, with curl:

    # Create embedding model customization job
    curl -X "POST" \
      "${CUSTOMIZER_BASE_URL}/v1/customization/jobs" \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -H "wandb-api-key: ${WANDB_API_KEY}" \
      -d '{
        "config": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
        "dataset": {"namespace": "default", "name": "embedding-dataset"},
        "hyperparameters": {
          "training_type": "sft",
          "finetuning_type": "lora_merged",
          "epochs": 3,
          "batch_size": 8,
          "learning_rate": 5e-5,
          "lora": {
            "adapter_dim": 16,
            "alpha": 32,
            "adapter_dropout": 0.01
          }
        },
        "output_model": "default/my-embedding-model@v1"
      }' | jq
    
  2. Review the response:

    Example Response
    {
      "id": "cust-Pi95UoDbNcqwgkruAB8LY6",
      "created_at": "2025-02-19T20:10:06.278132",
      "updated_at": "2025-02-19T20:10:06.278133",
      "namespace": "default",
      "config": {
        "schema_version": "1.0",
        "id": "58bee815-0473-45d7-a5e6-fc088f6142eb",
        "namespace": "nvidia",
        "created_at": "2025-02-19T20:10:06.454149",
        "updated_at": "2025-02-19T20:10:06.454160",
        "custom_fields": {},
        "name": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
        "base_model": "nvidia/llama-3.2-nv-embedqa-1b",
        "model_path": "llama-3_2-nv-embedqa-1b",
        "training_types": ["sft"],
        "finetuning_types": ["lora_merged"],
        "precision": "bf16",
        "num_gpus": 1,
        "num_nodes": 1,
        "micro_batch_size": 1,
        "tensor_parallel_size": 1,
        "max_seq_length": 4096
      },
      "dataset": { "namespace": "default", "name": "embedding-dataset" },
      "hyperparameters": {
        "finetuning_type": "lora_merged",
        "training_type": "sft",
        "batch_size": 8,
        "epochs": 3,
        "learning_rate": 5e-05,
        "lora": {
          "adapter_dim": 16,
          "alpha": 32,
          "adapter_dropout": 0.01
        }
      },
      "output_model": "default/my-embedding-model@v1",
      "status": "created",
      "custom_fields": {}
    }
    
  3. Copy the following values from the response:

    • id (Job ID)

    • output_model
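
If you’re using the Python SDK, you can capture these values directly from the job object; the monitoring and deployment snippets below assume job_id and output_model_urn are set:

# Keep the identifiers needed for monitoring and deployment
job_id = job.id
output_model_urn = job.output_model
print(f"job_id: {job_id}")
print(f"output_model_urn: {output_model_urn}")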

Monitor Job Status#

Monitor the training progress using the job ID:

import time

def monitor_job_progress(client, job_id, check_interval=60):
    """Monitor job progress with comprehensive status handling"""
    while True:
        job_details = client.customization.jobs.retrieve(job_id=job_id)
        
        print(f"Job status: {job_details.status}")
        
        if job_details.status == "completed":
            print("Training completed successfully!")
            break
        elif job_details.status in ["failed", "cancelled"]:
            print(f"Training finished with status: {job_details.status}")
            if job_details.status == "failed":
                print("Check the job logs for error details.")
            break
        elif job_details.status in ["created", "pending"]:
            print("Job is queued and waiting to start...")
        elif job_details.status == "running":
            print("Training is in progress...")
            # Show progress if available
            if hasattr(job_details, 'status_details') and job_details.status_details:
                if hasattr(job_details.status_details, 'percentage_done'):
                    print(f"Progress: {job_details.status_details.percentage_done:.1f}%")
        elif job_details.status == "cancelling":
            print("Job is being cancelled...")
        elif job_details.status in ["ready", "unknown"]:
            print(f"Job finished with status: {job_details.status}")
            break
        else:
            print(f"Unknown status: {job_details.status}")
        
        time.sleep(check_interval)
    
    return job_details

# Usage: final_job_status = monitor_job_progress(client, job_id)

Deploy the Model#

Once the job finishes, the lora_merged model behaves like a full-weight model and requires deployment through the Deployment Management Service.

Important

Unlike regular LoRA adapters, lora_merged models require a dedicated NIM deployment similar to full SFT models. The LoRA weights are merged into the base model during training, creating a complete model artifact.

Deploy Using Deployment Management Service#

import os
import time

from openai import OpenAI

# Assuming output_model_urn is available from the completed job
# output_model_urn = "default/my-embedding-model@v1" # Replace with actual output model URN

# Create deployment configuration
deployment_config = client.deployment.configs.create(
    name="embedding-deploy-config",
    namespace="default",
    description="Configuration for fine-tuned embedding model deployment",
    model=output_model_urn,
    nim_deployment={
        "image_name": "nvcr.io/nim/nvidia/nemo-embedder", # Specific NIM image for embedding models
        "image_tag": "1.0.0", # Use appropriate NIM version
        "gpu": 1 # Adjust based on model size and config
    }
)
print(f"Created deployment config: {deployment_config.name}")

# Deploy the fine-tuned model using the configuration
deployment = client.deployment.model_deployments.create(
    name="embedding-model-deployment",
    namespace="default",
    config="default/embedding-deploy-config"
)
print(f"Created deployment: {deployment.name}")
print(f"Status: {deployment.status_details.status}")

# Monitor deployment status with polling
def wait_for_deployment(client, deployment_name, namespace="default", timeout=1200):
    start_time = time.time()
    while True:
        if time.time() - start_time > timeout:
            raise RuntimeError(f"Deployment timeout after {timeout} seconds")
        deployment_status = client.deployment.model_deployments.retrieve(
            deployment_name, namespace=namespace
        )
        status = deployment_status.status_details.status
        elapsed = time.time() - start_time
        print(f"Deployment status: {status} after {elapsed:.1f}s")
        if status == "ready":
            print("✅ Deployment completed successfully!")
            break
        elif status in ["failed", "cancelled"]:
            raise RuntimeError(f"Deployment {status}")
        time.sleep(10)
    return deployment_status

final_deployment_status = wait_for_deployment(client, "embedding-model-deployment", namespace="default")
print(f"Model deployed as: {final_deployment_status.models}")

# Test using OpenAI-compatible client
openai_client = OpenAI(
    base_url=f"{NIM_PROXY_BASE_URL}/v1",
    api_key="not-used"
)

# Example inference for embedding model
response = openai_client.embeddings.create(
    model=output_model_urn,
    input=["This is a test query.", "Another sentence to embed."]
)
print("✅ Model inference successful!")
print(f"Response: {response.data[0].embedding[:5]}...") # Print first 5 dimensions

Verify Embeddings#

After deployment, verify that your fine-tuned embedding model is working correctly:

Basic Embedding Generation Test#

# Simple embedding verification using the SDK
import os

from nemo_microservices import NeMoMicroservices

# Initialize client for inference
inference_client = NeMoMicroservices(
    inference_base_url=os.environ['NIM_PROXY_BASE_URL']
)

# Test embedding generation
test_texts = [
    "machine learning algorithms",
    "deep learning neural networks",
    "quantum computing applications"
]

print("Testing embedding generation:")
for text in test_texts:
    try:
        response = inference_client.embeddings.create(
            input=text,
            model="default/my-embedding-model@v1"
        )
        embedding = response.data[0].embedding
        print(f"✅ Generated embedding for '{text}': {len(embedding)} dimensions")
    except Exception as e:
        print(f"❌ Error generating embedding for '{text}': {e}")

Comprehensive Embedding Verification#

For thorough verification, test the embedding quality using cosine similarity:

import json

import numpy as np

def comprehensive_embedding_verification(client, model_name, test_dataset_path):
    """Comprehensive embedding model verification"""

    print(f"🔍 Verifying embedding model: {model_name}")
    print("=" * 60)

    # Load test dataset
    test_data = []
    with open(test_dataset_path, "r") as f:
        for line in f:
            test_data.append(json.loads(line))

    print(f"📊 Loaded {len(test_data)} test examples")

    # Test 1: Basic embedding generation
    print("\n1️⃣ Testing basic embedding generation...")
    try:
        sample_text = test_data[0]["query"]
        response = client.embeddings.create(input=sample_text, model=model_name)
        embedding = response.data[0].embedding
        print(f"✅ Successfully generated {len(embedding)}-dimensional embedding")
    except Exception as e:
        print(f"❌ Failed to generate embeddings: {e}")
        return False

    # Test 2: Cosine similarity validation
    print("\n2️⃣ Testing cosine similarity patterns...")
    results = test_embedding_quality(client, model_name, test_data[:5])  # Helper sketched below; tests the first 5 examples

    # Test 3: Performance metrics
    print("\n3️⃣ Performance Summary:")
    differences = [r["difference"] for r in results]
    positive_count = sum(1 for d in differences if d > 0)

    print(f"  Positive differences: {positive_count}/{len(differences)}")
    print(f"  Average difference: {np.mean(differences):.4f}")
    print(f"  Min difference: {np.min(differences):.4f}")
    print(f"  Max difference: {np.max(differences):.4f}")

    # Success criteria
    success_rate = positive_count / len(differences)
    avg_diff = np.mean(differences)

    if success_rate >= 0.8 and avg_diff > 0.1:
        print(f"\n✅ Embedding model verification PASSED")
        print(f"   Success rate: {success_rate:.2%}")
        print(f"   Average improvement: {avg_diff:.4f}")
        return True
    else:
        print(f"\n⚠️ Embedding model verification needs attention")
        print(f"   Success rate: {success_rate:.2%} (target: ≥80%)")
        print(f"   Average improvement: {avg_diff:.4f} (target: >0.1)")
        return False

# Usage:
# success = comprehensive_embedding_verification(
#     client=inference_client,
#     model_name="default/my-embedding-model@v1",
#     test_dataset_path="embedding_test.jsonl"
# )
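
The test_embedding_quality helper called in step 2 isn’t shown above. A minimal sketch follows, assuming an OpenAI-compatible embeddings endpoint (as in the deployment test earlier) and the triplet format from this tutorial; it prints the per-query similarities shown in the example output below and returns the per-example differences:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def test_embedding_quality(client, model_name, test_data):
    """Check that each query embeds closer to its pos_doc than to its neg_docs."""
    results = []
    for item in test_data:
        # Embed the query, the positive document, and all negatives in one call
        texts = [item["query"], item["pos_doc"]] + item["neg_doc"]
        response = client.embeddings.create(input=texts, model=model_name)
        vectors = [d.embedding for d in response.data]

        query_vec, pos_vec, neg_vecs = vectors[0], vectors[1], vectors[2:]
        pos_sim = cosine_similarity(query_vec, pos_vec)
        avg_neg_sim = float(np.mean([cosine_similarity(query_vec, v) for v in neg_vecs]))

        print(f"Query: {item['query'][:60]}...")
        print(f"  Positive similarity: {pos_sim:.4f}")
        print(f"  Average negative similarity: {avg_neg_sim:.4f}")
        print(f"  Difference: {pos_sim - avg_neg_sim:.4f}\n")

        results.append({
            "query": item["query"],
            "positive_similarity": pos_sim,
            "average_negative_similarity": avg_neg_sim,
            "difference": pos_sim - avg_neg_sim,
        })
    return results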

Example Verification Output:

🔍 Verifying embedding model: default/my-embedding-model@v1
============================================================
📊 Loaded 5 test examples

1️⃣ Testing basic embedding generation...
✅ Successfully generated 1024-dimensional embedding

2️⃣ Testing cosine similarity patterns...
Query: machine learning algorithms for natural language processing...
  Positive similarity: 0.8234
  Average negative similarity: 0.3456
  Difference: 0.4778

Query: computer vision techniques for object detection...
  Positive similarity: 0.7891
  Average negative similarity: 0.2987
  Difference: 0.4904

Query: deep learning neural network architectures...
  Positive similarity: 0.8567
  Average negative similarity: 0.3123
  Difference: 0.5444

Query: natural language processing transformers...
  Positive similarity: 0.8012
  Average negative similarity: 0.2765
  Difference: 0.5247

Query: reinforcement learning algorithms...
  Positive similarity: 0.7734
  Average negative similarity: 0.3298
  Difference: 0.4436

3️⃣ Performance Summary:
  Positive differences: 5/5
  Average difference: 0.4962
  Min difference: 0.4436
  Max difference: 0.5444

✅ Embedding model verification PASSED
   Success rate: 100%
   Average improvement: 0.4962

Conclusion#

You have successfully fine-tuned an embedding model using the lora_merged approach and deployed it for inference. The fine-tuned model should show improved performance on your specific domain compared to the base model.

Key achievements:

  • ✅ Fine-tuned embedding model with domain-specific data

  • ✅ Deployed model using Deployment Management Service

  • ✅ Verified embedding quality with cosine similarity testing

  • ✅ Model ready for production use in RAG pipelines

If you included a WandB API key, you can view your training results at wandb.ai under the nvidia-nemo-customizer project.

Note

The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.

Next Steps#

  • Learn how to check customization job metrics using the job ID

  • Integrate your fine-tuned embedding model into RAG applications

  • Compare performance against the base model using your evaluation datasets

  • Consider fine-tuning additional embedding models for different domains