Embedding Models

This page provides detailed technical specifications for the embedding model family supported by NeMo Customizer. For information about supported features and capabilities, refer to Tested Models.

Llama 3.2 NV EmbedQA 1B v2

| Property | Value |
|---|---|
| Creator | NVIDIA |
| Architecture | transformer |
| Description | Llama 3.2 NV EmbedQA 1B v2 is a specialized embedding model optimized for question-answering and retrieval tasks. |
| Max Sequence Length | 2048 |
| Parameters | 1 billion |
| Training Data | Specialized QA and retrieval datasets |
| Recommended GPUs for Customization | 1 |
| Default Name | nvidia/llama-3.2-nv-embedqa-1b-v2 |
| Base Model | nvidia/llama-3.2-nv-embedqa-1b-v2 |
| NGC Model URI | ngc://nvidia/nemo/llama-3_2-1b-embedding-base:0.0.1 |
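
The default name above is the identifier used in inference requests. As a quick check that the model is being served, you can list the models exposed by the NIM Proxy. This is a minimal sketch: the nim_proxy_url value is a placeholder for your deployment, and it assumes the proxy exposes the OpenAI-style /v1/models listing.

# Minimal sketch: confirm the embedding model is visible through NIM Proxy.
import requests

nim_proxy_url = "http://nemo-nim-proxy:8000"  # placeholder; use your NIM Proxy base URL

models = requests.get(f"{nim_proxy_url}/v1/models").json()
print([m["id"] for m in models.get("data", [])])
# Expect "nvidia/llama-3.2-nv-embedqa-1b-v2" in the list once the NIM is deployed.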

Customization Target Configuration

The following configuration is used for the customization target:

| Configuration | Value |
|---|---|
| Namespace | nvidia |
| Name | llama-3.2-nv-embedqa-1b@v2 |
| Model Path | llama32_1b-embedding |
| Base Model | nvidia/llama-3.2-nv-embedqa-1b-v2 |
| Number of Parameters | 1,000,000,000 |
| Precision | bf16-mixed |
| Enabled | Configurable (default: false) |
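
To check whether this customization configuration is registered and enabled in your environment, you can query the Customizer API. The sketch below is illustrative only: customizer_url is a placeholder, and the /v1/customization/configs listing path should be verified against your NeMo Customizer API reference.

# Minimal sketch: look up the embedding customization config by namespace and name.
import requests

customizer_url = "http://nemo-customizer:8000"  # placeholder; use your Customizer base URL

configs = requests.get(f"{customizer_url}/v1/customization/configs").json()
for cfg in configs.get("data", []):
    if cfg.get("namespace") == "nvidia" and cfg.get("name") == "llama-3.2-nv-embedqa-1b@v2":
        print(cfg)  # inspect "enabled" and related fields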

Training Configuration

The model supports the following training configuration:

| Training Option | Value |
|---|---|
| Training Type | SFT (Supervised Fine-Tuning) |
| Fine-tuning Type | All Weights |
| Number of GPUs | 1 |
| Number of Nodes | 1 |
| Tensor Parallel Size | 4 |
| Micro Batch Size | 8 |
| Max Sequence Length | 2048 |
| Prompt Template | {prompt} {completion} |
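
The prompt template concatenates each training record's prompt and completion fields into a single training string. The snippet below illustrates the rendering with made-up record contents:

# How the "{prompt} {completion}" template renders one training record (illustrative data).
record = {
    "prompt": "What is the capital of France?",
    "completion": "Paris is the capital of France.",
}
rendered = "{prompt} {completion}".format(**record)
print(rendered)  # "What is the capital of France? Paris is the capital of France."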

Resource Requirements

  • Minimum GPU Memory: 8 GB

  • Recommended GPU: A100

  • Training Time: Varies based on dataset size and epochs

Hyperparameter and Data Recommendations

This fine-tuning recipe supports full fine-tuning, updating all 1 billion parameters, and requires careful hyperparameter and data selection to prevent overfitting.

The following table lists conservative hyperparameter defaults chosen to reduce the risk of overfitting when fine-tuning this embedding model:

| Parameter | API Field Name | Type | Description | Recommended Value |
|---|---|---|---|---|
| Learning Rate | learning_rate | number | Step size for updating model parameters. Lower values help prevent overfitting in embedding models. | 5e-6 |
| Weight Decay | weight_decay | number | Regularization parameter that discourages overfitting by penalizing large weights. | 0.01 |
| Number of Epochs | epochs | integer | Number of complete passes through the training dataset. Kept low to prevent overfitting. | 1 |
| Training Data Size | N/A | N/A | Number of training examples that balances overfitting risk against model performance. | 5,000-10,000 examples |
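
These defaults map onto the hyperparameters object of a customization job request. The following is a minimal sketch rather than an authoritative request: customizer_url, the /v1/customization/jobs path, the dataset name, and the exact training_type and finetuning_type string values are assumptions to verify against the Fine-tuning tutorial and API reference.

# Minimal sketch: start an SFT job using the conservative defaults above.
import requests

customizer_url = "http://nemo-customizer:8000"  # placeholder; use your Customizer base URL

job = requests.post(
    f"{customizer_url}/v1/customization/jobs",
    json={
        "config": "nvidia/llama-3.2-nv-embedqa-1b@v2",
        "dataset": {"namespace": "default", "name": "my-embedding-dataset"},  # hypothetical dataset
        "hyperparameters": {
            "training_type": "sft",            # assumed API value for SFT
            "finetuning_type": "all_weights",  # assumed API value for all-weights fine-tuning
            "epochs": 1,
            "learning_rate": 5e-6,
            "weight_decay": 0.01,
        },
    },
).json()
print(job.get("id"))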

NVIDIA recommends evaluating fine-tuned embedding models against the baseline to detect overfitting and potential performance degradation.
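
A lightweight way to make this comparison is to embed a held-out set of query and positive-passage pairs with both the base and fine-tuned models and compare the average query-passage cosine similarity. The sketch below assumes both models are reachable through the NIM Proxy embeddings endpoint; nim_proxy_url and the fine-tuned model name are placeholders.

# Minimal sketch: compare query/passage cosine similarity for the base vs. fine-tuned model.
import math
import requests

nim_proxy_url = "http://nemo-nim-proxy:8000"  # placeholder; use your NIM Proxy base URL

def embed(model, texts):
    resp = requests.post(
        f"{nim_proxy_url}/v1/embeddings",
        json={"model": model, "input": texts, "encoding_format": "float"},
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

pairs = [("What is the capital of France?", "Paris is the capital of France.")]  # held-out pairs

for model in ("nvidia/llama-3.2-nv-embedqa-1b-v2", "default/my-finetuned-embedqa"):  # second name is hypothetical
    scores = [cosine(*embed(model, [query, passage])) for query, passage in pairs]
    print(model, sum(scores) / len(scores))

Average similarity on held-out positives is only a rough signal; for a more rigorous check, compare retrieval metrics such as recall@k on an evaluation set.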

Deployment with NIM

This model supports inference deployment through NVIDIA Inference Microservices (NIM). To deploy this model for inference:

  1. Deploy using Deployment Management Service: Follow the Deploy NIM guide to deploy the base model (see the sketch after this list).

  2. Access through NIM Proxy: Once deployed, the model can be accessed through the NIM Proxy service.

  3. Fine-tuned models: After customization, the fine-tuned model can be deployed following the same NIM deployment process.
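
As a rough illustration of step 1, the request below asks the Deployment Management Service to deploy the base embedding model NIM. Treat it strictly as a sketch: deployment_mgmt_url, the endpoint path, and the payload fields are assumptions, so follow the Deploy NIM guide for the exact schema and container image settings.

# Minimal sketch: request a NIM deployment of the base embedding model.
import requests

deployment_mgmt_url = "http://nemo-deployment-management:8000"  # placeholder base URL

deployment = requests.post(
    f"{deployment_mgmt_url}/v1/deployment/model-deployments",  # assumed endpoint path
    json={
        "name": "llama-embedqa-1b",  # hypothetical deployment name
        "namespace": "nvidia",
        "config": {
            "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
            "nim_deployment": {"gpus_per_node": 1},  # assumed field names
        },
    },
).json()
print(deployment)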

Note

The embedding model requires specific NIM container images that support embedding inference. Refer to the NIM compatibility matrix for supported image versions.

Model Name Mapping

When using this model, note the following name mapping:

  • API/External Name: nvidia/llama-3.2-nv-embedqa-1b-v2 (used in inference requests and external documentation)

Example Usage

After fine-tuning and deployment, you can use the model for embedding tasks:

# Example inference call through NIM Proxy
import requests

nim_proxy_url = "http://nemo-nim-proxy:8000"  # replace with your NIM Proxy base URL

response = requests.post(
    f"{nim_proxy_url}/v1/embeddings",
    headers={"Content-Type": "application/json"},
    json={
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": ["What is the capital of France?"],
        "encoding_format": "float"
    }
)
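
Assuming the endpoint returns an OpenAI-compatible embeddings response, the vector can be read from the data field; verify the exact schema against your NIM version.

# Extract the embedding vector from the (assumed) OpenAI-style response body.
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # dimensionality of the returned vector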

For detailed fine-tuning instructions, refer to the Fine-tuning tutorial.

For more information about formatting training datasets for the embedding model, refer to Dataset Format Requirements.