# Embedding Models

<a id="model-catalog-embedding" />

This page provides detailed technical specifications for the embedding model family supported by NeMo Customizer. For information about supported features and capabilities, refer to [Tested Models](/documentation/customizer-reference/models/model-catalog).

## Llama Nemotron Embedding 1B v2

| Property             | Value                                                                                                                                                                                                                                                                                        |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Creator              | NVIDIA                                                                                                                                                                                                                                                                                       |
| Architecture         | Transformer encoder (fine-tuned Llama 3.2 1B, 16 layers)                                                                                                                                                                                                                                     |
| Description          | Optimized for **multilingual and cross-lingual** text question-answering retrieval with support for long documents (up to 8192 tokens) and dynamic embedding size (Matryoshka Embeddings). Reduces data storage footprint by 35x through dynamic embedding sizing. Ready for commercial use. |
| Max Sequence Length  | 8192                                                                                                                                                                                                                                                                                         |
| Embedding Dimensions | 2048 (configurable: 384, 512, 768, 1024, or 2048)                                                                                                                                                                                                                                            |
| Parameters           | 1 billion                                                                                                                                                                                                                                                                                    |
| Supported Languages  | English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish (26 languages)                                           |
| Training Data        | Semi-supervised pre-training on 12M samples and fine-tuning on 1M samples from public QA datasets with commercial licenses                                                                                                                                                                   |
| License              | [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/), [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/)                                                                                                  |
| Default Name         | nvidia/llama-nemotron-embed-1b-v2                                                                                                                                                                                                                                                            |
| HuggingFace          | [nvidia/llama-nemotron-embed-1b-v2](https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2)                                                                                                                                                                                                |
| NIM                  | [nvidia/llama-nemotron-embed-1b-v2](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/llama-nemotron-embed-1b-v2)                                                                                                                                                              |
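
Matryoshka embeddings are trained so that a leading prefix of the vector is still a usable embedding, which is what makes the smaller sizes in the table possible. The following is a minimal client-side sketch of that idea, assuming you already hold a full 2048-dimensional vector and want a 384-dimensional one. The `truncate_embedding` helper and the stand-in vector are illustrative and are not part of the NeMo Platform SDK; whether the NIM can return reduced dimensions directly is not covered here.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if dim <= 0 or dim > len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


# Stand-in for a 2048-dimensional embedding (illustrative values only).
full = [math.sin(i + 1) for i in range(2048)]

small = truncate_embedding(full, 384)
print(f"Truncated dimension: {len(small)}")
```

Re-normalizing after truncation keeps cosine and dot-product scores comparable across vectors of the same reduced size.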

### Model Entity Configuration

Create a Model Entity for this embedding model:

| Configuration        | Value                             |
| -------------------- | --------------------------------- |
| Workspace            | default                           |
| Name                 | llama-nemotron-embed-1b-v2        |
| Base Model           | nvidia/llama-nemotron-embed-1b-v2 |
| Number of Parameters | 1,000,000,000                     |
| Precision            | bf16-mixed                        |

### Training Options

* **LoRA (merged)**: 1x 80GB GPU, tensor parallel size 1
* **Full SFT**: 1x 80GB GPU, tensor parallel size 1

For LoRA training, embedding models support only **merged LoRA** (`peft` with `merge=True`). Unmerged LoRA adapters are not supported because the embedding NIM requires ONNX format, which cannot represent standalone adapters.
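
To illustrate what merging means, the sketch below folds a low-rank update into a base weight matrix using the standard LoRA formulation, `W' = W + (alpha / rank) * B @ A`. It is a toy example in plain Python, not the Customizer implementation; the matrices, `alpha`, and `rank` values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) as a single dense matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # base weight, shape 2x3
lora_b = [[1.0], [2.0]]                 # adapter B, shape 2x1 (rank 1)
lora_a = [[0.5, 0.0, -0.5]]             # adapter A, shape 1x3

merged = merge_lora(w, lora_b, lora_a, alpha=2.0, rank=1)
print(merged)
```

After merging, the result has the same shape as the base weight and no separate adapter tensors remain, so the model can be exported as a single set of weights.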

### Resource Requirements

* **Minimum GPU Memory**: 80GB
* **Recommended GPU**: A100
* **Training Time**: Varies based on dataset size and epochs

### Hyperparameter and Data Recommendations

This fine-tuning recipe supports full fine-tuning, which updates all 1 billion parameters and therefore requires careful hyperparameter and data selection to prevent overfitting.

The following table lists conservative defaults for hyperparameters and training data size, chosen to reduce the risk of overfitting in embedding models:

| Parameter          | API Field Name  | Type      | Description                                                                                         | Recommended Value       |
| ------------------ | --------------- | --------- | --------------------------------------------------------------------------------------------------- | ----------------------- |
| Learning Rate      | `learning_rate` | `number`  | Step size for updating model parameters. Lower values help prevent overfitting in embedding models. | `5e-6`                  |
| Weight Decay       | `weight_decay`  | `number`  | Regularization parameter to prevent overfitting by penalizing large weights.                        | `0.01`                  |
| Number of Epochs   | `epochs`        | `integer` | Number of complete passes through the training dataset. Limited to prevent overfitting.             | `1`                     |
| Training Data Size | N/A             | N/A       | Number of training examples; a modest dataset limits overfitting while preserving performance.      | `5,000-10,000` examples |
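
As a rough sketch, the recommended values can be collected in a dictionary keyed by the API field names from the table, together with a helper that caps a dataset at the upper end of the recommended size. The `cap_training_examples` helper and the record layout are hypothetical, and how the dictionary is nested inside a customization job request is not shown here.

```python
import random

# Recommended defaults, keyed by the API field names from the table above.
RECOMMENDED_HYPERPARAMETERS = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "epochs": 1,
}

# Upper end of the recommended 5,000-10,000 example range.
MAX_TRAINING_EXAMPLES = 10_000


def cap_training_examples(examples, limit=MAX_TRAINING_EXAMPLES, seed=0):
    """Return at most `limit` examples, sampled reproducibly without replacement."""
    if len(examples) <= limit:
        return list(examples)
    return random.Random(seed).sample(examples, limit)


# Hypothetical records; the real dataset schema is described elsewhere.
examples = [{"query": f"question {i}"} for i in range(25_000)]
subset = cap_training_examples(examples)
print(f"Training on {len(subset)} of {len(examples)} examples")
```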

NVIDIA recommends evaluating fine-tuned embedding models against the baseline to detect overfitting and potential performance degradation.
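
One possible way to run such a comparison is to compute recall@k on a held-out set of query and passage pairs, once with baseline embeddings and once with fine-tuned embeddings. The sketch below uses tiny made-up 2-dimensional vectors in place of real model output, and the metric choice is an assumption rather than a documented NVIDIA procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(query_vecs, passage_vecs, relevant, k):
    """Fraction of queries whose relevant passage index appears in the top k."""
    hits = 0
    for qi, query in enumerate(query_vecs):
        ranked = sorted(
            range(len(passage_vecs)),
            key=lambda pi: cosine(query, passage_vecs[pi]),
            reverse=True,
        )
        if relevant[qi] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy vectors standing in for real embeddings (illustrative only).
passages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relevant = [0, 1]  # index of the correct passage for each query
baseline_queries = [[0.9, 0.1], [0.7, 0.6]]
finetuned_queries = [[0.9, 0.1], [0.1, 0.9]]

baseline_recall = recall_at_k(baseline_queries, passages, relevant, k=1)
finetuned_recall = recall_at_k(finetuned_queries, passages, relevant, k=1)
print(f"baseline recall@1={baseline_recall}, fine-tuned recall@1={finetuned_recall}")
```

A fine-tuned score that falls below the baseline on held-out data is a signal to revisit the learning rate, number of epochs, or dataset size.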

### Deployment Configuration

* **Full SFT and LoRA (merged)**:
  * NIM Image: `nvcr.io/nim/nvidia/llama-nemotron-embed-1b-v2:1.13.0`
  * GPU Count: 1x 80GB

### Deployment and Inference

This model supports inference deployment through NVIDIA Inference Microservices (NIM). After customization, access your model through the **Inference Gateway**:

1. **Deploy the model**: Create a ModelDeploymentConfig and ModelDeployment to deploy your fine-tuned model. Refer to the [models and inference documentation](/documentation/models-and-inference) for details.
2. **Access through Inference Gateway**: The Inference Gateway provides unified access to all deployed models. The routing patterns for embeddings include:

   * **Model Entity routing**: `/v2/workspaces/{workspace}/inference/gateway/model/{name}/-/v1/embeddings`
   * **Provider routing**: `/v2/workspaces/{workspace}/inference/gateway/provider/{deployment}/-/v1/embeddings`

```python
import os
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

# Get pre-configured OpenAI client for inference
oai_client = client.models.get_openai_client()
```
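
The routing patterns listed above are path templates. The sketch below expands them into full URLs for the default workspace, assuming the same base URL as the client example; `embeddings_url` is a hypothetical helper, not an SDK function.

```python
# Path templates copied from the routing patterns above.
MODEL_ENTITY_ROUTE = (
    "/v2/workspaces/{workspace}/inference/gateway/model/{name}/-/v1/embeddings"
)
PROVIDER_ROUTE = (
    "/v2/workspaces/{workspace}/inference/gateway/provider/{deployment}/-/v1/embeddings"
)


def embeddings_url(base_url, template, **params):
    """Fill a route template and join it to the platform base URL."""
    return base_url.rstrip("/") + template.format(**params)


model_url = embeddings_url(
    "http://localhost:8080",
    MODEL_ENTITY_ROUTE,
    workspace="default",
    name="llama-nemotron-embed-1b-v2",
)
provider_url = embeddings_url(
    "http://localhost:8080",
    PROVIDER_ROUTE,
    workspace="default",
    deployment="my-embedding-deployment",
)
print(model_url)
print(provider_url)
```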

The embedding model requires NIM container images that support embedding inference. When the deployment reaches `READY` state, a ModelProvider is automatically created for routing inference requests.

### Example Usage

After fine-tuning and deployment, you can use the model for embedding tasks:

```python
import os
from nemo_platform import NeMoPlatform

client = NeMoPlatform(
    base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
    workspace="default",
)

MODEL_NAME = "llama-nemotron-embed-1b-v2"
DEPLOYMENT_NAME = "my-embedding-deployment"

# Call embeddings via Inference Gateway
response = client.inference.gateway.provider.post(
    "v1/embeddings",
    name=DEPLOYMENT_NAME,
    workspace="default",
    body={
        "model": MODEL_NAME,
        "input": ["What is the capital of France?"],
        "input_type": "query",
    },
)

embedding = response["data"][0]["embedding"]
print(f"Embedding dimension: {len(embedding)}")
```
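
For retrieval, queries and documents are typically embedded separately and then compared. The sketch below builds request bodies in the same shape as the example above and ranks passages against a query by dot product. It assumes that `"passage"` is accepted as an `input_type` alongside `"query"`, and it uses mocked responses with made-up vectors instead of calling a deployment; the helper names are hypothetical.

```python
def build_embedding_body(model, texts, input_type):
    """Build a request body shaped like the embeddings example above."""
    # Assumption: "passage" is a valid input_type in addition to "query".
    if input_type not in ("query", "passage"):
        raise ValueError(f"unsupported input_type: {input_type!r}")
    return {"model": model, "input": list(texts), "input_type": input_type}


def extract_embeddings(response):
    """Pull the embedding vectors out of a response dictionary."""
    return [item["embedding"] for item in response["data"]]


def rank_by_dot(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar."""
    scores = [sum(q * p for q, p in zip(query_vec, vec)) for vec in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


MODEL_NAME = "llama-nemotron-embed-1b-v2"
query_body = build_embedding_body(
    MODEL_NAME, ["What is the capital of France?"], "query"
)
passage_body = build_embedding_body(
    MODEL_NAME,
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
    "passage",
)

# Mocked responses with illustrative vectors; a real call returns model output.
mock_query_response = {"data": [{"embedding": [1.0, 0.0]}]}
mock_passage_response = {
    "data": [{"embedding": [0.8, 0.6]}, {"embedding": [0.1, 0.99]}]
}

order = rank_by_dot(
    extract_embeddings(mock_query_response)[0],
    extract_embeddings(mock_passage_response),
)
print(f"Passage ranking: {order}")
```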

For detailed fine-tuning instructions, refer to the [Embedding Customization tutorial](../tutorials/embedding-customization-job.ipynb).

For more information about formatting training datasets for the embedding model, refer to [Dataset Format Requirements](/documentation/customizer-reference/models/dataset-format).