> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# Llama (Bidirectional) for Embedding

NeMo AutoModel provides a bidirectional variant of [Meta's Llama](https://www.llama.com/) for embedding and dense retrieval tasks. Unlike the standard causal (left-to-right) Llama used for text generation, this variant uses **bidirectional attention**, so each token can attend to both past and future tokens in the sequence, producing richer representations for semantic similarity and dense retrieval.

For the cross-encoder variant, see [Llama (Bidirectional) for Reranking](/model-coverage/reranking-models/llama-bidirectional).

|                  |                                                 |
| ---------------- | ----------------------------------------------- |
| **Tasks**        | Embedding, Dense Retrieval                      |
| **Architecture** | `LlamaBidirectionalModel`                       |
| **Parameters**   | 1B – 8B                                         |
| **HF Org**       | [meta-llama](https://huggingface.co/meta-llama) |

## Available Models

Any Llama checkpoint can be loaded as a bidirectional backbone. The following configurations are tested:

* **Llama 3.2 1B** — fast iteration, fits on a single GPU
* **Llama 3.1 8B** — higher-quality embeddings for production use

## Embedding Models

The bidirectional bi-encoder path is used for embedding generation and dense retrieval.

| Architecture              | Task      | Auto Class                                                                                                                                                         | Description                                                |
| ------------------------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------- |
| `LlamaBidirectionalModel` | Embedding | [`NeMoAutoModelBiEncoder`](https://github.com/NVIDIA-NeMo/Automodel/blob/8dc00dcb4a35c2413c52c6e7eb7ac8f1c24836aa/nemo_automodel/_transformers/auto_model.py#L991) | Bidirectional Llama with mean pooling for dense embeddings |

## Pooling Strategies

The bi-encoder supports multiple pooling strategies to aggregate token representations into a single embedding vector:

| Strategy       | Description                                  |
| -------------- | -------------------------------------------- |
| `avg`          | Average of all token hidden states (default) |
| `cls`          | First token hidden state                     |
| `last`         | Last non-padding token hidden state          |
| `weighted_avg` | Weighted average of token hidden states      |

## Example HF Models

| Model        | HF ID                                                                       |
| ------------ | --------------------------------------------------------------------------- |
| Llama 3.2 1B | [`meta-llama/Llama-3.2-1B`](https://huggingface.co/meta-llama/Llama-3.2-1B) |
| Llama 3.1 8B | [`meta-llama/Llama-3.1-8B`](https://huggingface.co/meta-llama/Llama-3.1-8B) |

## Example Recipes

| Recipe                                                                                                                                                                   | Description                                                                                                                                                                                                                                  |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [llama3\_2\_1b.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/retrieval/bi_encoder/llama3_2_1b.yaml)                                                  | Bi-encoder — Llama 3.2 1B embedding model                                                                                                                                                                                                    |
| [llama\_embed\_nemotron\_8b.yaml](https://github.com/NVIDIA-NeMo/Automodel/blob/main/examples/retrieval/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml) | Bi-encoder — reproduction recipe for [`nvidia/llama-embed-nemotron-8b`](https://huggingface.co/nvidia/llama-embed-nemotron-8b) (uses [`nvidia/embed-nemotron-dataset-v1`](https://huggingface.co/datasets/nvidia/embed-nemotron-dataset-v1)) |

## Try with NeMo AutoModel

**1. Install NeMo AutoModel**. Refer to the ([Installation Guide](/get-started/installation)) for information:

```bash
uv pip install nemo-automodel
```

**2. Clone the repo** to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```

**3. Run the recipe** from inside the repo:

```bash
torchrun --nproc-per-node=8 examples/retrieval/bi_encoder/finetune.py --config examples/retrieval/bi_encoder/llama3_2_1b.yaml
```

**1. Pull the container** and mount a checkpoint directory:

```bash
docker run --gpus all -it --rm \
  --shm-size=8g \
  -v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
  nvcr.io/nvidia/nemo-automodel:26.04.00
```

**2. Navigate** to the AutoModel directory (where the recipes are):

```bash
cd /opt/Automodel
```

**3. Run the recipe**:

```bash
torchrun --nproc-per-node=8 examples/retrieval/bi_encoder/finetune.py --config examples/retrieval/bi_encoder/llama3_2_1b.yaml
```

See the [Installation Guide](/get-started/installation).

## Hugging Face Model Card

NVIDIA trained and released the `Llama Nemotron Embedding 1B` model, which leverages a bidirectional attention mechanism for multilingual and cross-lingual question–answer retrieval. The model supports long documents (up to 8,192 tokens) and dynamic embedding sizes via Matryoshka embeddings. For more details, see the model card on Hugging Face.

* [nvidia/llama-nemotron-embed-1b-v2](https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2)
* [nvidia/llama-embed-nemotron-8b](https://huggingface.co/nvidia/llama-embed-nemotron-8b)