Llama (Bidirectional) for Embedding

NeMo AutoModel provides a bidirectional variant of Meta’s Llama for embedding and dense retrieval tasks. The standard Llama used for text generation is causal (left-to-right): each token attends only to itself and earlier tokens. This variant uses bidirectional attention instead, so each token can attend to both past and future tokens in the sequence, which produces richer representations for semantic similarity and dense retrieval.
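
The difference between the two attention patterns can be illustrated with a toy mask. The sketch below is not NeMo AutoModel code; `attention_mask` is a hypothetical helper written only for this example, and it ignores padding.

```python
from typing import List


def attention_mask(seq_len: int, causal: bool) -> List[List[int]]:
    """Return mask[i][j] = 1 if token i may attend to token j, else 0."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Causal: row i only sees columns 0..i (lower-triangular pattern).
causal_mask = attention_mask(3, causal=True)

# Bidirectional: every token sees every other token.
bidirectional_mask = attention_mask(3, causal=False)

for name, mask in (("causal", causal_mask), ("bidirectional", bidirectional_mask)):
    print(name)
    for row in mask:
        print(" ", row)
```

In the causal mask the first token sees only itself, so its hidden state carries no information about the rest of the sentence. With the bidirectional mask every position is computed from the full sequence, which is what makes pooled token states usable as a sentence embedding.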

For the cross-encoder variant, see Llama (Bidirectional) for Reranking.

Tasks: Embedding, Dense Retrieval
Architecture: LlamaBidirectionalModel
Parameters: 1B – 8B
HF Org: meta-llama

Available Models

Any Llama checkpoint can be loaded as a bidirectional backbone. The following configurations are tested:

  • Llama 3.2 1B — fast iteration, fits on a single GPU
  • Llama 3.1 8B — higher-quality embeddings for production use

Embedding Models

The bidirectional bi-encoder path is used for embedding generation and dense retrieval.

| Architecture | Task | Auto Class | Description |
| --- | --- | --- | --- |
| LlamaBidirectionalModel | Embedding | NeMoAutoModelBiEncoder | Bidirectional Llama with mean pooling for dense embeddings |
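
To make the bi-encoder idea concrete: queries and documents are embedded independently, and retrieval reduces to comparing vectors. The sketch below uses hand-written toy vectors in place of model outputs and assumes cosine similarity as the score; the helper names `cosine` and `rank` are hypothetical and not part of NeMo AutoModel.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query: Sequence[float], docs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs, best match first."""
    scores = [(i, cosine(query, d)) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for embeddings; a real pipeline would get these from the model.
query_emb = [1.0, 0.0]
doc_embs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]

ranking = rank(query_emb, doc_embs)
print(ranking)
```

Because document embeddings do not depend on the query, they can be computed once and indexed ahead of time. A cross-encoder reranker, by contrast, scores each query and document pair jointly.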

Pooling Strategies

The bi-encoder supports multiple pooling strategies to aggregate token representations into a single embedding vector:

| Strategy | Description |
| --- | --- |
| avg | Average of all token hidden states (default) |
| cls | First token hidden state |
| last | Last non-padding token hidden state |
| weighted_avg | Weighted average of token hidden states |
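
The four strategies can be sketched in plain Python as follows. This is an illustration, not the library's implementation: the `pool` function is hypothetical, it assumes right padding (real tokens first), it assumes `avg` skips padding positions, and the linear position weights used for `weighted_avg` are an assumption, since the weighting scheme is not specified here.

```python
from typing import List

Vector = List[float]


def pool(hidden: List[Vector], mask: List[int], strategy: str = "avg") -> Vector:
    """Collapse per-token vectors into one embedding.

    hidden: one vector per token position.
    mask:   1 for a real token, 0 for padding (right padding assumed).
    """
    real = [i for i, m in enumerate(mask) if m == 1]
    if not real:
        raise ValueError("sequence has no non-padding tokens")

    if strategy == "cls":
        return list(hidden[0])          # first token
    if strategy == "last":
        return list(hidden[real[-1]])   # last non-padding token

    if strategy == "avg":
        weights = {i: 1.0 for i in real}
    elif strategy == "weighted_avg":
        # Assumption: weights grow linearly with position (1, 2, 3, ...).
        weights = {i: float(rank + 1) for rank, i in enumerate(real)}
    else:
        raise ValueError(f"unknown pooling strategy: {strategy}")

    total = sum(weights.values())
    dim = len(hidden[0])
    return [sum(w * hidden[i][d] for i, w in weights.items()) / total for d in range(dim)]


# Two real tokens followed by one padding position.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
padding_mask = [1, 1, 0]

for name in ("avg", "cls", "last", "weighted_avg"):
    print(name, pool(hidden_states, padding_mask, name))
```

Note that the padding position never contributes to `avg`, `last`, or `weighted_avg` in this sketch; ignoring the mask would let padding vectors skew the embedding.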

Example HF Models

| Model | HF ID |
| --- | --- |
| Llama 3.2 1B | meta-llama/Llama-3.2-1B |
| Llama 3.1 8B | meta-llama/Llama-3.1-8B |

Example Recipes

| Recipe | Description |
| --- | --- |
| llama3_2_1b.yaml | Bi-encoder: Llama 3.2 1B embedding model |
| llama_embed_nemotron_8b.yaml | Bi-encoder: reproduction recipe for nvidia/llama-embed-nemotron-8b (uses nvidia/embed-nemotron-dataset-v1) |

Try with NeMo AutoModel

1. Install NeMo AutoModel. Refer to the Installation Guide for details:

$ uv pip install nemo-automodel

2. Clone the repo to get the example recipes:

$ git clone https://github.com/NVIDIA-NeMo/Automodel.git
$ cd Automodel

3. Run the recipe from inside the repo:

$ torchrun --nproc-per-node=8 examples/retrieval/bi_encoder/finetune.py --config examples/retrieval/bi_encoder/llama3_2_1b.yaml

Alternatively, run the recipe inside the NeMo AutoModel container:

1. Pull the container and mount a checkpoint directory:

$ docker run --gpus all -it --rm \
    --shm-size=8g \
    -v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
    nvcr.io/nvidia/nemo-automodel:26.04.00

2. Navigate to the AutoModel directory (where the recipes are):

$ cd /opt/Automodel

3. Run the recipe:

$ torchrun --nproc-per-node=8 examples/retrieval/bi_encoder/finetune.py --config examples/retrieval/bi_encoder/llama3_2_1b.yaml

See the Installation Guide.

Hugging Face Model Card

NVIDIA trained and released the Llama Nemotron Embedding 1B model, which leverages a bidirectional attention mechanism for multilingual and cross-lingual question–answer retrieval. The model supports long documents (up to 8,192 tokens) and dynamic embedding sizes via Matryoshka embeddings. For more details, see the model card on Hugging Face.
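
Dynamic embedding sizes with Matryoshka-style embeddings usually work by keeping a prefix of the full vector and re-normalizing it. The sketch below shows that general technique on a toy vector; `truncate_embedding` is a hypothetical helper, and the model card remains the authority on which output sizes the released model actually supports.

```python
import math
from typing import List


def truncate_embedding(embedding: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # An all-zero prefix cannot be normalized; return it unchanged.
    return [x / norm for x in head] if norm else head


# Toy 4-dimensional vector standing in for a full-size model embedding.
full = [3.0, 4.0, 12.0, 0.5]
small = truncate_embedding(full, 2)
print(small)
```

Re-normalizing after truncation keeps cosine and dot-product scores comparable across sizes. Queries and documents need to be truncated to the same dimension before they are compared.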