Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Llama Embedding#

The Llama Embedding Model is a variant of Llama designed to generate fixed-size sentence embeddings. Unlike the original Llama models, which target NLP tasks such as text generation, translation, and summarization, the Llama Embedding Model is fine-tuned from a base Llama model to produce meaningful, dense representations of entire sentences. This is achieved by replacing the causal masking mechanism with bidirectional attention. These embeddings can be used for tasks like sentence similarity, clustering, and information retrieval, where comparing sentence-level meaning is key. The Llama Embedding Model improves efficiency by enabling fast, scalable comparison of sentences using cosine similarity or other distance metrics.
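
As a quick illustration of that last point, comparing two sentence embeddings reduces to a single vector operation. The snippet below is a minimal sketch using PyTorch, with random vectors of an arbitrary dimension standing in for the model's actual outputs:

import torch
import torch.nn.functional as F

# Stand-in embeddings; in practice each vector would be produced by the
# fine-tuned embedding model for one sentence (dimension chosen arbitrarily).
embedding_a = torch.randn(1, 1024)
embedding_b = torch.randn(1, 1024)

# Cosine similarity lies in [-1, 1]; higher values indicate closer sentence meaning.
similarity = F.cosine_similarity(embedding_a, embedding_b).item()
print(f"cosine similarity: {similarity:.4f}")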

NeMo 2.0 Fine-Tuning Recipes#

Note

The fine-tuning recipes use the SpecterDataModule for the data argument. You can replace the SpecterDataModule with your custom dataset.

Alternatively, you can use the CustomRetrievalDataModule to load your own dataset. Details are provided in the sections below.

To import the Hugging Face model and convert it to the NeMo 2.0 format, run the following command (this only needs to be done once):

from nemo.collections import llm
llm.import_ckpt(model=llm.LlamaEmbeddingModel(llm.Llama32EmbeddingConfig1B()), source='hf://meta-llama/Llama-3.2-1B')

The example below shows how to invoke the default recipe and override the data argument:

from nemo.collections import llm

recipe = llm.recipes.llama_embedding_1b.finetune_recipe(
    name="llama_embed_model_finetuning",
    resume_path="path/to/original/ckpt",
    num_nodes=1,
    num_gpus_per_node=8,
)

# # To override the data argument
# dataloader = a_function_that_configures_your_custom_dataset(
#     gbs=gbs,
#     mbs=mbs,
#     seq_length=recipe.model.config.seq_length,
# )
# recipe.data = dataloader

Note

The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
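
For example, because the recipe itself is such a configuration object, individual fields can be inspected and overridden attribute-style before execution. The attribute paths below are illustrative and may differ between recipes:

# The recipe is a nested NeMo-Run configuration object, so individual fields can
# be overridden attribute-style before launching. These paths are illustrative;
# inspect your recipe to confirm its exact structure.
recipe.trainer.max_steps = 100
recipe.optim.config.lr = 1e-5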

We provide a customized dataset class specifically for RAG training. To load your own dataset, provide a .json file containing the query, positive document (pos_doc_key), and negative document (neg_doc_key) fields, as shown below:

import nemo_run as run

from nemo.collections import llm

# Replace the placeholder values below with your own paths and hyperparameters.
dataloader = run.Config(
    llm.CustomRetrievalDataModule,
    data_root='path/to/json/data',
    dataset_identifier='identifier to store the dataset',
    seq_length=sequence_length,
    micro_batch_size=micro_batch_size,
    global_batch_size=global_batch_size,
    tokenizer=tokenizer,
    num_workers=num_dataloader_workers,
    dataset_kwargs=model.get_dataset_kwargs(),
)
# Override the data argument with the customized retrieval data module
recipe.data = dataloader
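
For reference, a minimal input file could look like the sketch below. The query, pos_doc, and neg_doc key names are assumptions for illustration; make sure the keys in your file line up with the keys your CustomRetrievalDataModule configuration expects, and point data_root at the resulting file.

import json

# Hypothetical retrieval fine-tuning records: each entry pairs a query with one
# relevant (positive) document and one irrelevant (negative) document.
# The key names are illustrative and must match your data module's configured keys.
examples = [
    {
        "query": "What does the Llama Embedding Model produce?",
        "pos_doc": "It produces a fixed-size dense vector for an entire sentence.",
        "neg_doc": "The original Llama models were designed for text generation.",
    },
]

with open("my_retrieval_data.json", "w") as f:
    json.dump(examples, f, indent=2)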

Once your final configuration is ready, you can execute it on any of the NeMo-Run supported executors. The simplest option is the local executor, which runs the fine-tuning locally in a separate process. You can use it as follows:

import nemo_run as run

run.run(recipe, executor=run.LocalExecutor())

Alternatively, you can run it directly in the same Python process as follows:

run.run(recipe, direct=True)

Once training is finished, you can optionally convert the fine-tuned checkpoint to the Hugging Face format using the snippet below:

from pathlib import Path

from nemo.collections import llm

llm.export_ckpt(
    path=Path('path/to/finetuned/ckpt'),
    target='hf',
    output_path=Path('path/to/converted/hf/ckpt'),
)

Recipe                         Status
Llama3.2 1B Embedding Model    Yes