GPT Embedding Models

Recent research has shown the feasibility of training embedding models using Decoder-Only (GPT-style) architectures. For example, the paper Improving Text Embeddings with Large Language Models is one such recent work, and it served as the inspiration for implementing decoder-only embedding training in NeMo.

Training a GPT Embedding Model

When training GPT Embedding models, the process is quite similar to SBERT Embedding training, with a few distinctions. For GPT Embedding model training, you need a JSONL file in which each line is a JSON object. Below is a truncated example of the data in such a file:

{"query": "What did ... 1952-2002 period?", "pos_doc": "Morning (2008) ... has changed little.", "neg_doc": "Even though ... sapiens.", "query_id": "q103151", "doc_id": "d14755"}
{"query": "What type of ... passions?", "pos_doc": "Burke was a leading ... upper classes.", "neg_doc": "Writing to a friend ... Government.", "query_id": "q77959", "doc_id": "d11263"}
{"query": "Since 1999, ... progressed at?", "pos_doc": "Commercial solar water ... as of 2007.", "neg_doc": "The potential solar ... acquire.", "query_id": "q16545", "doc_id": "d1883"}

Each JSON object should contain the following fields: query, pos_doc, neg_doc, query_id, and doc_id. The query_id and doc_id fields can be any alphanumeric strings that uniquely identify the query string and the pos_doc string, respectively.
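
As an illustration, records in this format can be written with a few lines of Python. The example record below is made up, and train.jsonl is simply the file name used by the training command later in this section:

import json

# Hypothetical example record; real data would use your own queries and documents.
records = [
    {
        "query": "What is the boiling point of water at sea level?",
        "pos_doc": "At sea level, water boils at 100 degrees Celsius.",
        "neg_doc": "Water covers about 71 percent of the Earth's surface.",
        "query_id": "q1",
        "doc_id": "d1",
    },
]

# One JSON object per line, as expected for a JSONL training file.
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")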

During training, the GPT Embedding model uses LoRA (by default) to learn embeddings for queries and documents: the training objective maximizes the similarity between each query and its pos_doc while minimizing the similarity between the query and its neg_doc. This approach enables fine-tuning of LLMs such as Mistral 7B with a relatively small number of trainable parameters.
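
Conceptually, the objective resembles a standard contrastive loss over pooled query and document embeddings. The sketch below is only illustrative (it is not NeMo's exact implementation, and the temperature value is an arbitrary choice), assuming the pooled embeddings have already been computed:

import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.02):
    # q_emb, pos_emb, neg_emb: [batch, hidden] pooled embeddings (illustrative shapes).
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)
    # Cosine similarity of each query to its positive and its negative document.
    pos_sim = (q * pos).sum(-1) / temperature          # [batch]
    neg_sim = (q * neg).sum(-1) / temperature          # [batch]
    logits = torch.stack([pos_sim, neg_sim], dim=-1)   # [batch, 2]
    # Cross-entropy pushes query-to-pos_doc similarity up and query-to-neg_doc similarity down.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)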

The following example command launches a training job:

python3 /NeMo/examples/nlp/information_retrieval/megatron_gpt_embedding_finetuning.py \
   exp_manager.exp_dir="PATH_TO_SAVE_LORA_WEIGHTS" \
   model.global_batch_size=4 \
   model.micro_batch_size=4 \
   trainer.devices=1 \
   trainer.num_nodes=1 \
   trainer.max_steps=20 \
   model.restore_from_path="PATH_TO_BASE_NEMO_MODEL" \
   model.peft.lora_tuning.adapter_dim=16 \
   model.data.train_ds.file_names=["train.jsonl"]

A few of these arguments deserve explanation:

  • model.global_batch_size: the best choice is data dependent; typical values are in the range of 32 to 128.

  • model.micro_batch_size: limited by GPU memory; values of 2 to 8 are reasonable.

  • trainer.devices: how many GPUs to use per node during training.

  • trainer.num_nodes: how many nodes to use if a multi-node cluster is available.

  • trainer.max_steps: how many training steps to run.

  • model.peft.lora_tuning.adapter_dim: the low-rank size for the LoRA weights.

The full list of run arguments is configurable in /examples/nlp/information_retrieval/conf/megatron_gpt_embedder_tuning_config.yaml. By default, the trained model file is generated in PATH_TO_SAVE_LORA_WEIGHTS/megatron_gpt_peft_lora_tuning/checkpoints/, typically with the extension .nemo.

Inference using a GPT Embedding Model

Once trained, the GPT Embedding Model can be used to generate embeddings for queries and corpus documents. You can launch inference using the following command:

python3 /NeMo/examples/nlp/information_retrieval/megatron_gpt_embedding_generate.py \
   model.global_batch_size=4 \
   model.micro_batch_size=4 \
   trainer.devices=1 \
   trainer.num_nodes=1 \
   model.restore_from_path="PATH_TO_BASE_NEMO_MODEL" \
   model.peft.restore_from_path="PATH_TO_SAVE_LORA_WEIGHTS/megatron_gpt_peft_lora_tuning/checkpoints/megatron_gpt_peft_lora_tuning.nemo" \
   model.data.test_ds.query_file_names=["test_queries.jsonl"] \
   model.data.test_ds.doc_file_names=["test_docs.jsonl"] \
   model.data.test_ds.write_embeddings_to_file=True \
   model.data.test_ds.output_file_path_prefix="PATH_TO_SAVE_EMBEDDINGS"

Note that model.restore_from_path must point to the same base model that was used at training time.

The contents of test_queries.jsonl are expected to be in the following format:

{"query": "What do ... quantities?","query_id": "q11600", "doc_id": "d1172"}
{"query": "What are ... subsectors?", "query_id": "q5831", "doc_id": "d577"}
{"query": "Which article ... Government?", "query_id": "q3037", "doc_id": "d336"}

In this context, the doc_id field should contain the identifier of the document or passage that holds the correct answer for the given query. Note that in inference mode, query-document pairs are not required.

The contents of test_docs.jsonl are expected to be in the following format:

{"pos_doc": "Hormones ... vitamin D.", "doc_id": "d823"}
{"pos_doc": "Historically, Victoria ... October 2016.", "doc_id": "d159"}
{"pos_doc": "Exceptional examples ... Warsaw.", "doc_id": "d1084"}

Once again, we show three examples from each file. Typically, test_docs.jsonl will contain more items than test_queries.jsonl contains queries.

The inference command will generate two folders:

  • PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries

  • PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_docs

The X in the folder name consumed_samplesX represents the number of batches consumed. This is not crucial during testing but is useful during training, as explained in the next section. First, let's examine the test_queries folder.

$> ls PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries
query.ids  query.npy
$> head -n3 PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries/query.ids
q11600
q5831
q3037

query.npy is a pickled NumPy array whose rows are the query embeddings, and the query.ids text file lists the ID of each embedding in the same order.
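
For example, the two files can be read back together as in the sketch below (the path is the same placeholder used above):

import numpy as np

# Load the embeddings; allow_pickle is required because the array is pickled.
query_embs = np.load(
    "PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries/query.npy", allow_pickle=True
)

# Row i of query_embs corresponds to line i of query.ids.
with open("PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries/query.ids") as f:
    query_ids = [line.strip() for line in f]

print(query_embs.shape, query_ids[:3])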

Similarly, let’s look into the test_docs folder:

$> ls PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_docs/
doc.ids  doc.npy
$> head -n3 PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_docs/doc.ids
d823
d159
d1084

The test_docs folder has a similar structure to test_queries, but it contains the IDs and embeddings of the documents from the test_docs.jsonl file. With this setup, it is possible to evaluate retrieval performance using metrics such as MRR or NDCG.
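
As an illustration, the following sketch computes MRR from the saved embeddings, assuming cosine similarity for scoring and taking the gold doc_id for each query from test_queries.jsonl (paths and file names are the placeholders used above):

import json
import numpy as np

def load_embeddings(folder, prefix):
    # Rows of <prefix>.npy line up with the IDs in <prefix>.ids.
    embs = np.load(f"{folder}/{prefix}.npy", allow_pickle=True)
    with open(f"{folder}/{prefix}.ids") as f:
        ids = [line.strip() for line in f]
    return np.asarray(embs, dtype=np.float32), ids

q_embs, q_ids = load_embeddings("PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_queries", "query")
d_embs, d_ids = load_embeddings("PATH_TO_SAVE_EMBEDDINGS/consumed_samplesX/test_docs", "doc")

# Gold doc_id for each query, taken from test_queries.jsonl.
gold = {}
with open("test_queries.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        gold[rec["query_id"]] = rec["doc_id"]

# Cosine similarity between every query and every document.
q_norm = q_embs / np.linalg.norm(q_embs, axis=1, keepdims=True)
d_norm = d_embs / np.linalg.norm(d_embs, axis=1, keepdims=True)
scores = q_norm @ d_norm.T                      # [num_queries, num_docs]

# Mean Reciprocal Rank: average of 1 / rank of the gold document over all queries.
reciprocal_ranks = []
for i, qid in enumerate(q_ids):
    ranked_doc_ids = [d_ids[j] for j in np.argsort(-scores[i])]
    rank = ranked_doc_ids.index(gold[qid]) + 1
    reciprocal_ranks.append(1.0 / rank)

print("MRR:", float(np.mean(reciprocal_ranks)))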