# Biencoder Retrieval Dataset (Embedding Fine-tuning)
NeMo Automodel supports biencoder/embedding model fine-tuning using a retrieval-style dataset: each training example is a query paired with one positive document and one or more negative documents.
This dataset is used by the biencoder recipes (see `examples/biencoder/`) together with the `RetrievalBiencoderCollator`.
## What the Biencoder Consumes
The dataset factory `nemo_automodel.components.datasets.llm.make_retrieval_dataset` returns a Hugging Face `datasets.Dataset`. At runtime it transforms each raw record into the training-time schema:
- `question`: query string
- `doc_text`: list of document texts in the order `[positive, negative_1, negative_2, ...]`
- `doc_image`: list of images (or empty strings), aligned with `doc_text`
- `query_instruction` / `passage_instruction`: optional, used when `use_dataset_instruction: true` and the corpus provides instructions via metadata
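Concretely, each transformed example is a plain record with those fields. The sketch below is a hypothetical illustration of the schema (field names from above; the values are made up, not produced by the library):

```python
# Hypothetical transformed training example (values are illustrative only).
example = {
    "question": "Explain transformers",
    "doc_text": [
        "Transformers are a type of neural network...",  # positive is always index 0
        "RNNs are...",                                   # negatives follow
        "CNNs are...",
    ],
    "doc_image": ["", "", ""],  # aligned with doc_text; empty strings for text-only data
}

# The two document lists must stay aligned one-to-one.
assert len(example["doc_image"]) == len(example["doc_text"])
print(example["doc_text"][0])  # the positive document
```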
## Supported Input Formats
NeMo Automodel supports two input schemas:
### Corpus ID-Based JSON (Merlin/NeMo-Retriever Style)
This is the format used by NeMo retriever pipelines where documents live in a separate corpus and training examples reference documents by ID.
Training file example (single JSON):
```json
{
  "corpus": [
    { "path": "/abs/path/to/wiki_corpus" }
  ],
  "data": [
    {
      "question_id": "q_001",
      "question": "Explain transformers",
      "corpus_id": "wiki_corpus",
      "pos_doc": [{ "id": "d_123" }],
      "neg_doc": [{ "id": "d_456" }, "d_789"]
    }
  ]
}
```
**Corpus requirements**

Each corpus directory must contain a `merlin_metadata.json` file. Minimal example:

```json
{ "class": "TextQADataset", "corpus_id": "wiki_corpus" }
```
> **Note**
>
> - `pos_doc` and `neg_doc` can be lists of `{"id": ...}` dicts or raw IDs (they are normalized internally).
> - If you set `use_dataset_instruction: true`, optional fields like `query_instruction` and `passage_instruction` in `merlin_metadata.json` are surfaced to the collator.
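The normalization described in the note can be sketched as a small helper (a hypothetical function name, not the library's actual code):

```python
def normalize_doc_ids(entries):
    """Accept a mix of {'id': ...} dicts and raw IDs; return a flat list of IDs."""
    return [e["id"] if isinstance(e, dict) else e for e in entries]

# Mirrors the "neg_doc" field in the training-file example above,
# where a dict entry and a raw ID are mixed in one list.
print(normalize_doc_ids([{"id": "d_456"}, "d_789"]))  # ['d_456', 'd_789']
```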
### Inline-Text JSONL (No Corpus Required)
This is convenient for custom fine-tuning pipelines where the documents are included inline.
JSONL example (one example per line):
```json
{"query": "Explain transformers", "pos_doc": "Transformers are a type of neural network...", "neg_doc": ["RNNs are...", "CNNs are..."]}
{"query": "What is Python?", "pos_doc": ["A programming language."], "neg_doc": "A snake."}
```
> **Note**
>
> - `query` is the canonical key (`question` is also accepted as an alias).
> - `pos_doc` and `neg_doc` can be strings (interpreted as document text), lists of strings, or dicts with at least `text` (optionally `image`, `nr_ocr`) for multimodal use cases.
> - If `corpus_id` is not provided, it defaults to `__inline__`.
> - `use_dataset_instruction: true` has no effect for pure inline records (instructions come from corpus metadata).
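The accepted `pos_doc`/`neg_doc` shapes can be illustrated with a hypothetical coercion helper (a sketch of the behavior described above, not the library's implementation):

```python
def coerce_docs(value):
    """Coerce a string, dict, or list value into a list of {'text', 'image'} dicts."""
    if isinstance(value, (str, dict)):
        value = [value]  # single string or dict becomes a one-element list
    out = []
    for item in value:
        if isinstance(item, str):
            out.append({"text": item, "image": ""})
        else:  # dict with at least "text"; "image" (and "nr_ocr") are optional
            out.append({"text": item["text"], "image": item.get("image", "")})
    return out

# All three input shapes normalize to the same structure.
print(coerce_docs("A snake."))
# [{'text': 'A snake.', 'image': ''}]
print(coerce_docs([{"text": "A programming language.", "image": "py.png"}]))
# [{'text': 'A programming language.', 'image': 'py.png'}]
```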
## YAML Usage (Dataset + Collator)
Use the dataset factory plus the biencoder collator:
```yaml
dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  dataset:
    _target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
    data_dir_list:
      - /abs/path/to/train.jsonl  # or train.json (corpus-id format)
    data_type: train
    train_n_passages: 5  # 1 positive + 4 negatives
    do_shuffle: true
    use_dataset_instruction: false
  collate_fn:
    _target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
    q_max_len: 512
    p_max_len: 512
    query_prefix: "query:"
    passage_prefix: "passage:"
    pad_to_multiple_of: 8
```
## Requirements
- `pos_doc` must be non-empty.
- If training requests negatives (e.g., `train_n_passages > 1`), `neg_doc` must contain at least one document (the loader will cycle negatives if you provide fewer than needed).
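The negative-cycling behavior mentioned above can be sketched as follows (a simplified illustration under the assumption that the loader repeats the provided negatives in order; not the library's actual code):

```python
from itertools import cycle, islice

def pick_negatives(neg_docs, train_n_passages):
    """Return train_n_passages - 1 negatives, repeating them if too few are given."""
    n_needed = train_n_passages - 1
    return list(islice(cycle(neg_docs), n_needed))

# train_n_passages: 5 requires 4 negatives; only 2 are provided, so they repeat.
print(pick_negatives(["n1", "n2"], 5))  # ['n1', 'n2', 'n1', 'n2']
```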