Retrieval Dataset (Embedding Fine-tuning)

NeMo Automodel fine-tunes retrieval models on a retrieval-style dataset: each training example is a query paired with one positive document and one or more negative documents.

This dataset is used by the retrieval recipes (see examples/retrieval/bi_encoder/ and examples/retrieval/cross_encoder/) together with the BiEncoderCollator.

What the Bi-Encoder Consumes

The dataset factory nemo_automodel.components.datasets.llm.make_retrieval_dataset returns a Hugging Face datasets.Dataset. At runtime it transforms each raw record into the training-time schema:

question: query string
doc_text: list of document texts in the order [positive, negative_1, negative_2, ...]
doc_image: list of images (or empty strings), aligned with doc_text
query_instruction / passage_instruction: optional, used when use_dataset_instruction: true and the corpus provides instructions via metadata
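As a rough illustration of the training-time schema above, the sketch below builds one text-only record by hand. The helper name build_training_record is hypothetical and not part of NeMo Automodel; it only shows the [positive, negative_1, ...] ordering and how doc_image stays aligned with doc_text.

```python
from typing import Any


def build_training_record(
    question: str, positive: str, negatives: list[str]
) -> dict[str, Any]:
    """Hypothetical helper: lay out one example in the training-time schema."""
    # The positive document always comes first, followed by the negatives.
    doc_text = [positive, *negatives]
    return {
        "question": question,
        "doc_text": doc_text,
        # Text-only case: one empty string per document keeps the lists aligned.
        "doc_image": [""] * len(doc_text),
    }


record = build_training_record(
    "Explain transformers",
    "Transformers are a type of neural network...",
    ["RNNs are...", "CNNs are..."],
)
print(record["doc_text"][0])  # the positive document
```

In practice the dataset factory produces these fields for you; the sketch is only meant to make the field layout concrete.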

Supported Input Formats

NeMo Automodel supports two input schemas:

Corpus ID-Based JSON (Merlin/NeMo-Retriever Style)

This is the format used by NeMo Retriever pipelines, where documents live in a separate corpus and training examples reference documents by ID.

Training file example (single JSON):

{
  "corpus": [
    { "path": "/abs/path/to/wiki_corpus" }
  ],
  "data": [
    {
      "question_id": "q_001",
      "question": "Explain transformers",
      "corpus_id": "wiki_corpus",
      "pos_doc": [{ "id": "d_123" }],
      "neg_doc": [{ "id": "d_456" }, "d_789"]
    }
  ]
}

Corpus requirements

Each corpus directory must contain a merlin_metadata.json file.

Minimal example:

{ "class": "TextQADataset", "corpus_id": "wiki_corpus" }

pos_doc and neg_doc can be lists of {"id": ...} dicts or raw IDs (they are normalized internally).
If you set use_dataset_instruction: true, optional fields like query_instruction and passage_instruction in merlin_metadata.json are surfaced to the collator.
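The library normalizes pos_doc and neg_doc references internally, and its exact logic is not shown here. The sketch below is one plausible way to flatten a mixed list of {"id": ...} dicts and raw IDs into plain ID strings; normalize_doc_ids is a hypothetical name used only for illustration.

```python
from typing import Any


def normalize_doc_ids(refs: list[Any]) -> list[str]:
    """Hypothetical helper: accept {"id": ...} dicts or raw IDs, return ID strings."""
    ids = []
    for ref in refs:
        if isinstance(ref, dict):
            # Dict form: the document ID lives under the "id" key.
            ids.append(str(ref["id"]))
        else:
            # Raw form: the value itself is the document ID.
            ids.append(str(ref))
    return ids


# Mirrors the mixed neg_doc list from the training file example.
neg_ids = normalize_doc_ids([{"id": "d_456"}, "d_789"])
print(neg_ids)
```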

Inline-Text JSONL (No Corpus Required)

This is convenient for custom fine-tuning pipelines where the documents are included inline.

JSONL example (one example per line):

{"query":"Explain transformers","pos_doc":"Transformers are a type of neural network...","neg_doc":["RNNs are...","CNNs are..."]}
{"query":"What is Python?","pos_doc":["A programming language."],"neg_doc":"A snake."}

The query text is read from query; question is accepted as an alias.
pos_doc and neg_doc can be either:
- strings (interpreted as document text), or
- lists of strings, or
- dicts with at least a text field (optionally image and nr_ocr) for multimodal use cases.
If corpus_id is not provided, it defaults to __inline__.
use_dataset_instruction: true has no effect for pure inline records (instructions come from corpus metadata).
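To make the inline rules above concrete, here is a minimal sketch of how one JSONL line could be parsed into a uniform shape. It is not the library's parser: parse_inline_line and to_doc_list are hypothetical names, and accepting a list of dicts is an assumption beyond what the rules state explicitly.

```python
import json
from typing import Any


def to_doc_list(value: Any) -> list[dict[str, Any]]:
    """Hypothetical: coerce a string, dict, or list into a list of {"text": ...} dicts."""
    items = value if isinstance(value, list) else [value]
    docs = []
    for item in items:
        if isinstance(item, str):
            # A bare string is interpreted as the document text.
            docs.append({"text": item})
        elif isinstance(item, dict) and "text" in item:
            # Dict form may also carry optional keys such as image or nr_ocr.
            docs.append(dict(item))
        else:
            raise ValueError(f"unsupported document entry: {item!r}")
    return docs


def parse_inline_line(line: str) -> dict[str, Any]:
    """Hypothetical: parse one inline-text JSONL line."""
    raw = json.loads(line)
    # query is the primary key; question is accepted as an alias.
    query = raw.get("query", raw.get("question"))
    if query is None:
        raise ValueError("record needs a query (or question) field")
    return {
        "question": query,
        "corpus_id": raw.get("corpus_id", "__inline__"),
        "pos_doc": to_doc_list(raw["pos_doc"]),
        "neg_doc": to_doc_list(raw.get("neg_doc", [])),
    }


line = '{"query":"What is Python?","pos_doc":["A programming language."],"neg_doc":"A snake."}'
parsed = parse_inline_line(line)
print(parsed["corpus_id"])
```

Coercing every accepted form into a list of dicts up front means later steps only have to handle one shape.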

YAML Usage (Dataset + Collator)

Use the dataset factory plus the bi-encoder collator:

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  dataset:
    _target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
    data_dir_list:
      - /abs/path/to/train.jsonl   # or train.json (corpus-id format)
    data_type: train
    n_passages: 5                 # 1 positive + 4 negatives
    do_shuffle: true
    use_dataset_instruction: false
  collate_fn:
    _target_: nemo_automodel.components.datasets.llm.BiEncoderCollator
    q_max_len: 512
    p_max_len: 512
    query_prefix: "query:"
    passage_prefix: "passage:"
    pad_to_multiple_of: 8
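The n_passages arithmetic in the config (1 positive + n_passages - 1 negatives) can be sketched as follows. How the library actually selects or samples negatives is not described here, so this example simply assumes the first n_passages - 1 negatives are taken; select_passages is a hypothetical name.

```python
def select_passages(
    pos_docs: list[str], neg_docs: list[str], n_passages: int
) -> list[str]:
    """Hypothetical: one positive followed by up to n_passages - 1 negatives."""
    if n_passages < 1:
        raise ValueError("n_passages must be at least 1")
    if not pos_docs:
        raise ValueError("pos_doc must be non-empty")
    n_negatives = n_passages - 1
    # Assumption: take the leading negatives; the real selection may differ.
    return [pos_docs[0], *neg_docs[:n_negatives]]


passages = select_passages(["p"], ["n1", "n2", "n3", "n4", "n5"], n_passages=5)
print(len(passages))  # 1 positive + 4 negatives
```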

Requirements

pos_doc must be non-empty.
If training requests negatives (e.g., n_passages > 1), neg_doc must contain at least one document.
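A quick pre-flight check for these two requirements might look like the sketch below. It is an illustrative validator, not something shipped with NeMo Automodel; validate_example is a hypothetical name, and the example dict uses the raw pos_doc / neg_doc keys from the input formats.

```python
from typing import Any


def validate_example(example: dict[str, Any], n_passages: int) -> list[str]:
    """Hypothetical: return a list of requirement violations (empty if valid)."""
    errors = []
    # Requirement 1: every example needs at least one positive document.
    if not example.get("pos_doc"):
        errors.append("pos_doc must be non-empty")
    # Requirement 2: negatives are only required when training asks for them.
    if n_passages > 1 and not example.get("neg_doc"):
        errors.append("neg_doc must contain at least one document when n_passages > 1")
    return errors


ok = validate_example({"pos_doc": ["a"], "neg_doc": ["b"]}, n_passages=5)
bad = validate_example({"pos_doc": [], "neg_doc": []}, n_passages=5)
print(ok, bad)
```

Running a check like this over the training file before launching a job surfaces malformed records early.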