Retrieval Dataset (Embedding Fine-tuning)

NeMo Automodel supports retrieval model fine-tuning using a retrieval-style dataset: each training example is a query paired with one positive document and one or more negative documents.

This dataset is used by the retrieval recipes (see examples/retrieval/bi_encoder/ and examples/retrieval/cross_encoder/) together with the BiEncoderCollator.

What the Bi-Encoder Consumes

The dataset factory nemo_automodel.components.datasets.llm.make_retrieval_dataset returns a Hugging Face datasets.Dataset. At runtime it transforms each raw record into the training-time schema:

  • question: query string
  • doc_text: list of document texts in the order [positive, negative_1, negative_2, ...]
  • doc_image: list of images (or empty strings), aligned with doc_text
  • query_instruction / passage_instruction: optional, used when use_dataset_instruction: true and the corpus provides instructions via metadata
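
To make the schema concrete, the following is a minimal sketch of how one text-only record could be laid out in the training-time schema. The helper to_training_example is hypothetical and is not part of NeMo Automodel; it only illustrates the [positive, negative_1, ...] ordering and the alignment of doc_image with doc_text.

```python
from typing import Dict, List


def to_training_example(question: str, positive: str, negatives: List[str]) -> Dict[str, object]:
    """Hypothetical helper: arrange one text-only record in the training-time schema."""
    doc_text = [positive] + list(negatives)  # positive first, then the negatives
    return {
        "question": question,
        "doc_text": doc_text,
        # Text-only documents carry an empty string in place of an image.
        "doc_image": [""] * len(doc_text),
    }


example = to_training_example(
    "Explain transformers",
    "Transformers are a type of neural network...",
    ["RNNs are...", "CNNs are..."],
)
print(example["doc_text"][0])  # the positive document comes first
```

The optional instruction fields are omitted here because they only appear when the corpus metadata provides them.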

Supported Input Formats

NeMo Automodel supports two input schemas:

Corpus ID-Based JSON (Merlin/NeMo-Retriever Style)

This is the format used by NeMo retriever pipelines where documents live in a separate corpus and training examples reference documents by ID.

Training file example (single JSON):

{
  "corpus": [
    { "path": "/abs/path/to/wiki_corpus" }
  ],
  "data": [
    {
      "question_id": "q_001",
      "question": "Explain transformers",
      "corpus_id": "wiki_corpus",
      "pos_doc": [{ "id": "d_123" }],
      "neg_doc": [{ "id": "d_456" }, "d_789"]
    }
  ]
}

Corpus requirements

Each corpus directory must contain a merlin_metadata.json file.

Minimal example:

{ "class": "TextQADataset", "corpus_id": "wiki_corpus" }

  • pos_doc and neg_doc can be lists of {"id": ...} dicts or raw IDs (they are normalized internally).
  • If you set use_dataset_instruction: true, optional fields like query_instruction and passage_instruction in merlin_metadata.json are surfaced to the collator.
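
The ID normalization mentioned above can be pictured with a short sketch. The function normalize_doc_ids is a hypothetical name used for illustration; the library's internal normalization may differ in details such as type coercion or error handling.

```python
from typing import Any, List


def normalize_doc_ids(docs: List[Any]) -> List[str]:
    """Hypothetical helper: accept {"id": ...} dicts or raw IDs, return string IDs."""
    ids = []
    for doc in docs:
        if isinstance(doc, dict):
            ids.append(str(doc["id"]))  # dict form: {"id": "d_456"}
        else:
            ids.append(str(doc))  # raw form: "d_789"
    return ids


# Mixed forms, as in the neg_doc field of the training file example.
neg_ids = normalize_doc_ids([{"id": "d_456"}, "d_789"])
print(neg_ids)
```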

Inline-Text JSONL (No Corpus Required)

This is convenient for custom fine-tuning pipelines where the documents are included inline.

JSONL example (one example per line):

{"query":"Explain transformers","pos_doc":"Transformers are a type of neural network...","neg_doc":["RNNs are...","CNNs are..."]}
{"query":"What is Python?","pos_doc":["A programming language."],"neg_doc":"A snake."}

  • The query text is read from query; question is accepted as an alias.
  • pos_doc and neg_doc can be either:
    • strings (interpreted as document text), or
    • lists of strings, or
    • dicts with at least text (optionally image, nr_ocr) for multimodal use cases.
  • If corpus_id is not provided, it defaults to __inline__.
  • use_dataset_instruction: true has no effect for pure inline records (instructions come from corpus metadata).
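
The accepted shapes listed above can be illustrated with a small parsing sketch. The helper normalize_inline_docs is hypothetical, and treating a list of dicts as valid is an assumption; the sketch shows one way to coerce each pos_doc or neg_doc value into a uniform list, not how the library does it.

```python
import json
from typing import Any, Dict, List


def normalize_inline_docs(value: Any) -> List[Dict[str, Any]]:
    """Hypothetical helper: coerce a pos_doc/neg_doc value into a list of {"text": ...} dicts."""
    items = value if isinstance(value, list) else [value]
    docs = []
    for item in items:
        if isinstance(item, str):
            docs.append({"text": item})  # plain string is the document text
        elif isinstance(item, dict) and "text" in item:
            docs.append(dict(item))  # keep optional keys such as image, nr_ocr
        else:
            raise ValueError(f"unsupported document entry: {item!r}")
    return docs


line = '{"query":"What is Python?","pos_doc":["A programming language."],"neg_doc":"A snake."}'
record = json.loads(line)

query = record.get("query", record.get("question"))  # question is an alias
corpus_id = record.get("corpus_id", "__inline__")  # default for inline records
pos_docs = normalize_inline_docs(record["pos_doc"])
neg_docs = normalize_inline_docs(record["neg_doc"])
```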

YAML Usage (Dataset + Collator)

Use the dataset factory plus the bi-encoder collator:

dataloader:
  _target_: torchdata.stateful_dataloader.StatefulDataLoader
  dataset:
    _target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
    data_dir_list:
      - /abs/path/to/train.jsonl  # or train.json (corpus-id format)
    data_type: train
    n_passages: 5  # 1 positive + 4 negatives
    do_shuffle: true
    use_dataset_instruction: false
  collate_fn:
    _target_: nemo_automodel.components.datasets.llm.BiEncoderCollator
    q_max_len: 512
    p_max_len: 512
    query_prefix: "query:"
    passage_prefix: "passage:"
    pad_to_multiple_of: 8

Requirements

  • pos_doc must be non-empty.
  • If training requests negatives (e.g., n_passages > 1), neg_doc must contain at least one document.
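
These two requirements can be checked before training with a small pre-flight function. The sketch below is an assumption about how one might validate records on the user side; check_record is a hypothetical name, and the library may report these conditions differently.

```python
from typing import Any, Dict


def check_record(record: Dict[str, Any], n_passages: int) -> None:
    """Hypothetical pre-flight check mirroring the two requirements above."""
    if not record.get("pos_doc"):
        raise ValueError("pos_doc must be non-empty")
    if n_passages > 1 and not record.get("neg_doc"):
        raise ValueError("neg_doc must contain at least one document when n_passages > 1")


# A record with one positive and one negative passes silently.
check_record({"query": "q", "pos_doc": "p", "neg_doc": ["n"]}, n_passages=5)
```

Running such a check over every line of a JSONL file before launching a job surfaces malformed records early, instead of partway through an epoch.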