Retrieval Dataset (Embedding Fine-Tuning)
NeMo AutoModel supports retrieval model fine-tuning using a retrieval-style dataset: each training example is a query paired with one positive document and one or more negative documents.
The retrieval recipes use this dataset with the BiEncoderCollator. Example implementations are in examples/retrieval/bi_encoder/ and examples/retrieval/cross_encoder/.
What the Bi-Encoder Consumes
The dataset factory nemo_automodel.components.datasets.llm.make_retrieval_dataset returns a Hugging Face datasets.Dataset. At runtime, it transforms each raw record into the training-time schema:
- question: query string
- doc_text: list of document texts in the order [positive, negative_1, negative_2, ...]
- doc_image: list of images (or empty strings), aligned with doc_text
- query_instruction / passage_instruction: optional, used when use_dataset_instruction: true and the corpus provides instructions through metadata
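To make the training-time schema concrete, here is a minimal Python sketch of one transformed record and of how the positive-first ordering can be read back. The sample values and the helper split_passages are hypothetical and exist only for illustration; they are not part of the library.

```python
# Illustrative record in the training-time schema (all values are made up).
example = {
    "question": "What is a bi-encoder?",
    # The positive document comes first, followed by the negatives.
    "doc_text": [
        "A bi-encoder embeds queries and documents separately.",
        "Bananas are rich in potassium.",
        "The Eiffel Tower is in Paris.",
    ],
    # One entry per document; an empty string stands for "no image".
    "doc_image": ["", "", ""],
}


def split_passages(record):
    """Hypothetical helper: return (positive, negatives) from doc_text."""
    texts = record["doc_text"]
    if len(record["doc_image"]) != len(texts):
        raise ValueError("doc_image must be aligned with doc_text")
    return texts[0], texts[1:]


positive, negatives = split_passages(example)
```

Because the ordering is positional, any code that reorders doc_text has to reorder doc_image in the same way.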
Supported Input Formats
NeMo AutoModel supports two input schemas:
Corpus ID-Based JSON (Merlin/NeMo-Retriever Style)
This is the format used by NeMo retriever pipelines where documents live in a separate corpus and training examples reference documents by ID.
Training File Example (Single JSON)
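As a hedged sketch, the Python below builds a training file of this kind and writes it to disk. The fields question, corpus_id, pos_doc, and neg_doc are the ones described in this section. The top-level "corpus" and "data" keys, the "question_id" field, and every path and ID value are assumptions made for illustration and should be checked against the examples in the repository.

```python
import json
import tempfile
from pathlib import Path

# ASSUMPTION: the "corpus"/"data" wrapper and "question_id" are illustrative,
# as are all paths and document IDs below.
training_file = {
    "corpus": [{"path": "/abs/path/to/wiki_corpus"}],
    "data": [
        {
            "question_id": "q_001",
            "question": "Explain transformers",
            "corpus_id": "wiki_corpus",
            # Documents are referenced by ID, either as {"id": ...} dicts
            # or as raw IDs.
            "pos_doc": [{"id": "d_123"}],
            "neg_doc": [{"id": "d_456"}, "d_789"],
        }
    ],
}

# Write the file to a temporary directory and read it back.
out_path = Path(tempfile.mkdtemp()) / "train.json"
out_path.write_text(json.dumps(training_file, indent=2), encoding="utf-8")
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```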
Corpus Requirements
Each corpus directory must contain a merlin_metadata.json file.
Minimal example:
- pos_doc and neg_doc can be lists of {"id": ...} dicts or raw IDs (they are normalized internally).
- cycle_positive_docs defaults to false, which always uses the first positive document. When a training record has multiple positive documents, set cycle_positive_docs: true to rotate through them deterministically by epoch order (epoch % len(pos_doc)).
- If you set use_dataset_instruction: true, optional fields like query_instruction and passage_instruction in merlin_metadata.json are surfaced to the collator.
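The ID normalization and the positive-document rotation described above can be sketched in a few lines of Python. This is an illustration of the stated rule (epoch % len(pos_doc)), not the library's implementation; the function names normalize_doc_ids and select_positive are hypothetical.

```python
def normalize_doc_ids(docs):
    """Accept {"id": ...} dicts or raw IDs and return a flat list of IDs."""
    return [d["id"] if isinstance(d, dict) else d for d in docs]


def select_positive(pos_doc, epoch, cycle_positive_docs=False):
    """Pick the positive document ID to use for a given epoch."""
    ids = normalize_doc_ids(pos_doc)
    if not ids:
        raise ValueError("pos_doc must be non-empty")
    if not cycle_positive_docs:
        # Default behavior: always use the first positive document.
        return ids[0]
    # Deterministic rotation through the positives by epoch.
    return ids[epoch % len(ids)]


pos = [{"id": "d_1"}, "d_2", {"id": "d_3"}]
picks_default = [select_positive(pos, e) for e in range(4)]
picks_cycled = [select_positive(pos, e, cycle_positive_docs=True) for e in range(4)]
```

With three positives, the cycled selection returns to the first document at epoch 3, while the default selection never leaves it.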
Inline-Text JSONL (No Corpus Required)
This is convenient for custom fine-tuning pipelines where the documents are included inline.
The format uses one JSON object per line:
- The query field is named query (question is also accepted as an alias).
- pos_doc and neg_doc can be any of the following:
  - strings (interpreted as document text), or
  - lists of strings, or
  - dicts with at least text (optionally image, nr_ocr) for multimodal use cases.
- If corpus_id is not provided, it defaults to __inline__.
- use_dataset_instruction: true has no effect for pure inline records (instructions come from corpus metadata).
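The following Python sketch shows what inline records might look like and one way to normalize them according to the rules above. The sample texts and the helpers to_doc_list and parse_line are hypothetical. Accepting both a single dict and a list of dicts is an assumption, since the text does not say which of the two is meant.

```python
import json

# Two illustrative inline records, one per JSONL line (values are made up).
records = [
    {
        "query": "What is a bi-encoder?",
        "pos_doc": "A bi-encoder embeds queries and documents separately.",
        "neg_doc": ["Bananas are rich in potassium.", "Paris is in France."],
    },
    {
        "question": "What is a cross-encoder?",
        "corpus_id": "my_corpus",
        "pos_doc": [{"text": "A cross-encoder scores a query-document pair jointly."}],
        "neg_doc": [{"text": "The Nile flows north."}],
    },
]
jsonl = "\n".join(json.dumps(r) for r in records)


def to_doc_list(value):
    """Normalize a string, a dict, or a list of either into a list of dicts."""
    items = value if isinstance(value, list) else [value]
    docs = []
    for item in items:
        if isinstance(item, str):
            docs.append({"text": item})  # Strings are document text.
        elif isinstance(item, dict) and "text" in item:
            docs.append(item)
        else:
            raise ValueError(f"unsupported document entry: {item!r}")
    return docs


def parse_line(line):
    """Parse one JSONL line, applying the alias and the default corpus_id."""
    raw = json.loads(line)
    question = raw["query"] if "query" in raw else raw["question"]
    return {
        "question": question,
        "corpus_id": raw.get("corpus_id", "__inline__"),
        "pos_doc": to_doc_list(raw["pos_doc"]),
        "neg_doc": to_doc_list(raw.get("neg_doc", [])),
    }


parsed = [parse_line(line) for line in jsonl.splitlines()]
```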
Configure the Dataset and Collator in YAML
Use the dataset factory plus the bi-encoder collator:
Requirements
- pos_doc must be non-empty.
- If training requests negatives (e.g., n_passages > 1), neg_doc must contain at least one document.
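A pre-flight check for these two requirements could look like the sketch below. The function validate_record is hypothetical and only mirrors the rules as stated here; the library may report such problems differently.

```python
def validate_record(record, n_passages):
    """Raise ValueError if a record violates the requirements above."""
    if not record.get("pos_doc"):
        raise ValueError("pos_doc must be non-empty")
    # Negatives are only required when more than one passage is requested.
    if n_passages > 1 and not record.get("neg_doc"):
        raise ValueError("neg_doc must contain at least one document when n_passages > 1")


ok = {"pos_doc": ["positive text"], "neg_doc": ["negative text"]}
no_negs = {"pos_doc": ["positive text"], "neg_doc": []}

validate_record(ok, n_passages=2)       # Passes: has a positive and a negative.
validate_record(no_negs, n_passages=1)  # Passes: no negatives requested.

try:
    validate_record(no_negs, n_passages=2)
    rejected = False
except ValueError:
    rejected = True
```

Running a check like this over the training file before launching a job surfaces malformed records early instead of partway through an epoch.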