Retrieval Dataset (Embedding Fine-tuning)
NeMo Automodel supports retrieval model fine-tuning using a retrieval-style dataset: each training example is a query paired with one positive document and one or more negative documents.
This dataset is used by the retrieval recipes (see examples/retrieval/bi_encoder/ and examples/retrieval/cross_encoder/) together with the BiEncoderCollator.
What the Bi-Encoder Consumes
The dataset factory nemo_automodel.components.datasets.llm.make_retrieval_dataset returns a Hugging Face datasets.Dataset. At runtime it transforms each raw record into the training-time schema:
- question: query string
- doc_text: list of document texts in the order [positive, negative_1, negative_2, ...]
- doc_image: list of images (or empty strings), aligned with doc_text
- query_instruction / passage_instruction: optional, used when use_dataset_instruction: true and the corpus provides instructions via metadata
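To make the field layout concrete, here is a minimal sketch of how one query and its documents could be flattened into that schema. The helper name to_training_record and the per-document dict shape are hypothetical and chosen for illustration; this is not the library's own transform.

```python
from typing import Any


def to_training_record(
    query: str,
    positive: dict[str, Any],
    negatives: list[dict[str, Any]],
) -> dict[str, Any]:
    """Flatten one query and its documents into the training-time fields (hypothetical helper)."""
    # The positive document always comes first, followed by the negatives.
    docs = [positive, *negatives]
    return {
        "question": query,
        "doc_text": [doc["text"] for doc in docs],
        # An empty string stands in for a missing image so both lists stay aligned.
        "doc_image": [doc.get("image", "") for doc in docs],
    }


record = to_training_record(
    "what is dense retrieval?",
    {"text": "Dense retrieval encodes queries and documents as vectors."},
    [{"text": "Bananas are rich in potassium."}],
)
print(record["doc_text"])
```

Index 0 of doc_text and doc_image always refers to the positive document, which is why the order matters to whatever consumes the record.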
Supported Input Formats
NeMo Automodel supports two input schemas:
Corpus ID-Based JSON (Merlin/NeMo-Retriever Style)
This is the format used by NeMo retriever pipelines where documents live in a separate corpus and training examples reference documents by ID.
Training file example (single JSON):
Corpus requirements
Each corpus directory must contain a merlin_metadata.json file.
Minimal example:
- pos_doc and neg_doc can be lists of {"id": ...} dicts or raw IDs (they are normalized internally).
- If you set use_dataset_instruction: true, optional fields like query_instruction and passage_instruction in merlin_metadata.json are surfaced to the collator.
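The ID normalization mentioned above might look roughly like the following. This is a sketch under the assumption that each entry is either a dict with an "id" key or a bare ID; normalize_doc_ids is a hypothetical name, and the library's internal logic may differ.

```python
from typing import Any


def normalize_doc_ids(docs: list[Any]) -> list[Any]:
    """Return bare document IDs from a mix of {"id": ...} dicts and raw IDs (hypothetical helper)."""
    ids = []
    for doc in docs:
        if isinstance(doc, dict):
            # Dict form: pull the ID out of the "id" key.
            ids.append(doc["id"])
        else:
            # Raw form: the entry already is the ID.
            ids.append(doc)
    return ids


mixed = [{"id": "d_1"}, "d_2", {"id": "d_3"}]
print(normalize_doc_ids(mixed))
```

After this step, every reference has the same shape, so a single lookup against the corpus can resolve both spellings.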
Inline-Text JSONL (No Corpus Required)
This is convenient for custom fine-tuning pipelines where the documents are included inline.
JSONL example (one example per line):
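As a minimal sketch, inline records can be serialized from Python with the standard json module. Only field names described in this section are used; the record contents are made up for illustration.

```python
import json

# Illustrative records: the texts are invented, the field names come from this section.
records = [
    {
        "query": "what is dense retrieval?",
        "pos_doc": "Dense retrieval encodes queries and documents as vectors.",
        "neg_doc": [
            "Bananas are rich in potassium.",
            "The Nile flows through Egypt.",
        ],
    },
    {
        # "question" is used here in place of "query".
        "question": "capital of France?",
        "pos_doc": "Paris is the capital of France.",
        "neg_doc": "Berlin is the capital of Germany.",
    },
]

# JSONL means one compact JSON object per line, with no enclosing array.
jsonl_lines = [json.dumps(rec) for rec in records]
print("\n".join(jsonl_lines))
```

Writing "\n".join(jsonl_lines) to a file gives a training file in this inline format.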
- query is accepted (question is also accepted as an alias).
- pos_doc and neg_doc can be either:
  - strings (interpreted as document text), or
  - lists of strings, or
  - dicts with at least text (optionally image, nr_ocr) for multimodal use cases.
- If corpus_id is not provided, it defaults to __inline__.
- use_dataset_instruction: true has no effect for pure inline records (instructions come from corpus metadata).
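The rules above can be sketched as a small parser for one JSONL line. The function names and the normalized output shape are assumptions made for illustration, as is accepting a list of dicts; the actual implementation may behave differently.

```python
import json
from typing import Any


def normalize_docs(value: Any) -> list[dict[str, Any]]:
    """Coerce a string, a dict, or a list of either into a list of document dicts."""
    items = value if isinstance(value, list) else [value]
    docs = []
    for item in items:
        if isinstance(item, str):
            # A bare string is interpreted as the document text.
            docs.append({"text": item, "image": "", "nr_ocr": ""})
        else:
            # Dict form: "text" is required, "image" and "nr_ocr" are optional.
            docs.append({
                "text": item["text"],
                "image": item.get("image", ""),
                "nr_ocr": item.get("nr_ocr", ""),
            })
    return docs


def parse_inline_line(line: str) -> dict[str, Any]:
    """Parse one inline JSONL line (hypothetical helper, not the library's code)."""
    raw = json.loads(line)
    # "question" is accepted as an alias for "query".
    query = raw["query"] if "query" in raw else raw["question"]
    return {
        "question": query,
        # A missing corpus_id falls back to the inline sentinel.
        "corpus_id": raw.get("corpus_id", "__inline__"),
        "pos_doc": normalize_docs(raw["pos_doc"]),
        "neg_doc": normalize_docs(raw.get("neg_doc", [])),
    }


example = parse_inline_line(
    '{"question": "capital of France?", '
    '"pos_doc": "Paris is the capital of France.", '
    '"neg_doc": ["Berlin is the capital of Germany."]}'
)
print(example["corpus_id"], example["pos_doc"][0]["text"])
```

Normalizing every accepted spelling to one list-of-dicts shape keeps the downstream code free of type checks.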
YAML Usage (Dataset + Collator)
Use the dataset factory plus the bi-encoder collator:
Requirements
- pos_doc must be non-empty.
- If training requests negatives (e.g., n_passages > 1), neg_doc must contain at least one document.
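A pre-flight check along these lines can catch bad records before training starts. This is a sketch only: check_record is a hypothetical helper, and the error messages are not the ones the library emits.

```python
from typing import Any


def check_record(record: dict[str, Any], n_passages: int) -> None:
    """Raise ValueError if a raw record violates the two requirements above."""
    # A missing key, an empty list, and an empty string all count as "empty" here.
    if not record.get("pos_doc"):
        raise ValueError("pos_doc must be non-empty")
    # Negatives are only required when more than one passage per query is requested.
    if n_passages > 1 and not record.get("neg_doc"):
        raise ValueError("neg_doc must contain at least one document when n_passages > 1")


good = {"query": "q", "pos_doc": ["relevant text"], "neg_doc": ["unrelated text"]}
check_record(good, n_passages=2)  # passes silently

no_negatives = {"query": "q", "pos_doc": ["relevant text"], "neg_doc": []}
check_record(no_negatives, n_passages=1)  # also passes: no negatives requested
```

Running such a check over the whole file up front reports data problems immediately rather than partway through a training run.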