nemo_automodel.components.datasets.llm.retrieval_dataset_inline
Module Contents
Functions
Data
API
Create transform function with specified number of negative documents.
Create transform function with specified number of negative documents.
Transform function to convert from raw format to cross-encoder training format.
Load a JSON file, falling back to JSONL (one JSON object per line).
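The JSON-then-JSONL fallback described above can be sketched as follows. This is a minimal illustration using only the standard library, not the module's actual code; the helper name load_json_or_jsonl is hypothetical.

```python
import json
from pathlib import Path


def load_json_or_jsonl(path):
    """Hypothetical helper: parse `path` as JSON, else as JSONL."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        # First attempt: the whole file is a single JSON document.
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Note that in this sketch a JSONL file containing exactly one line is also valid JSON, so it comes back as a single object rather than a one-element list.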
Normalize an inline doc (text/image provided) into a canonical dict shape.
Resolve a doc reference into an example dict with keys: text, image, nr_ocr.
Supported doc forms:
str: interpreted as inline document text
dict: must include text (optionally image, nr_ocr)
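As an illustration of the two supported doc forms, the sketch below normalizes either one into a dict with the keys text, image, and nr_ocr. The function name resolve_doc and the empty-string defaults for missing fields are assumptions made for this example, not the module's confirmed behavior.

```python
def resolve_doc(doc):
    """Hypothetical helper: turn a str or dict doc into {text, image, nr_ocr}."""
    if isinstance(doc, str):
        # A bare string is treated as the inline document text.
        return {"text": doc, "image": "", "nr_ocr": ""}
    if isinstance(doc, dict):
        if "text" not in doc:
            raise ValueError("inline doc dict must include 'text'")
        # Optional fields default to "" here (an assumption of this sketch).
        return {
            "text": doc["text"],
            "image": doc.get("image", ""),
            "nr_ocr": doc.get("nr_ocr", ""),
        }
    raise TypeError(f"unsupported doc type: {type(doc).__name__}")
```

Rejecting unsupported types early keeps malformed records from surfacing later as confusing errors during collation.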
Transform function to convert from raw format to training format.
Args:
examples: Batch of examples with question, corpus_id, pos_doc, neg_doc
num_neg_docs: Number of negative documents to use
corpus_dict: Dictionary mapping corpus_id to corpus objects
use_dataset_instruction: Whether to use instruction from dataset’s metadata
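A rough sketch of how such a batch transform, and a factory that fixes num_neg_docs, might look. It assumes documents are already inline strings, so it ignores corpus_dict and use_dataset_instruction, and it simply takes the first positive and the first num_neg_docs negatives. The function names and the output key doc_text are hypothetical.

```python
def transform_batch(examples, num_neg_docs):
    """Hypothetical sketch: group 1 positive + num_neg_docs negatives per query."""
    questions, doc_texts = [], []
    for question, pos_docs, neg_docs in zip(
        examples["question"], examples["pos_doc"], examples["neg_doc"]
    ):
        # Positive first, then at most num_neg_docs negatives (assumed ordering).
        docs = [pos_docs[0]] + list(neg_docs[:num_neg_docs])
        questions.append(question)
        doc_texts.append(docs)
    return {"question": questions, "doc_text": doc_texts}


def create_transform(num_neg_docs):
    """Return a one-argument transform with num_neg_docs bound via a closure."""
    def transform(examples):
        return transform_batch(examples, num_neg_docs)
    return transform
```

Binding num_neg_docs in a closure yields a function that takes only the batch, which is the shape a per-batch transform hook expects.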
Flatten grouped bi-encoder output into cross-encoder format.
Takes bi-encoder-style data (queries with grouped doc lists) and flattens it so each query-doc pair becomes a separate entry. Used by cross-encoder transforms in both retrieval_dataset.py and retrieval_dataset_inline.py.
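The flattening step can be pictured with the sketch below: each query with a grouped list of documents is expanded into one entry per query-doc pair. The function name, the input keys question and doc_text, and the label column (1 for the first document, 0 for the rest) are assumptions made for illustration; the actual output schema may differ.

```python
def flatten_for_cross_encoder(batch):
    """Hypothetical sketch: expand grouped doc lists into query-doc pairs."""
    flat = {"question": [], "doc_text": [], "label": []}
    for question, docs in zip(batch["question"], batch["doc_text"]):
        for position, doc in enumerate(docs):
            # The query is repeated once per document in its group.
            flat["question"].append(question)
            flat["doc_text"].append(doc)
            # Assumption: the first document in each group is the positive.
            flat["label"].append(1 if position == 0 else 0)
    return flat
```

A batch of B queries with k documents each therefore becomes B * k pairs, which matters when sizing cross-encoder batches.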
Load retrieval datasets from JSON/JSONL files.
Copied from nemo-retriever-research/src/data/datasets.py
Returns:
Tuple of (dataset, corpus_dict)
Load and return dataset in retrieval format for encoder training.
This function loads data from JSON files and returns it ready for training. It uses set_transform() so the transform is applied lazily when examples are accessed; tokenization is handled by the collator.
Parameters:
Path(s) to JSON file(s) containing training data
“bi_encoder” (default) or “cross_encoder”
Type of data (“train” or “eval”)
Number of passages (1 positive + n-1 negatives)
Number of negative documents for evaluation
Random seed for reproducibility (for shuffling if needed)
Whether to shuffle the dataset
Maximum number of training samples to use
Offset for selecting training samples
Returns:
A HuggingFace Dataset where each example is a dict with keys: