nemo_automodel.components.datasets.llm.retrieval_dataset_inline#

Module Contents#

Functions#

_load_json_or_jsonl

Load a JSON file, falling back to JSONL (one JSON object per line).

_coerce_to_list

_normalize_inline_doc

Normalize an inline doc (text/image provided) into a canonical dict shape.

_resolve_doc_to_example

Resolve a doc reference into an example dict with keys: text, image, nr_ocr.

load_datasets

Load retrieval datasets from JSON/JSONL files.

_retrieval_transform_func

Transform function to convert from raw format to training format.

flatten_bi_encoder_to_cross_encoder

Flatten grouped bi-encoder output into cross-encoder format.

_cross_encoder_transform_func

Transform function to convert from raw format to cross-encoder training format.

_create_retrieval_transform_func

Create transform function with specified number of negative documents.

_create_cross_encoder_transform_func

Create transform function with specified number of negative documents.

make_retrieval_dataset

Load and return dataset in retrieval format for encoder training.

Data#

API#

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.INLINE_CORPUS_ID#

inline

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._load_json_or_jsonl(path: str) → Union[dict, list]#

Load a JSON file, falling back to JSONL (one JSON object per line).
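The documented fallback can be sketched as follows. This is an illustrative re-implementation, not the module's actual code:

```python
import json
from typing import Union


def load_json_or_jsonl(path: str) -> Union[dict, list]:
    """Sketch: try parsing the whole file as JSON, then fall back to JSONL."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # JSONL fallback: parse one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```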

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._coerce_to_list(value: Any) → list#

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._normalize_inline_doc(doc: Any) → Dict[str, Any]#

Normalize an inline doc (text/image provided) into a canonical dict shape.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._resolve_doc_to_example(doc: Any) → dict#

Resolve a doc reference into an example dict with keys: text, image, nr_ocr.

Supported doc forms:

  • str: interpreted as inline document text

  • dict: must include text (optionally image, nr_ocr)
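The two supported forms can be sketched like this. The default values used for missing `image` and `nr_ocr` fields are assumptions for illustration; the real implementation may differ:

```python
from typing import Any


def resolve_doc_to_example(doc: Any) -> dict:
    """Sketch of the two documented doc forms (str and dict)."""
    if isinstance(doc, str):
        # A bare string is interpreted as inline document text.
        return {"text": doc, "image": "", "nr_ocr": ""}
    if isinstance(doc, dict):
        if "text" not in doc:
            raise ValueError("inline doc dict must include 'text'")
        return {
            "text": doc["text"],
            "image": doc.get("image", ""),
            "nr_ocr": doc.get("nr_ocr", ""),
        }
    raise TypeError(f"unsupported doc type: {type(doc).__name__}")
```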

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.load_datasets(
data_dir_list: Union[List[str], str],
concatenate: bool = True,
)#

Load retrieval datasets from JSON/JSONL files.

Copied from nemo-retriever-research/src/data/datasets.py

Returns:

Tuple of (dataset, corpus_dict)
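A hypothetical JSONL record consumable by this loader might look as follows. The field names (question, pos_doc, neg_doc) are taken from the transform docs below; the exact schema may include additional fields:

```python
import json

# Hypothetical training record; field names inferred from the
# _retrieval_transform_func documentation, not a guaranteed schema.
record = {
    "question": "What is the capital of France?",
    "pos_doc": "Paris is the capital of France.",
    "neg_doc": ["Berlin is the capital of Germany.", "Rome is in Italy."],
}
line = json.dumps(record)  # one line of a JSONL training file
```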

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._retrieval_transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Transform function to convert from raw format to training format.

Parameters:
  • examples – Batch of examples with question, corpus_id, pos_doc, neg_doc

  • num_neg_docs – Number of negative documents to use

  • corpus_dict – Dictionary mapping corpus_id to corpus objects

  • use_dataset_instruction – Whether to use instruction from dataset’s metadata

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.flatten_bi_encoder_to_cross_encoder(data: dict) → dict#

Flatten grouped bi-encoder output into cross-encoder format.

Takes bi-encoder-style data (queries with grouped doc lists) and flattens it so each query-doc pair becomes a separate entry. Used by cross-encoder transforms in both retrieval_dataset.py and retrieval_dataset_inline.py.
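The flattening step can be sketched as below. The column names and the first-doc-is-positive labeling convention are assumptions for illustration, not the module's exact output schema:

```python
def flatten_bi_encoder_to_cross_encoder(data: dict) -> dict:
    """Sketch: expand each (query, [docs]) group into query-doc pair rows.

    Assumes columnar batches with 'question' and 'doc_text' keys, and that
    the first doc in each group is the positive (label 1).
    """
    out = {"question": [], "doc_text": [], "label": []}
    for question, docs in zip(data["question"], data["doc_text"]):
        for i, doc in enumerate(docs):
            out["question"].append(question)
            out["doc_text"].append(doc)
            out["label"].append(1 if i == 0 else 0)
    return out
```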

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._cross_encoder_transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Transform function to convert from raw format to cross-encoder training format.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_retrieval_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Create transform function with specified number of negative documents.
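The factory pattern here can be sketched as follows: the configuration arguments are bound up front so the result matches the single-argument callable that datasets.Dataset.set_transform() expects. The stand-in transform body is hypothetical; only the binding pattern is the point:

```python
from functools import partial


def retrieval_transform_func(examples, num_neg_docs, corpus_dict,
                             use_dataset_instruction=False):
    # Stand-in for the real transform; returns a trivial columnar batch.
    return {
        "question": examples["question"],
        "num_neg_docs": [num_neg_docs] * len(examples["question"]),
    }


def create_retrieval_transform_func(num_neg_docs, corpus_dict,
                                    use_dataset_instruction=False):
    # Bind configuration so the returned callable takes only `examples`.
    return partial(
        retrieval_transform_func,
        num_neg_docs=num_neg_docs,
        corpus_dict=corpus_dict,
        use_dataset_instruction=use_dataset_instruction,
    )
```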

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_cross_encoder_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Create transform function with specified number of negative documents.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset(
data_dir_list: Union[List[str], str],
model_type: str = 'bi_encoder',
data_type: str = 'train',
n_passages: int = 5,
eval_negative_size: int = None,
seed: int = 42,
do_shuffle: bool = False,
max_train_samples: int = None,
train_data_select_offset: int = 0,
use_dataset_instruction: bool = False,
)#

Load and return dataset in retrieval format for encoder training.

This function loads data from JSON files and returns it ready for training. It uses set_transform() for lazy evaluation; tokenization is handled by the collator.

Parameters:
  • data_dir_list – Path(s) to JSON file(s) containing training data

  • model_type – “bi_encoder” (default) or “cross_encoder”

  • data_type – Type of data (“train” or “eval”)

  • n_passages – Number of passages (1 positive + n-1 negatives)

  • eval_negative_size – Number of negative documents for evaluation

  • seed – Random seed for reproducibility (for shuffling if needed)

  • do_shuffle – Whether to shuffle the dataset

  • max_train_samples – Maximum number of training samples to use

  • train_data_select_offset – Offset for selecting training samples

  • use_dataset_instruction – Whether to use instruction from the dataset's metadata

Returns:

A HuggingFace Dataset where each example is a dict with the keys:

  • 'question': Query text

  • 'doc_text': List of document texts [positive, negatives…]

  • 'doc_image': List of images or empty strings

Note: Tokenization should be handled by a collator (e.g., BiEncoderCollator), which is more efficient for batch padding and supports dynamic processing.
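A hypothetical example mirroring the documented output keys, for a bi-encoder dataset with n_passages=3 (1 positive + 2 negatives). The document texts are made up for illustration:

```python
# Hypothetical make_retrieval_dataset example row (bi-encoder, n_passages=3).
example = {
    "question": "What is the capital of France?",
    "doc_text": [
        "Paris is the capital of France.",    # positive
        "Berlin is the capital of Germany.",  # negative
        "Rome is the capital of Italy.",      # negative
    ],
    "doc_image": ["", "", ""],  # empty strings for text-only documents
}
```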