nemo_automodel.components.datasets.llm.retrieval_dataset_inline#

Module Contents#

Functions#

_load_json_or_jsonl

Load a JSON file, falling back to JSONL (one JSON object per line).

_coerce_to_list

_normalize_inline_doc

Normalize an inline doc (text/image provided) into a canonical dict shape.

_resolve_doc_to_example

Resolve a doc reference into an example dict with keys: text, image, nr_ocr.

load_datasets

Load retrieval datasets from JSON/JSONL files.

_retrieval_transform_func

Transform function to convert from raw format to training format.

flatten_bi_encoder_to_cross_encoder

Flatten grouped bi-encoder output into cross-encoder format.

_cross_encoder_transform_func

Transform function to convert from raw format to cross-encoder training format.

_create_retrieval_transform_func

Create transform function with specified number of negative documents.

_create_cross_encoder_transform_func

Create transform function with specified number of negative documents.

make_retrieval_dataset

Load and return dataset in retrieval format for encoder training.

Data#

API#

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.INLINE_CORPUS_ID#

inline

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._load_json_or_jsonl(path: str) → Union[dict, list]#

Load a JSON file, falling back to JSONL (one JSON object per line).
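The documented fallback can be sketched as follows. This is an illustrative re-implementation, not the module's actual code:

```python
import json
from typing import Union


def load_json_or_jsonl(path: str) -> Union[dict, list]:
    """Sketch: try parsing the whole file as JSON, then fall back to JSONL."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # JSONL fallback: parse one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```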

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._coerce_to_list(value: Any) → list#

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._normalize_inline_doc(doc: Any) → Dict[str, Any]#

Normalize an inline doc (text/image provided) into a canonical dict shape.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._resolve_doc_to_example(doc: Any) → dict#

Resolve a doc reference into an example dict with keys: text, image, nr_ocr.

Supported doc forms:

  • str: interpreted as inline document text

  • dict: must include text (optionally image, nr_ocr)
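The two supported forms can be sketched like this. The default values used for missing `image` and `nr_ocr` fields are assumptions for illustration; the real implementation may differ:

```python
from typing import Any


def resolve_doc_to_example(doc: Any) -> dict:
    """Sketch of the two documented doc forms (str and dict)."""
    if isinstance(doc, str):
        # A bare string is interpreted as inline document text.
        return {"text": doc, "image": "", "nr_ocr": ""}
    if isinstance(doc, dict):
        if "text" not in doc:
            raise ValueError("inline doc dict must include 'text'")
        return {
            "text": doc["text"],
            "image": doc.get("image", ""),
            "nr_ocr": doc.get("nr_ocr", ""),
        }
    raise TypeError(f"unsupported doc type: {type(doc).__name__}")
```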

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.load_datasets(
data_dir_list: Union[List[str], str],
concatenate: bool = True,
)#

Load retrieval datasets from JSON/JSONL files.

Copied from nemo-retriever-research/src/data/datasets.py

Returns:

Tuple of (dataset, corpus_dict)
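A hypothetical JSONL record consumable by this loader might look as follows. The field names (question, pos_doc, neg_doc) are taken from the transform docs below; the exact schema may include additional fields:

```python
import json

# Hypothetical training record; field names inferred from the
# _retrieval_transform_func documentation, not a guaranteed schema.
record = {
    "question": "What is the capital of France?",
    "pos_doc": "Paris is the capital of France.",
    "neg_doc": ["Berlin is the capital of Germany.", "Rome is in Italy."],
}
line = json.dumps(record)  # one line of a JSONL training file
```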

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._retrieval_transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Transform function to convert from raw format to training format.

Parameters:
  • examples – Batch of examples with question, corpus_id, pos_doc, neg_doc

  • num_neg_docs – Number of negative documents to use

  • corpus_dict – Dictionary mapping corpus_id to corpus objects

  • use_dataset_instruction – Whether to use instruction from dataset’s metadata

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.flatten_bi_encoder_to_cross_encoder(data: dict) → dict#

Flatten grouped bi-encoder output into cross-encoder format.

Takes bi-encoder-style data (queries with grouped doc lists) and flattens it so each query-doc pair becomes a separate entry. Used by cross-encoder transforms in both retrieval_dataset.py and retrieval_dataset_inline.py.
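The flattening step can be sketched as below. The column names and the first-doc-is-positive labeling convention are assumptions for illustration, not the module's exact output schema:

```python
def flatten_bi_encoder_to_cross_encoder(data: dict) -> dict:
    """Sketch: expand each (query, [docs]) group into query-doc pair rows.

    Assumes columnar batches with 'question' and 'doc_text' keys, and that
    the first doc in each group is the positive (label 1).
    """
    out = {"question": [], "doc_text": [], "label": []}
    for question, docs in zip(data["question"], data["doc_text"]):
        for i, doc in enumerate(docs):
            out["question"].append(question)
            out["doc_text"].append(doc)
            out["label"].append(1 if i == 0 else 0)
    return out
```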

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._cross_encoder_transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Transform function to convert from raw format to cross-encoder training format.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_retrieval_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Create transform function with specified number of negative documents.
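The factory pattern here can be sketched as follows: the configuration arguments are bound up front so the result matches the single-argument callable that datasets.Dataset.set_transform() expects. The stand-in transform body is hypothetical; only the binding pattern is the point:

```python
from functools import partial


def retrieval_transform_func(examples, num_neg_docs, corpus_dict,
                             use_dataset_instruction=False):
    # Stand-in for the real transform; returns a trivial columnar batch.
    return {
        "question": examples["question"],
        "num_neg_docs": [num_neg_docs] * len(examples["question"]),
    }


def create_retrieval_transform_func(num_neg_docs, corpus_dict,
                                    use_dataset_instruction=False):
    # Bind configuration so the returned callable takes only `examples`.
    return partial(
        retrieval_transform_func,
        num_neg_docs=num_neg_docs,
        corpus_dict=corpus_dict,
        use_dataset_instruction=use_dataset_instruction,
    )
```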

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_cross_encoder_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False,
)#

Create transform function with specified number of negative documents.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset(
data_dir_list: Union[List[str], str],
model_type: str = 'bi_encoder',
data_type: str = 'train',
n_passages: int = 5,
eval_negative_size: int = None,
seed: int = 42,
do_shuffle: bool = False,
max_train_samples: int = None,
train_data_select_offset: int = 0,
use_dataset_instruction: bool = False,
)#

Load and return dataset in retrieval format for encoder training.

This function loads data from JSON files and returns it ready for training. It uses set_transform() for lazy evaluation; tokenization is handled by the collator.

Parameters:
  • data_dir_list – Path(s) to JSON file(s) containing training data

  • model_type – “bi_encoder” (default) or “cross_encoder”

  • data_type – Type of data (“train” or “eval”)

  • n_passages – Number of passages (1 positive + n-1 negatives)

  • eval_negative_size – Number of negative documents for evaluation

  • seed – Random seed for reproducibility (for shuffling if needed)

  • do_shuffle – Whether to shuffle the dataset

  • max_train_samples – Maximum number of training samples to use

  • train_data_select_offset – Offset for selecting training samples

  • use_dataset_instruction – Whether to use instruction from the dataset's metadata

Returns:

A HuggingFace Dataset where each example is a dict with the keys:

  • 'question': Query text

  • 'doc_text': List of document texts [positive, negatives…]

  • 'doc_image': List of images or empty strings

Note: Tokenization should be handled by a collator (e.g., BiEncoderCollator), which is more efficient for batch padding and supports dynamic processing.
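A hypothetical example mirroring the documented output keys, for a bi-encoder dataset with n_passages=3 (1 positive + 2 negatives). The document texts are made up for illustration:

```python
# Hypothetical make_retrieval_dataset example row (bi-encoder, n_passages=3).
example = {
    "question": "What is the capital of France?",
    "doc_text": [
        "Paris is the capital of France.",    # positive
        "Berlin is the capital of Germany.",  # negative
        "Rome is the capital of Italy.",      # negative
    ],
    "doc_image": ["", "", ""],  # empty strings for text-only documents
}
```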