nemo_automodel.components.datasets.llm.retrieval_dataset_inline

Module Contents

Functions

Name                                  Description
_coerce_to_list                       -
_create_cross_encoder_transform_func  Create transform function with specified number of negative documents.
_create_retrieval_transform_func      Create transform function with specified number of negative documents.
_cross_encoder_transform_func         Transform function to convert from raw format to cross-encoder training format.
_load_json_or_jsonl                   Load a JSON file, falling back to JSONL (one JSON object per line).
_normalize_inline_doc                 Normalize an inline doc (text/image provided) into a canonical dict shape.
_resolve_doc_to_example               Resolve a doc reference into an example dict with keys: text, image, nr_ocr.
_retrieval_transform_func             Transform function to convert from raw format to training format.
flatten_bi_encoder_to_cross_encoder   Flatten grouped bi-encoder output into cross-encoder format.
load_datasets                         Load retrieval datasets from JSON/JSONL files.
make_retrieval_dataset                Load and return dataset in retrieval format for encoder training.

Data

INLINE_CORPUS_ID

API

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._coerce_to_list(
value: typing.Any
) -> list

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_cross_encoder_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False
)

Create transform function with specified number of negative documents.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_retrieval_transform_func(
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False
)

Create transform function with specified number of negative documents.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._cross_encoder_transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False
)

Transform function to convert from raw format to cross-encoder training format.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._load_json_or_jsonl(
path: str
) -> typing.Union[dict, list]

Load a JSON file, falling back to JSONL (one JSON object per line).
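
The fallback described above can be sketched as follows. This is a minimal illustration using only the standard library, not the module's actual implementation; the name `load_json_or_jsonl_sketch` is hypothetical, and treating a `json.JSONDecodeError` as the trigger for the JSONL path is an assumption.

```python
import json
from typing import Union


def load_json_or_jsonl_sketch(path: str) -> Union[dict, list]:
    """Parse the file as one JSON document; if that fails, parse it as JSONL."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    try:
        # First attempt: the whole file is a single JSON value.
        return json.loads(text)
    except json.JSONDecodeError:
        # Assumed fallback: one JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Note that a JSONL file containing a single line is also valid JSON, so in this sketch it would come back as a dict rather than a one-element list.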

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._normalize_inline_doc(
doc: typing.Any
) -> typing.Dict[str, typing.Any]

Normalize an inline doc (text/image provided) into a canonical dict shape.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._resolve_doc_to_example(
doc: typing.Any
) -> dict

Resolve a doc reference into an example dict with keys: text, image, nr_ocr.

Supported doc forms:

  • str: interpreted as inline document text
  • dict: must include text (optionally image, nr_ocr)
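
The two supported doc forms can be illustrated with a short sketch. The function name `resolve_doc_sketch`, the use of `None` for missing optional fields, and the exception types are assumptions made for illustration; the real helper may behave differently.

```python
from typing import Any, Dict


def resolve_doc_sketch(doc: Any) -> Dict[str, Any]:
    """Return a dict with keys text, image, nr_ocr for a str or dict doc."""
    if isinstance(doc, str):
        # A bare string is interpreted as inline document text.
        return {"text": doc, "image": None, "nr_ocr": None}
    if isinstance(doc, dict):
        if "text" not in doc:
            # Assumed error handling: dict docs must include "text".
            raise ValueError("dict doc must include a 'text' key")
        return {
            "text": doc["text"],
            "image": doc.get("image"),    # optional
            "nr_ocr": doc.get("nr_ocr"),  # optional
        }
    raise TypeError(f"unsupported doc type: {type(doc).__name__}")
```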

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._retrieval_transform_func(
examples,
num_neg_docs,
corpus_dict,
use_dataset_instruction: bool = False
)

Transform function to convert from raw format to training format.

Args:

  • examples: Batch of examples with question, corpus_id, pos_doc, neg_doc
  • num_neg_docs: Number of negative documents to use
  • corpus_dict: Dictionary mapping corpus_id to corpus objects
  • use_dataset_instruction: Whether to use instruction from dataset’s metadata
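
As a rough illustration of what a batch transform of this kind might do, the sketch below pairs each question with one positive document and up to `num_neg_docs` negatives. It is not the module's implementation: the function name, the output keys `question` and `doc_text`, the choice of the first positive, and taking negatives in order are all assumptions, and the handling of `corpus_id`, `corpus_dict`, and `use_dataset_instruction` is left out.

```python
from typing import Any, Dict, List


def retrieval_transform_sketch(
    examples: Dict[str, List[Any]], num_neg_docs: int
) -> Dict[str, List[Any]]:
    """Turn a batch (dict of lists) into question / document-list pairs."""
    questions: List[Any] = []
    doc_texts: List[List[Any]] = []
    for question, pos_docs, neg_docs in zip(
        examples["question"], examples["pos_doc"], examples["neg_doc"]
    ):
        positive = pos_docs[0]               # assumption: use the first positive
        negatives = neg_docs[:num_neg_docs]  # assumption: take negatives in order
        questions.append(question)
        doc_texts.append([positive] + negatives)
    # Hypothetical output keys, chosen for this example only.
    return {"question": questions, "doc_text": doc_texts}
```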

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.flatten_bi_encoder_to_cross_encoder(
data: dict
) -> dict

Flatten grouped bi-encoder output into cross-encoder format.

Takes bi-encoder-style data (queries with grouped doc lists) and flattens it so each query-doc pair becomes a separate entry. Used by cross-encoder transforms in both retrieval_dataset.py and retrieval_dataset_inline.py.
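
The flattening step can be sketched like this. The key names `question` and `doc_text` are hypothetical stand-ins for whatever the grouped bi-encoder output actually contains; the sketch only shows how a query with a list of documents becomes one entry per query-doc pair.

```python
from typing import Any, Dict, List


def flatten_sketch(data: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Expand each (query, [doc, ...]) group into separate (query, doc) pairs."""
    flat: Dict[str, List[Any]] = {"question": [], "doc_text": []}
    for question, docs in zip(data["question"], data["doc_text"]):
        for doc in docs:
            # Repeat the query once per document in its group.
            flat["question"].append(question)
            flat["doc_text"].append(doc)
    return flat
```

With two queries that each carry two documents, the result holds four pairs, with every query repeated once per document.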

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.load_datasets(
data_dir_list: typing.Union[typing.List[str], str],
concatenate: bool = True
)

Load retrieval datasets from JSON/JSONL files.

Copied from nemo-retriever-research/src/data/datasets.py

Returns:

Tuple of (dataset, corpus_dict)

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset(
data_dir_list: typing.Union[typing.List[str], str],
model_type: str = 'bi_encoder',
data_type: str = 'train',
n_passages: int = 5,
eval_negative_size: int = None,
seed: int = 42,
do_shuffle: bool = False,
max_train_samples: int = None,
train_data_select_offset: int = 0,
use_dataset_instruction: bool = False
)

Load and return dataset in retrieval format for encoder training.

This function loads data from JSON files and returns a dataset that is ready for training. It uses set_transform() for lazy evaluation, so examples are transformed when they are accessed; tokenization is handled by the collator.
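
The lazy-evaluation idea can be shown without any external library. The class below is a hypothetical pure-Python stand-in, not the HuggingFace Dataset API: it stores raw rows and applies the transform only when an item is accessed. The row layout and the `docs` output key are assumptions; the transform keeps 1 positive plus `n_passages - 1` negatives, matching the parameter description.

```python
from typing import Any, Callable, Dict, List


class LazyTransformDataset:
    """Applies `transform` on access instead of at load time."""

    def __init__(self, rows: List[Dict[str, Any]],
                 transform: Callable[[Dict[str, Any]], Dict[str, Any]]):
        self.rows = rows
        self.transform = transform
        self.calls = 0  # counts how many times the transform actually ran

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        self.calls += 1
        return self.transform(self.rows[idx])


def make_transform(n_passages: int) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Build a transform keeping 1 positive and n_passages - 1 negatives."""
    def transform(row: Dict[str, Any]) -> Dict[str, Any]:
        negatives = row["neg_doc"][: n_passages - 1]
        return {"question": row["question"], "docs": [row["pos_doc"]] + negatives}
    return transform


rows = [{"question": "q", "pos_doc": "p", "neg_doc": ["n1", "n2", "n3"]}]
dataset = LazyTransformDataset(rows, make_transform(n_passages=3))
```

Nothing is transformed when `dataset` is constructed; the work happens on `dataset[0]`, which is why raw text can be kept untokenized until the collator sees it.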

Parameters:

data_dir_list
Union[List[str], str]

Path(s) to JSON file(s) containing training data

model_type
str. Defaults to 'bi_encoder'

Model type: “bi_encoder” (default) or “cross_encoder”

data_type
str. Defaults to 'train'

Type of data (“train” or “eval”)

n_passages
int. Defaults to 5

Number of passages per query (1 positive + n_passages - 1 negatives)

eval_negative_size
int. Defaults to None

Number of negative documents for evaluation

seed
int. Defaults to 42

Random seed for reproducibility (for shuffling if needed)

do_shuffle
bool. Defaults to False

Whether to shuffle the dataset

max_train_samples
int. Defaults to None

Maximum number of training samples to use

train_data_select_offset
int. Defaults to 0

Offset for selecting training samples

use_dataset_instruction
bool. Defaults to False

Whether to use instruction from dataset’s metadata

Returns:

A HuggingFace Dataset where each example is a dict with keys:

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.INLINE_CORPUS_ID = '__inline__'