nemo_automodel.components.datasets.llm.retrieval_dataset_inline

Module Contents

Functions

Name	Description
`_coerce_to_list`	-
`_create_cross_encoder_transform_func`	Create transform function with specified number of negative documents.
`_create_retrieval_transform_func`	Create transform function with specified number of negative documents.
`_cross_encoder_transform_func`	Transform function to convert from raw format to cross-encoder training format.
`_load_json_or_jsonl`	Load a JSON file, falling back to JSONL (one JSON object per line).
`_normalize_inline_doc`	Normalize an inline doc (text/image provided) into a canonical dict shape.
`_resolve_doc_to_example`	Resolve a doc reference into an example dict with keys: text, image, nr_ocr.
`_retrieval_transform_func`	Transform function to convert from raw format to training format.
`flatten_bi_encoder_to_cross_encoder`	Flatten grouped bi-encoder output into cross-encoder format.
`load_datasets`	Load retrieval datasets from JSON/JSONL files.
`make_retrieval_dataset`	Load and return dataset in retrieval format for encoder training.

Data

INLINE_CORPUS_ID

API

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._coerce_to_list(
    value: typing.Any
) -> list

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_cross_encoder_transform_func(
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False
)

Create transform function with specified number of negative documents.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._create_retrieval_transform_func(
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False
)

Create transform function with specified number of negative documents.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._cross_encoder_transform_func(
    examples,
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False
)

Transform function to convert from raw format to cross-encoder training format.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._load_json_or_jsonl(
    path: str
) -> typing.Union[dict, list]

Load a JSON file, falling back to JSONL (one JSON object per line).
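
The fallback described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the name `load_json_or_jsonl_sketch` is hypothetical, and the choice to skip blank lines in JSONL input is an assumption.

```python
import json
from typing import Union


def load_json_or_jsonl_sketch(path: str) -> Union[dict, list]:
    """Hypothetical stand-in: parse as JSON, else as one JSON object per line."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    try:
        # First attempt: the whole file is a single JSON document.
        return json.loads(text)
    except json.JSONDecodeError:
        # Fallback: treat each non-empty line as its own JSON object.
        # Skipping blank lines is an assumption made for this sketch.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Note that under this approach a JSONL file containing exactly one line parses successfully as plain JSON, so the result is a single object rather than a one-element list.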

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._normalize_inline_doc(
    doc: typing.Any
) -> typing.Dict[str, typing.Any]

Normalize an inline doc (text/image provided) into a canonical dict shape.

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._resolve_doc_to_example(
    doc: typing.Any
) -> dict

Resolve a doc reference into an example dict with keys: text, image, nr_ocr.

Supported doc forms:

str: interpreted as inline document text
dict: must include text (optionally image, nr_ocr)
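
A rough sketch of how the two supported doc forms could map onto the `text`, `image`, `nr_ocr` shape. The function name `resolve_doc_sketch`, the empty-string defaults for missing optional keys, and the exception types are all assumptions for illustration; the module's real behavior may differ.

```python
from typing import Any, Dict


def resolve_doc_sketch(doc: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the str/dict handling described above."""
    if isinstance(doc, str):
        # A bare string is interpreted as the inline document text.
        return {"text": doc, "image": "", "nr_ocr": ""}
    if isinstance(doc, dict):
        if "text" not in doc:
            # Assumed error type; the source only says 'text' is required.
            raise ValueError("dict doc must include a 'text' key")
        # Optional fields fall back to empty strings (an assumed default).
        return {
            "text": doc["text"],
            "image": doc.get("image", ""),
            "nr_ocr": doc.get("nr_ocr", ""),
        }
    raise TypeError(f"unsupported doc type: {type(doc).__name__}")
```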

nemo_automodel.components.datasets.llm.retrieval_dataset_inline._retrieval_transform_func(
    examples,
    num_neg_docs,
    corpus_dict,
    use_dataset_instruction: bool = False
)

Transform function to convert from raw format to training format.

Args:

examples: Batch of examples with question, corpus_id, pos_doc, neg_doc
num_neg_docs: Number of negative documents to use
corpus_dict: Dictionary mapping corpus_id to corpus objects
use_dataset_instruction: Whether to use instruction from dataset’s metadata

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.flatten_bi_encoder_to_cross_encoder(
    data: dict
) -> dict

Flatten grouped bi-encoder output into cross-encoder format.

Takes bi-encoder-style data (queries with grouped doc lists) and flattens it so each query-doc pair becomes a separate entry. Used by cross-encoder transforms in both retrieval_dataset.py and retrieval_dataset_inline.py.
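
To make the flattening concrete, here is a small sketch of the idea. The key names `question` and `doc_text` and the function name `flatten_sketch` are assumptions chosen for illustration; the actual function may use different keys and carry additional fields.

```python
from typing import Any, Dict, List


def flatten_sketch(data: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Hypothetical illustration: one output row per (query, doc) pair."""
    questions: List[Any] = []
    docs: List[Any] = []
    # Each query comes with a group of docs; repeat the query once per doc.
    for question, doc_group in zip(data["question"], data["doc_text"]):
        for doc in doc_group:
            questions.append(question)
            docs.append(doc)
    return {"question": questions, "doc_text": docs}
```

With two queries that each have two grouped docs, the sketch yields four rows, with each query repeated alongside each of its docs.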

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.load_datasets(
    data_dir_list: typing.Union[typing.List[str], str],
    concatenate: bool = True
)

Load retrieval datasets from JSON/JSONL files.

Copied from nemo-retriever-research/src/data/datasets.py

Returns:

Tuple of (dataset, corpus_dict)

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.make_retrieval_dataset(
    data_dir_list: typing.Union[typing.List[str], str],
    model_type: str = 'bi_encoder',
    data_type: str = 'train',
    n_passages: int = 5,
    eval_negative_size: int = None,
    seed: int = 42,
    do_shuffle: bool = False,
    max_train_samples: int = None,
    train_data_select_offset: int = 0,
    use_dataset_instruction: bool = False
)

Load and return dataset in retrieval format for encoder training.

This function loads data from JSON files and returns it ready for training. It uses set_transform() for lazy evaluation; tokenization is handled by the collator.

Parameters:

data_dir_list

Union[List[str], str]

Path(s) to JSON file(s) containing training data

model_type

str. Defaults to 'bi_encoder'

'bi_encoder' (default) or 'cross_encoder'

data_type

str. Defaults to 'train'

Type of data ('train' or 'eval')

n_passages

int. Defaults to 5

Number of passages (1 positive + n-1 negatives)

eval_negative_size

int. Defaults to None

Number of negative documents for evaluation

seed

int. Defaults to 42

Random seed for reproducibility (for shuffling if needed)

do_shuffle

bool. Defaults to False

Whether to shuffle the dataset

max_train_samples

int. Defaults to None

Maximum number of training samples to use

train_data_select_offset

int. Defaults to 0

Offset for selecting training samples
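
The `n_passages` arithmetic (1 positive + n-1 negatives) can be illustrated with a short sketch. The helper `select_passages_sketch` is hypothetical, and both placing the positive first and taking the leading negatives are assumptions; the library's own selection logic is not documented here.

```python
from typing import List


def select_passages_sketch(pos_doc: str, neg_docs: List[str], n_passages: int = 5) -> List[str]:
    """Hypothetical illustration of '1 positive + n-1 negatives'."""
    if n_passages < 1:
        raise ValueError("n_passages must be at least 1")
    # One slot is reserved for the positive; the rest go to negatives.
    num_neg_docs = n_passages - 1
    # Taking the first negatives in order is an assumption of this sketch.
    return [pos_doc] + neg_docs[:num_neg_docs]
```

With the default `n_passages=5`, a query with one positive and six candidate negatives would keep the positive and four of the negatives.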

Returns:

A HuggingFace Dataset where each example is a dict with keys:

nemo_automodel.components.datasets.llm.retrieval_dataset_inline.INLINE_CORPUS_ID = '__inline__'