nemo_automodel.components.datasets.llm.retrieval_collator


Module Contents

Classes

Name | Description
BiEncoderCollator | Collator for encoder retrieval training.
CrossEncoderCollator | Collate query-document pairs for cross-encoder reranking.

Functions

Name | Description
_doc_id_str_to_int64 | Stable 63-bit int for corpus doc id strings (for in-batch duplicate masking).
_unpack_doc_values | Unpack document lists into individual examples.

API

class nemo_automodel.components.datasets.llm.retrieval_collator.BiEncoderCollator(
tokenizer: transformers.PreTrainedTokenizerBase,
q_max_len: int = 512,
p_max_len: int = 512,
query_prefix: str = '',
passage_prefix: str = '',
padding: typing.Union[bool, str, transformers.file_utils.PaddingStrategy] = True,
pad_to_multiple_of: int = None,
use_dataset_instruction: bool = False
)

Collator for encoder retrieval training.

This collator tokenizes queries and documents at batch time. Compared with pre-tokenizing the dataset, this uses less memory and allows dynamic padding, where each batch is padded only to its own maximum sequence length.

Based on EncoderCollator from nemo-retriever-research but adapted for Automodel.
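To illustrate what dynamic padding means here, the following is a minimal pure-Python sketch, not the collator's actual code. The helper name `pad_batch`, the right-padding choice, and the `pad_token_id` default are assumptions made for illustration; in practice the tokenizer performs this step. It pads already-tokenized sequences to the longest sequence in the batch, optionally rounded up to a multiple, as `pad_to_multiple_of` suggests.

```python
from typing import Dict, List, Optional


def pad_batch(
    sequences: List[List[int]],
    pad_token_id: int = 0,  # assumed pad id, for illustration only
    pad_to_multiple_of: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Right-pad token-id sequences to the longest sequence in this batch."""
    target_len = max(len(seq) for seq in sequences)
    if pad_to_multiple_of:
        # Round the target length up to the next multiple (ceiling division).
        target_len = -(-target_len // pad_to_multiple_of) * pad_to_multiple_of

    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of length 4 and 3, padded to the next multiple of 8.
padded = pad_batch([[101, 7, 8, 102], [101, 9, 102]], pad_to_multiple_of=8)
```

Because the target length is computed per batch, a batch of short queries is not padded out to `q_max_len`; the maximum lengths only cap truncation.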

nemo_automodel.components.datasets.llm.retrieval_collator.BiEncoderCollator.__call__(
batch: typing.List[typing.Dict[str, typing.Any]]
) -> typing.Dict[str, torch.Tensor]

Collate a batch of examples.

Parameters:

batch (List[Dict[str, Any]]): List of examples, each with ‘question’, ‘doc_text’, and ‘doc_image’ keys.

Returns: Dict[str, torch.Tensor]

Dictionary with:

nemo_automodel.components.datasets.llm.retrieval_collator.BiEncoderCollator._convert_dict_to_list(
input_dict: typing.Dict[str, typing.List]
) -> typing.List[typing.Dict[str, typing.Any]]

Convert dictionary of lists to list of dictionaries.
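The conversion described above can be sketched as follows. This is an illustrative stand-in, not the method's source: the function name `convert_dict_to_list` and the assumption that all lists have equal length are mine.

```python
from typing import Any, Dict, List


def convert_dict_to_list(input_dict: Dict[str, List]) -> List[Dict[str, Any]]:
    """Turn {"a": [1, 2], "b": [3, 4]} into [{"a": 1, "b": 3}, {"a": 2, "b": 4}]."""
    keys = list(input_dict.keys())
    if not keys:
        return []
    # Assumes every list has the same length as the first one.
    n_items = len(input_dict[keys[0]])
    return [{key: input_dict[key][i] for key in keys} for i in range(n_items)]


columns = {"input_ids": [[101, 7, 102], [101, 9, 102]], "label": [1, 0]}
rows = convert_dict_to_list(columns)
```

A list of per-example dictionaries is the shape that padding utilities such as a tokenizer's `pad` method typically accept.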

nemo_automodel.components.datasets.llm.retrieval_collator.BiEncoderCollator._merge_batch_dict(
query_batch_dict: typing.Dict[str, typing.List],
doc_batch_dict: typing.Dict[str, typing.List],
train_n_passages: int
) -> typing.Dict[str, typing.List]

Merge query and document batches into a single dictionary.

Adapted from nemo-retriever-research/src/loaders/loader_utils.py
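The reference above does not spell out the merged layout, so the sketch below shows only one plausible shape. The `q_` and `d_` key prefixes and the grouping of the flat document lists into chunks of `train_n_passages` per query are assumptions, as is the function name `merge_batch_dict`.

```python
from typing import Dict, List


def merge_batch_dict(
    query_batch_dict: Dict[str, List],
    doc_batch_dict: Dict[str, List],
    train_n_passages: int,
) -> Dict[str, List]:
    """Combine query and document fields under prefixed keys (assumed layout)."""
    # Query fields are kept as-is under a "q_" prefix (assumed naming).
    merged = {f"q_{key}": values for key, values in query_batch_dict.items()}
    for key, values in doc_batch_dict.items():
        # Group the flat document list so each query owns train_n_passages entries.
        merged[f"d_{key}"] = [
            values[start:start + train_n_passages]
            for start in range(0, len(values), train_n_passages)
        ]
    return merged


queries = {"input_ids": [[1, 2], [3, 4]]}
docs = {"input_ids": [[10], [11], [20], [21]]}  # 2 queries x 2 passages, flattened
merged = merge_batch_dict(queries, docs, train_n_passages=2)
```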

class nemo_automodel.components.datasets.llm.retrieval_collator.CrossEncoderCollator(
rerank_max_length: int,
args = (),
prompt_template: str = 'question:{query} \n \n pas...,
kwargs = {}
)

Bases: DataCollatorWithPadding

Collate query-document pairs for cross-encoder reranking.

nemo_automodel.components.datasets.llm.retrieval_collator.CrossEncoderCollator.__call__(
features: typing.List[typing.Dict[str, typing.Any]]
) -> transformers.BatchEncoding

nemo_automodel.components.datasets.llm.retrieval_collator._doc_id_str_to_int64(
doc_id: str
) -> int

Stable 63-bit int for corpus doc id strings (for in-batch duplicate masking).
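One way to obtain a stable 63-bit integer from a string is sketched below. The hash function used by the library is not stated here, so the use of SHA-256 and the name `doc_id_to_int63` are assumptions. The point of the sketch is that Python's built-in `hash()` on strings is randomized per process, whereas a cryptographic digest is reproducible across processes and workers, and masking to 63 bits keeps the value non-negative and within a signed 64-bit tensor.

```python
import hashlib


def doc_id_to_int63(doc_id: str) -> int:
    """Map a doc id string to a deterministic integer in [0, 2**63)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Take the first 8 bytes and clear the top bit so the value fits int64.
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF


# Equal doc ids map to equal integers, so repeated documents in a batch
# can be detected by comparing integers instead of strings.
doc_ids = ["doc-17", "doc-42", "doc-17"]
int_ids = [doc_id_to_int63(d) for d in doc_ids]
same_as_first = [value == int_ids[0] for value in int_ids]
```

Integer ids can be stacked into a tensor and compared pairwise, which is the usual way to mask duplicate documents that would otherwise be treated as in-batch negatives.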

nemo_automodel.components.datasets.llm.retrieval_collator._unpack_doc_values(
features: typing.List[typing.Dict[str, typing.Any]]
) -> typing.List[typing.Dict[str, typing.Any]]

Unpack document lists into individual examples.
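A minimal sketch of this unpacking, under the assumption that each feature maps every key to a list with one entry per document; the name `unpack_doc_values` is illustrative and the code is not taken from the module.

```python
from typing import Any, Dict, List


def unpack_doc_values(features: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten per-query document lists into one example per document."""
    doc_examples: List[Dict[str, Any]] = []
    for feature in features:
        keys = list(feature.keys())
        if not keys:
            continue
        # Assumes all lists in one feature have the same length.
        n_docs = len(feature[keys[0]])
        for idx in range(n_docs):
            doc_examples.append({key: feature[key][idx] for key in keys})
    return doc_examples


features = [
    {"input_ids": [[1, 2], [3]], "doc_id": ["a", "b"]},  # query 0: two docs
    {"input_ids": [[4]], "doc_id": ["c"]},               # query 1: one doc
]
flat_docs = unpack_doc_values(features)
```

The output preserves query order, so documents belonging to the same query stay adjacent in the flattened list.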