nemo_automodel.components.datasets.llm.retrieval_collator#

Module Contents#

Classes#

RetrievalBiencoderCollator

Collator for biencoder retrieval training.

Functions#

_unpack_doc_values

Unpack document lists into individual examples.

API#

nemo_automodel.components.datasets.llm.retrieval_collator._unpack_doc_values(
features: List[Dict[str, Any]],
) → List[Dict[str, Any]]#

Unpack document lists into individual examples.

Example

Input: [{'input_ids': [[1, 2], [3, 4]], 'attention_mask': [[1, 1], [1, 1]]}]

Output: [{'input_ids': [1, 2], 'attention_mask': [1, 1]}, {'input_ids': [3, 4], 'attention_mask': [1, 1]}]
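The transformation above can be sketched in plain Python. This is an illustrative stand-in for the private helper, not the module's actual implementation:

```python
from typing import Any, Dict, List


def unpack_doc_values(features: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten per-document value lists into one example per document."""
    unpacked: List[Dict[str, Any]] = []
    for feature in features:
        keys = list(feature.keys())
        # Every key is assumed to hold a list with one entry per document.
        for i in range(len(feature[keys[0]])):
            unpacked.append({k: feature[k][i] for k in keys})
    return unpacked
```
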

class nemo_automodel.components.datasets.llm.retrieval_collator.RetrievalBiencoderCollator(
tokenizer: transformers.PreTrainedTokenizerBase,
q_max_len: int = 512,
p_max_len: int = 512,
query_prefix: str = '',
passage_prefix: str = '',
padding: Union[bool, str, transformers.file_utils.PaddingStrategy] = True,
pad_to_multiple_of: Optional[int] = None,
)#

Collator for biencoder retrieval training.

This collator handles tokenization of queries and documents at batch time, which is more memory-efficient than pre-tokenization and allows for dynamic padding based on batch max length.

Based on BiencoderCollator from nemo-retriever-research but adapted for Automodel.

Initialization

Initialize the collator.

Parameters:
  • tokenizer – Tokenizer to use for encoding

  • q_max_len – Maximum length for queries

  • p_max_len – Maximum length for passages

  • query_prefix – Prefix to add to queries (e.g., "query: ")

  • passage_prefix – Prefix to add to passages (e.g., "passage: ")

  • padding – Padding strategy ("longest", "max_length", or "do_not_pad")

  • pad_to_multiple_of – Pad to multiple of this value (e.g., 8 for FP16)

__call__(
batch: List[Dict[str, Any]],
) → Dict[str, torch.Tensor]#

Collate a batch of examples.

Parameters:

batch – List of examples, each with 'question', 'doc_text', 'doc_image' keys

Returns:

  • q_input_ids: Query input IDs [batch_size, q_seq_len]

  • q_attention_mask: Query attention mask [batch_size, q_seq_len]

  • d_input_ids: Document input IDs [batch_size * num_docs, d_seq_len]

  • d_attention_mask: Document attention mask [batch_size * num_docs, d_seq_len]

  • labels: Dummy labels for compatibility [batch_size]

Return type:

Dict[str, torch.Tensor] containing the keys listed above

_merge_batch_dict(
query_batch_dict: Dict[str, List],
doc_batch_dict: Dict[str, List],
train_n_passages: int,
) → Dict[str, List]#

Merge query and document batches into a single dictionary.

Adapted from nemo-retriever-research/src/loaders/loader_utils.py
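A hedged sketch of the merge, assuming (from the q_/d_ key names in __call__'s return value) that query keys are prefixed with "q_" and document keys with "d_", and that train_n_passages is used only to sanity-check that each query contributes that many passages; the actual adapted code may differ:

```python
from typing import Dict, List


def merge_batch_dict(
    query_batch_dict: Dict[str, List],
    doc_batch_dict: Dict[str, List],
    train_n_passages: int,
) -> Dict[str, List]:
    """Combine query and document batches under q_/d_ prefixed keys."""
    merged = {f"q_{k}": v for k, v in query_batch_dict.items()}
    for k, v in doc_batch_dict.items():
        if k in query_batch_dict:
            # Each query is expected to pair with train_n_passages documents.
            assert len(v) == len(query_batch_dict[k]) * train_n_passages
        merged[f"d_{k}"] = v
    return merged
```
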

_convert_dict_to_list(
input_dict: Dict[str, List],
) → List[Dict[str, Any]]#

Convert dictionary of lists to list of dictionaries.

Example

Input: {'a': [1, 2], 'b': [3, 4]}

Output: [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
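This conversion is a standard dict-of-lists to list-of-dicts transpose; a minimal sketch (illustrative, not the module's private implementation):

```python
from typing import Any, Dict, List


def convert_dict_to_list(input_dict: Dict[str, List]) -> List[Dict[str, Any]]:
    """Transpose a dictionary of equal-length lists into a list of dicts."""
    keys = list(input_dict.keys())
    n = len(input_dict[keys[0]])  # all lists are assumed to share one length
    return [{k: input_dict[k][i] for k in keys} for i in range(n)]
```
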